GitHub user aborsu985 opened a pull request:
https://github.com/apache/spark/pull/4504
RegEx Tokenizer for MLlib

Added a regex-based tokenizer for MLlib. Currently the regex is fixed, but
if I could add a regex parameter to the paramMap, changing the tokenizer
regex could become a parameter tuned during cross-validation. I also wonder
what the best way to add a stop-word list would be.
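The behaviour described above (regex-based token extraction, optional lowercasing, a minimum token length, and stop-word exclusion) can be sketched in plain Scala. All names below are hypothetical illustrations, not the actual class or parameters added by this PR:

```scala
// Minimal sketch of a configurable regex tokenizer, assuming the behaviour
// described in the pull request. Names and defaults are hypothetical.
object RegexTokenizerSketch {
  def tokenize(text: String,
               pattern: String = "\\p{L}+|[^\\p{L}\\s]+", // words or punctuation runs
               lowercase: Boolean = true,
               minTokenLength: Int = 1,
               stopWords: Set[String] = Set.empty): List[String] = {
    // Extract every match of the token pattern.
    val raw = pattern.r.findAllIn(text).toList
    // Optionally lowercase before filtering, so stop words match case-insensitively.
    val cased = if (lowercase) raw.map(_.toLowerCase) else raw
    // Drop tokens that are too short or in the stop-word list.
    cased.filter(t => t.length >= minTokenLength && !stopWords.contains(t))
  }

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("The quick, brown fox!",
                          minTokenLength = 2,
                          stopWords = Set("the"))
    println(tokens.mkString(" ")) // quick brown fox
  }
}
```

Filtering after lowercasing keeps the stop-word comparison simple; a real implementation would expose each of these knobs as a pipeline parameter.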
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aborsu985/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4504.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4504
----
commit 01cd26f856d7236035faf0c42f1f8f01ebbb2ce7
Author: Augustin Borsu <[email protected]>
Date: 2015-02-10T09:52:47Z
RegExTokenizer
A more complex tokenizer that extracts tokens based on a regex. It also
allows turning lowercasing on and off, setting a minimum token length, and
supplying a list of stop words to exclude.
commit 9547e9df7f64c74f33526b26b92f6f1ef841ae3c
Author: Augustin Borsu <[email protected]>
Date: 2015-02-10T10:39:39Z
RegEx Tokenizer
A more complex tokenizer that extracts tokens based on a regex. It also
allows turning lowercasing on and off, setting a minimum token length, and
supplying a list of stop words to exclude.
----
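The PR description suggests treating the tokenizer regex itself as a tunable parameter during cross-validation. The following is a hypothetical plain-Scala illustration of that idea, not the Spark CrossValidator API: each candidate pattern is scored under a toy criterion (token count on a sample string) and the best one is kept.

```scala
// Hypothetical sketch: selecting a tokenizer regex from a candidate grid,
// in the spirit of tuning it via cross-validation. Plain Scala, toy scoring.
object PatternGridSketch {
  // Candidate token patterns one might place in a parameter grid.
  val candidates: Seq[String] = Seq("\\p{L}+", "\\w+", "[a-z]+")

  // Toy score: number of tokens the pattern extracts from the text.
  def score(pattern: String, text: String): Int =
    pattern.r.findAllIn(text).length

  def main(args: Array[String]): Unit = {
    val text = "Spark 2.0 adds ML pipelines"
    val best = candidates.maxBy(score(_, text))
    println(best) // \w+  (it also captures the digit tokens)
  }
}
```

A real setup would score each pattern by downstream model performance on held-out folds rather than by raw token count.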