Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-78709488
@aborsu985 Sorry for the delay! At a high level, I'm a little concerned
about exposing too many parameters in the first version. NLTK's regex tokenizer
(http://www.nltk.org/_modules/nltk/tokenize/regexp.html) has only three
parameters: `pattern`, `gaps`, and `discard_empty`. There are trade-offs
between a full-featured implementation and its maintenance cost. For
example, if we put `lowerCase` as a parameter here, what happens when a user
just wants to convert text to lowercase? Should there be a separate
`LowerCase` transformer? Then we would have overlapping functionality.
Maybe we can remove `lowerCase` from this PR.
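For reference, NLTK sidesteps this by not having a lowercase option at all;
lowercasing composes as an ordinary preprocessing step before tokenization.
A minimal sketch (the input string is just an illustration):

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

# Lowercasing happens outside the tokenizer, so the tokenizer
# itself needs no lowerCase parameter.
tokens = tokenizer.tokenize("Hello World".lower())  # ['hello', 'world']
```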
For the other parameters, `minTokenLength` looks good to me; it is a
superset of NLTK's `discard_empty`. The name `matching` doesn't tell you what
happens when it is `false`. I don't think NLTK's `gaps` does a better job
either, but given that there are many existing NLTK users, `gaps` may be the
better choice. Feel free to propose names used in popular NLP packages.
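As a reference point for the naming discussion, here is a minimal sketch of
what NLTK's `gaps` flag does (patterns and input are illustrative only):

```python
from nltk.tokenize import RegexpTokenizer

# gaps=False (default): the pattern describes the tokens themselves,
# like re.findall().
word_tok = RegexpTokenizer(r'\w+')
word_tok.tokenize("Don't panic!")  # ['Don', 't', 'panic']

# gaps=True: the pattern describes the separators between tokens,
# like re.split(); empty strings are dropped when discard_empty=True.
gap_tok = RegexpTokenizer(r'\s+', gaps=True)
gap_tok.tokenize("Don't panic!")  # ["Don't", 'panic!']
```

Presumably `gaps=True` would correspond to `matching=false` in the current PR.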