Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-78709488
@aborsu985 Sorry for the delay! At a high level, I'm a little concerned
about exposing too many parameters in the first version. NLTK's regex tokenizer
(http://www.nltk.org/_modules/nltk/tokenize/regexp.html) has only three
parameters: `pattern`, `gaps`, and `discard_empty`. There are trade-offs
between a full-featured implementation and its maintenance cost. For
example, if we put `lowerCase` as a parameter here, what happens when a user
just wants to convert text to lowercase? Should there be a separate
`LowerCase` transformer? Then we would have overlapping functionality.
Maybe we can remove `lowerCase` from this PR.
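For reference, NLTK sidesteps this by not having a lowercase option at all;
lowercasing composes as an ordinary preprocessing step before tokenization.
A minimal sketch (the input string is just an illustration):

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

# Lowercasing happens outside the tokenizer, so the tokenizer
# itself needs no lowerCase parameter.
tokens = tokenizer.tokenize("Hello World".lower())  # ['hello', 'world']
```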
For the other parameters, `minTokenLength` looks good to me; it is a
superset of NLTK's `discard_empty`. The name `matching` doesn't tell you what
happens when it is `false`. I don't think NLTK's `gaps` does a better job
either, but given that there are many existing NLTK users, `gaps` may be the
better choice. Feel free to propose names used in popular NLP packages.
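As a reference point for the naming discussion, here is a minimal sketch of
what NLTK's `gaps` flag does (patterns and input are illustrative only):

```python
from nltk.tokenize import RegexpTokenizer

# gaps=False (default): the pattern describes the tokens themselves,
# like re.findall().
word_tok = RegexpTokenizer(r'\w+')
word_tok.tokenize("Don't panic!")  # ['Don', 't', 'panic']

# gaps=True: the pattern describes the separators between tokens,
# like re.split(); empty strings are dropped when discard_empty=True.
gap_tok = RegexpTokenizer(r'\s+', gaps=True)
gap_tok.tokenize("Don't panic!")  # ["Don't", 'panic!']
```

Presumably `gaps=True` would correspond to `matching=false` in the current PR.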