[
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954185#comment-14954185
]
yuhao yang commented on SPARK-11069:
------------------------------------
I'll try to implement this and test it with several cases. I'll post updates
here if anything unexpected comes up.
> Add RegexTokenizer option to convert to lowercase
> -------------------------------------------------
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer
> does not. It would be nice to add an option to RegexTokenizer to convert to
> lowercase. Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat
> upper/lower case differently.
> --> I'd vote for conversion before matching. If a user needs full control,
> they can convert to lowercase manually.
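
To make the before/after question in the quoted description concrete, here is a
minimal sketch in plain Python. It is not Spark code: `tokenize` and its
`lowercase_first` switch are hypothetical names used only for illustration, and
`to_lowercase` merely mirrors the proposed "toLowercase" Param. It assumes the
tokenizer splits the input on regex matches (gaps).

```python
import re


def tokenize(text, pattern=r"\s+", to_lowercase=False, lowercase_first=True):
    """Hypothetical stand-in for a regex tokenizer that splits on `pattern`.

    to_lowercase    -- mirrors the proposed option; False keeps current behavior.
    lowercase_first -- illustration only: True lowercases the input before
                       regex matching, False lowercases the tokens afterwards.
    """
    if to_lowercase and lowercase_first:
        # "Before": the regex only ever sees lowercase text.
        text = text.lower()
    # Split on the pattern and drop empty tokens.
    tokens = [t for t in re.split(pattern, text) if t]
    if to_lowercase and not lowercase_first:
        # "After": the regex sees the original case; tokens are lowercased last.
        tokens = [t.lower() for t in tokens]
    return tokens


# A case-sensitive pattern: split on runs of uppercase letters.
case_sensitive_pattern = r"[A-Z]+"
sample = "fooXbarYbaz"

before = tokenize(sample, case_sensitive_pattern, to_lowercase=True, lowercase_first=True)
after = tokenize(sample, case_sensitive_pattern, to_lowercase=True, lowercase_first=False)
unchanged = tokenize("Hello World")  # default: no lowercasing

print(before)     # ['fooxbarybaz']
print(after)      # ['foo', 'bar', 'baz']
print(unchanged)  # ['Hello', 'World']
```

With a case-sensitive pattern the two orderings give different tokens, which is
the "full control" trade-off described above. For patterns that do not depend
on case, such as whitespace, both orderings agree.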