GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75352674
+1 for having the feature assembler, or some other algorithm, handle munging
and indexing as needed.
* Note that the behavior of the assembler may depend on the algorithm being
used. E.g., an assembler may want to use one-hot encoding for Strings for
linear regression, but simple indexing for trees. This dependence is awkward
for the user, so we may eventually want each algorithm to handle its own
feature assembly if needed.
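
To make the point above concrete, here is a minimal sketch in plain Python (not Spark code) of an assembler whose output depends on the downstream algorithm. The names `build_index` and `assemble_strings`, and the algorithm labels "linear" and "tree", are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def build_index(values: Sequence[str]) -> Dict[str, int]:
    """Map each distinct string to an index, in order of first appearance."""
    index: Dict[str, int] = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index


def assemble_strings(values: Sequence[str], algorithm: str) -> List[List[float]]:
    """Encode a string column differently depending on the algorithm.

    Hypothetical helper: "linear" yields one-hot rows, "tree" yields a
    single index per row.
    """
    index = build_index(values)
    if algorithm == "linear":
        # One-hot: one column per distinct string, 1.0 in the matching column.
        rows = []
        for v in values:
            row = [0.0] * len(index)
            row[index[v]] = 1.0
            rows.append(row)
        return rows
    if algorithm == "tree":
        # Simple indexing: a single categorical column holding the index.
        return [[float(index[v])] for v in values]
    raise ValueError(f"unknown algorithm: {algorithm}")
```

The same input column produces two different feature layouts, which is why a user who picks the encoding by hand has to know which algorithm comes next.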
About categorical types for decision trees: There should ideally be a
distinction between categorical types with arbitrary values and categorical
types known to be in a range {0, 1, ..., numCategories-1}.
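
The distinction can be sketched as follows, again in plain Python rather than Spark. The helpers `is_range_categorical` and `to_range_categorical` are hypothetical names, and this is only one possible way to draw the line: values already in {0, 1, ..., numCategories-1} can be used as category indices directly, while arbitrary values need an indexing pass first.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def is_range_categorical(values: Sequence[object], num_categories: int) -> bool:
    """True if every value is an int in {0, 1, ..., num_categories - 1}."""
    return all(
        isinstance(v, int) and not isinstance(v, bool) and 0 <= v < num_categories
        for v in values
    )


def to_range_categorical(
    values: Sequence[Hashable],
) -> Tuple[List[int], Dict[Hashable, int]]:
    """Index arbitrary categorical values into {0, ..., k - 1}.

    Returns the indexed column and the value-to-index mapping, assigned
    in order of first appearance.
    """
    mapping: Dict[Hashable, int] = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping
```

A column that passes `is_range_categorical` could be handed to a tree as-is; any other categorical column would go through `to_range_categorical` first, and the mapping would need to be kept to interpret the model.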