Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4735#issuecomment-75877168
  
    The `label` in the name is quite general. It could refer to the labels 
used in classification or to an arbitrary column of string labels. This matches 
`LabelEncoder` in sklearn. The `LabelIndexer` should also create ML 
attributes from the labels.
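
For illustration, here is a minimal sketch of the `string labels -> indices` idea in plain Python. It is not the `LabelIndexer` implementation from this PR; the function names `fit_label_index` and `transform` are hypothetical, and assigning indices in sorted label order is an assumption.

```python
def fit_label_index(values):
    """Collect the distinct labels and assign each one an integer index.

    Sorted order is an assumption made for this sketch; any stable
    ordering would work as long as it is stored with the output.
    """
    labels = sorted(set(values))
    index = {label: i for i, label in enumerate(labels)}
    return labels, index


def transform(values, index):
    """Replace every string label with its integer index."""
    return [index[v] for v in values]


column = ["cat", "dog", "cat", "bird"]
labels, index = fit_label_index(column)
encoded = transform(column, index)

# The ordered label list plays the role of the column's attributes:
# it is what lets a later stage map indices back to the strings.
decoded = [labels[i] for i in encoded]
```

Keeping the ordered label list next to the encoded column is what makes the indices interpretable downstream, which is the reason to attach the labels as attributes.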
    
    #3000 takes an `RDD[Vector]` and tries to decide which columns should be 
categorical. I don't see an overlap in terms of functionality. If the plan is 
to make `DatasetIndexer` a feature transformer that handles both 
`string labels -> indices` and `numeric -> categorical`, we can definitely go 
in this direction. Btw, relying on a single `maxCategories` threshold might 
cause unexpected problems, e.g., a categorical feature with more distinct 
values than the threshold being treated as continuous. We may still need to 
ask users to specify which columns are categorical.
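
A rough sketch of the `maxCategories` concern, in plain Python rather than Spark. The helper `guess_categorical`, the column names, and the sample values are all hypothetical; the heuristic shown (categorical if the number of distinct values is at most the threshold) is an assumption about how such a setting would behave, not the code from #3000.

```python
def guess_categorical(columns, max_categories):
    """Guess, per column, whether it is categorical.

    Assumed heuristic: a column is categorical when it has at most
    `max_categories` distinct values.
    """
    return {
        name: len(set(values)) <= max_categories
        for name, values in columns.items()
    }


columns = {
    "binary_flag": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    # Categorical codes with 6 levels, more than the threshold below.
    "state_code": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [20.5, 21.0, 19.8, 22.3, 20.1, 23.7],
}

guessed = guess_categorical(columns, max_categories=4)

# Letting users name categorical columns explicitly overrides the guess.
user_specified = {"state_code"}
final = {
    name: (name in user_specified) or is_categorical
    for name, is_categorical in guessed.items()
}
```

With the threshold at 4, `state_code` is guessed to be continuous even though it holds category codes; the explicit user list corrects that.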


