Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75223695
  
    @srowen If we mark a string column as categorical, it may be hard to tell 
how many categories it has without looking at the data. A column marked 
categorical should store its category information in the metadata and hold 
only integer/double values in the column itself. The practical difference is 
that we cannot merge a string column with a double column into a vector column 
without first turning the string column into an integer/double column. So I 
feel that we should treat string columns and categorical columns separately.
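    
    To make the distinction concrete, here is a minimal sketch of turning a 
string column into a double column plus category metadata. It is plain Scala, 
not Spark API; `CategoricalMetadata` and `StringIndexingSketch` are 
hypothetical names, and assigning indices by first appearance is an assumption.

```scala
// Hypothetical helper, not Spark API: index a string column into doubles
// and keep the category labels as separate metadata.
final case class CategoricalMetadata(categories: Vector[String]) {
  def numCategories: Int = categories.length
}

object StringIndexingSketch {
  def index(column: Seq[String]): (Seq[Double], CategoricalMetadata) = {
    // Assumption: indices follow order of first appearance.
    // Collecting the categories requires a pass over the data.
    val categories = column.distinct.toVector
    val toIndex = categories.zipWithIndex.toMap
    (column.map(s => toIndex(s).toDouble), CategoricalMetadata(categories))
  }
}
```

    For example, `StringIndexingSketch.index(Seq("a", "b", "a", "c"))` gives 
the values `0.0, 1.0, 0.0, 2.0` and metadata with three categories, so the 
number of categories can then be answered from the metadata alone.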
    
    For the type hierarchy, if we add a type, let's discuss what algorithms 
would actually use it:
    
    1. numeric (slightly broader than continuous): linear algorithms can take 
them directly, while decision trees need to bin them first (if the number of 
distinct values is large).
    2. categorical: decision trees can take them directly, while linear 
algorithms need dummy coding.
    3. binary: both linear algorithms and decision tree can take them directly.
    4. discrete/ordinal: ?
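    
    The list above could be written down roughly as follows. This is only a 
sketch: the type names and the `maxBins` threshold for "large" are 
assumptions, and discrete/ordinal is left out because its handling is still 
an open question.

```scala
// Hypothetical names for illustration; none of these are existing Spark types.
sealed trait FeatureKind
case object NumericFeature extends FeatureKind
case object CategoricalFeature extends FeatureKind
case object BinaryFeature extends FeatureKind

object PreprocessingSketch {
  // What a linear algorithm needs before it can consume the feature.
  def forLinear(kind: FeatureKind): String = kind match {
    case NumericFeature     => "none"
    case CategoricalFeature => "dummy coding"
    case BinaryFeature      => "none"
  }

  // What a decision tree needs; maxBins is an assumed cutoff for
  // "the number of distinct values is large".
  def forTree(kind: FeatureKind, numDistinct: Long, maxBins: Int = 32): String =
    kind match {
      case NumericFeature if numDistinct > maxBins => "binning"
      case NumericFeature     => "none"
      case CategoricalFeature => "none"
      case BinaryFeature      => "none"
    }
}
```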
    
    Maybe the base `Attribute` trait can have `cardinality: Int`. For 
continuous data we put `inf` or `-1`; for ordinal data we put the number of 
distinct values. Let's try to write down the API and see how it looks.
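    
    A first rough sketch of that API, to have something to look at. Only 
`Attribute` and `cardinality: Int` come from the proposal; the subclass names 
and the `Unbounded` constant are assumptions, and this sketch picks `-1` 
because `Int` has no infinity.

```scala
// Sketch only: an illustration of the proposal, not a finalized Spark API.
trait Attribute {
  /** Number of distinct values, or Attribute.Unbounded for continuous data. */
  def cardinality: Int
}

object Attribute {
  // Assumed sentinel for "infinitely many values".
  val Unbounded: Int = -1
}

// Hypothetical subclass for continuous data.
final case class ContinuousAttribute() extends Attribute {
  val cardinality: Int = Attribute.Unbounded
}

// Hypothetical subclass for ordinal data with a known number of values.
final case class OrdinalAttribute(numValues: Int) extends Attribute {
  require(numValues > 0, "an ordinal attribute needs at least one value")
  val cardinality: Int = numValues
}
```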
    
    Not a concern at this stage, but `AttributeGroup` might need to store 
its attributes efficiently to avoid GC pressure. A dataset may come with 
millions of features, and we will hit GC problems if we use too many objects 
to store the attributes. This is probably too early to worry about, though.
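    
    One possible direction, purely as a sketch: keep the per-feature 
information in a primitive array instead of allocating one object per 
attribute. `CompactAttributeGroup` is a hypothetical name, and it assumes the 
`-1` convention for continuous features mentioned above.

```scala
// Hypothetical compact layout: a single Array[Int] holds every cardinality,
// so millions of features cost one array rather than millions of objects.
final class CompactAttributeGroup(cardinalities: Array[Int]) {
  def size: Int = cardinalities.length

  def cardinality(i: Int): Int = cardinalities(i)

  // Assumption: -1 marks a continuous feature.
  def isContinuous(i: Int): Boolean = cardinalities(i) == -1
}
```

    A group of one million continuous features would then be 
`new CompactAttributeGroup(Array.fill(1000000)(-1))`.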


