GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75223695
@srowen If we mark a string column categorical, it may be hard to answer
how many categories it has without looking at the data. If a column is marked
categorical, it should store the information about its categories in the
metadata and store only integer/double values in the column. The difference is
that we cannot merge a string column with a double column into a vector column
without first turning the string column into an integer/double column. So I
feel that we should treat string columns and categorical columns separately.
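To make the distinction concrete, here is a minimal sketch of turning a string column into a categorical one: the labels go into what would be the metadata, and the column itself keeps only double indices. `CategoricalIndexing` and `index` are hypothetical names used for illustration, not Spark API, and ordering categories by first appearance is an arbitrary assumption.

```scala
// Hypothetical helper, not part of Spark.
object CategoricalIndexing {
  // Returns (category labels for the metadata, double-valued indices for the column).
  def index(values: Seq[String]): (Vector[String], Seq[Double]) = {
    // Assumption: categories are ordered by first appearance in the data.
    val categories = values.distinct.toVector
    val lookup = categories.zipWithIndex.toMap
    val indexed = values.map(v => lookup(v).toDouble)
    (categories, indexed)
  }
}
```

After indexing, the number of categories can be read from the label vector without another pass over the data, and the double-valued column could be merged with other numeric columns.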
For the type hierarchy, if we add a type, let's discuss what algorithms
would actually use it:
1. numeric (slightly broader than continuous): linear algorithms can take
them, while decision trees need to bin them first (if the number of distinct
values is large)
2. categorical: decision trees can take them, while linear algorithms need
dummy coding.
3. binary: both linear algorithms and decision trees can take them directly.
4. discrete/ordinal: ?
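As an illustration of the dummy coding that linear algorithms would need for a categorical feature, here is a rough sketch. `DummyCoding.encode` is a hypothetical helper, and it produces the full one-hot variant (one slot per category) rather than dropping a reference level.

```scala
// Hypothetical helper, not part of Spark.
object DummyCoding {
  // Encodes a category index as a dense vector with a single 1.0 entry.
  def encode(index: Int, numCategories: Int): Array[Double] = {
    require(index >= 0 && index < numCategories,
      s"index $index out of range for $numCategories categories")
    val out = new Array[Double](numCategories) // initialized to 0.0
    out(index) = 1.0
    out
  }
}
```

Note that `numCategories` has to be known up front, which is why the category information needs to be available in the metadata.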
Maybe for the base `Attribute` trait, we can have `cardinality: Int`. For
continuous data, we put `inf` or `-1`; for ordinal data, we put the number of
distinct values. Let's try to write down the API and see how it looks.
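One possible way to write this down is sketched below. Only `Attribute` and `cardinality: Int` come from the proposal above; the subclass names, the `name` field, and the wrapping object are assumptions made for illustration. Since `Int` has no infinity value, the sketch uses `-1` as the sentinel for continuous data.

```scala
// A sketch under stated assumptions, not the final API.
object AttributeSketch {
  sealed trait Attribute {
    def name: String
    // Number of distinct values, or -1 when it is unbounded (continuous data).
    def cardinality: Int
  }

  // Hypothetical subclass: continuous/numeric feature.
  final case class NumericAttribute(name: String) extends Attribute {
    val cardinality: Int = -1
  }

  // Hypothetical subclass: categories are kept with the attribute, not in the column.
  final case class CategoricalAttribute(name: String, categories: Vector[String])
      extends Attribute {
    def cardinality: Int = categories.length
  }

  // Hypothetical subclass: always exactly two values.
  final case class BinaryAttribute(name: String) extends Attribute {
    val cardinality: Int = 2
  }
}
```

With this shape, an algorithm could branch on `cardinality` alone, for example binning a feature only when the value is `-1` or exceeds some threshold.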
Not a concern at this stage, but for `AttributeGroup` we might need to handle
the storage efficiently to avoid GC pressure. A dataset may have millions of
features, and we will hit GC problems if we use too many objects to store the
attributes. Well, this may be too early to worry about.
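If it does become a concern, one option is to keep per-feature information in primitive arrays instead of allocating one object per attribute. The sketch below is only an assumption about how that could look: `CompactGroup` is a hypothetical class, and it tracks nothing but cardinalities, with `-1` standing for continuous features.

```scala
// Hypothetical layout, not part of Spark.
object CompactAttributes {
  // One Int per feature in a single array, rather than one object per feature.
  final class CompactGroup(val cardinalities: Array[Int]) {
    def size: Int = cardinalities.length
    // Assumption: -1 marks a continuous feature.
    def isContinuous(i: Int): Boolean = cardinalities(i) == -1
  }
}
```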