Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-74623603
There are two types of `Attribute(s)`: one describes a feature group (a vector
column) and the other describes a single feature (a scalar column). For a feature group,
the column name becomes the group name and individual features inside this
group may have their own names. For example, we have a vector column called
`user` and inside this feature group we can have features named `age` and
`gender`. When we merge multiple groups into a single feature vector, e.g., in
a feature vector assembler, the names are flattened to `user:age` and
`user:gender`. This answers @sryza's question about one-hot encoding. Assume
that the input column is a scalar column called "country" with categories
stored in the attribute. Then OneHotEncoder will output a vector column and
generate feature attributes with names like `country:US`, `country:CA`, etc.
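
To make the naming scheme concrete, here is a minimal plain-Python sketch. It
does not use any Spark API: `flatten_names` and `one_hot_names` are
hypothetical helpers invented for this illustration, and the `group:feature`
separator simply follows the examples above.

```python
from typing import Dict, List


def flatten_names(groups: Dict[str, List[str]]) -> List[str]:
    """Prefix every feature name with the name of its group (vector column).

    Hypothetical helper: mimics what a feature vector assembler might do
    when merging several feature groups into one vector.
    """
    return [
        f"{group}:{feature}"
        for group, features in groups.items()
        for feature in features
    ]


def one_hot_names(column: str, categories: List[str]) -> List[str]:
    """Name each one-hot output feature after the input column and category.

    Hypothetical helper: assumes the categories are already stored with the
    scalar input column, as described above.
    """
    return [f"{column}:{category}" for category in categories]


print(flatten_names({"user": ["age", "gender"]}))  # ['user:age', 'user:gender']
print(one_hot_names("country", ["US", "CA"]))      # ['country:US', 'country:CA']
```

Both cases reduce to the same rule: the column name acts as the group name and
is prepended to the name of each feature inside the group.
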

+1 on @jkbradley's suggestion about not calling it `FeatureAttribute`.
`Attribute` should be okay to describe a scalar column, but we also need a name
to describe a vector column, and `Attributes` may sound a little confusing for that.
I suggest `AttributeGroup`.

We don't need to care about the `FeatureType` in `mllib.tree` in this PR.
Once we have this PR merged, we can migrate the decision tree code.