GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75106841
> Rename FeatureType? and what's its value for AttributeGroup? GROUP or
> null?

I wish we could use `type`, but it is already taken by Scala. `DataType` is
taken by SQL. So `DatumType` or `MLDataType`? ... I don't really have good
suggestions. I'm not sure whether we should make `AttributeGroup` an
`Attribute`. What is the benefit of making it an `Attribute`?

> You could imagine a more elaborate hierarchy of types: discrete is a
> special case of continuous, ordinal is a special case of discrete. It's nice to
> have that expressiveness; it adds somewhat to the complexity for the caller and
> the code. Maybe you could argue that the schema should force an interpretation
> for the algorithm. But I kind of like it. The type objects would have methods
> like isContinuous, isCategorical. Should I make a fuller hierarchy or stick to
> adding BINARY?

I think having a full hierarchy is a good idea. Could you list all of the
types you want to include? Then we can check the complexity. Btw, I don't know
whether we should have ML attributes attached to string columns. It seems to me
that a string column should be mapped to an integer column first to become an
ML column with attributes. Hopefully that reduces the complexity.
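
To make the two ideas above concrete, here is a minimal Scala sketch. It is not code from this PR. All names (`AttributeSketch`, `FeatureKind`, `isA`, `indexStrings`) are hypothetical, and placing `Binary` under `Ordinal` is purely an assumption; the thread does not settle where it, or a categorical type, would sit.

```scala
// Hypothetical sketch only: none of these names come from the PR.
object AttributeSketch {

  // Each kind optionally names the more general kind it specializes.
  sealed abstract class FeatureKind(val parent: Option[FeatureKind]) {
    // True if this kind is `other` or a special case of it.
    def isA(other: FeatureKind): Boolean =
      this == other || parent.exists(_.isA(other))

    def isContinuous: Boolean = isA(FeatureKind.Continuous)
    def isDiscrete: Boolean = isA(FeatureKind.Discrete)
    def isOrdinal: Boolean = isA(FeatureKind.Ordinal)
  }

  object FeatureKind {
    // "discrete is a special case of continuous,
    //  ordinal is a special case of discrete"
    case object Continuous extends FeatureKind(None)
    case object Discrete extends FeatureKind(Some(Continuous))
    case object Ordinal extends FeatureKind(Some(Discrete))
    // Assumption: the thread does not say where BINARY belongs.
    case object Binary extends FeatureKind(Some(Ordinal))
  }

  // Map a string column to integer indices (first occurrence gets 0),
  // returning the indexed column together with the label-to-index map.
  def indexStrings(values: Seq[String]): (Seq[Int], Map[String, Int]) = {
    val index = values.distinct.zipWithIndex.toMap
    (values.map(index), index)
  }
}
```

With this shape, an algorithm that only understands continuous features can accept any kind for which `isContinuous` holds, and ML attributes would attach only to the integer column produced by something like `indexStrings`, never to the raw strings.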