Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4460#issuecomment-75225746

Don't you always have to look at the data to determine how many unique values a column has, regardless of type? String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me, and I thought `Attribute` was only metadata representing the attribute type, whereas the RDD schema already knows the actual column data types. That is, there's no "string" attribute feature type, right?

From a user's perspective, I'd be surprised if I had to encode categorical string values myself, as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical; in fact, categorical values often are strings in the input. But there's no reason the framework couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter? Sure, anything can be done within the framework.

So why can't a vector-valued column contain a string -- is the issue that internally we can't have a single array-valued column with different data types? That makes sense. Vector-valued columns are usually used for things like a large number of indicator variables, or related counts, which are all of the same data type, and not typically to encode complex nested schemas containing different data types. But if you really wanted a vector-valued column containing continuous age and categorical gender, then yes, the data representation limitations would demand they be of the same data type, and must be doubles, and that's fine. But does that mean you can't ever have a non-vector string-valued categorical feature?

I agree that discrete and ordinal don't come up. I don't think they're required as types, but they may allow optimizations.
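To make the "framework can easily do it for me" point concrete, here is a minimal sketch of how a framework could index categorical string values automatically, so the user never hand-encodes them as doubles. The `CategoricalEncoder` name and its `fit`/`transform` methods are hypothetical, for illustration only, not Spark's actual API:

```scala
// Hypothetical sketch: a framework-side indexer that maps categorical
// string values to double indices, in order of first appearance.
object CategoricalEncoder {
  // Learn a string -> index mapping from the observed column values.
  def fit(values: Seq[String]): Map[String, Double] =
    values.distinct.zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

  // Re-encode a string column as doubles using the fitted mapping.
  def transform(values: Seq[String], mapping: Map[String, Double]): Seq[Double] =
    values.map(mapping)
}
```

With this, a string-valued categorical column like `Seq("male", "female", "female")` would be fed to the algorithm as `Seq(0.0, 1.0, 1.0)`, while the attribute metadata still records it as categorical.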
For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature; they're all the same rule. That optimization doesn't exist yet, though. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1, 2, 3, ...), so maybe I'd drop that one. OK, I will get to work on the changes discussed so far.
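The optimization described above can be sketched as follows: for a discrete feature, any two thresholds falling between the same pair of adjacent observed values induce an identical split, so one candidate per gap suffices. The `DiscreteSplits` object is illustrative only, not Spark's decision-tree implementation:

```scala
// Hedged sketch: collapse redundant candidate thresholds for a
// discrete feature down to one midpoint per adjacent pair of
// distinct observed values.
object DiscreteSplits {
  def candidateThresholds(observed: Seq[Double]): Seq[Double] = {
    val sorted = observed.distinct.sorted
    // sliding(2) walks adjacent pairs; each pair yields one midpoint.
    sorted.sliding(2).collect { case Seq(a, b) => (a + b) / 2.0 }.toSeq
  }
}
```

For observed values {3.0, 4.0}, this yields the single threshold 3.5 in place of testing >= 3.4, >= 3.5, and >= 3.6 separately, since all three partition the data identically.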