GitHub user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75225746
Don't you always have to look at the data to determine how many unique
values a column has, regardless of type? String and int are encodings, but
attribute types like categorical and continuous are interpretations. Those seem
orthogonal to me and I thought `Attribute` was only metadata representing the
attribute type, whereas the RDD schema already knows the actual column data
types. That is, there's no "string" attribute feature type, right?
From a user perspective, I'd be surprised if I had to encode categorical
string values as it seems like something a framework can easily do for me.
There's nothing inherently strange about providing a string-valued column that
is (necessarily) categorical and in fact they often are strings in input. But
there's no reason it couldn't encode the categorical values internally in a
different way if needed. Are you just referring to the latter? Sure, anything
can be done within the framework.
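
To illustrate the kind of internal encoding I mean, here is a minimal sketch in plain Python (not Spark code). The function name `index_categories` is hypothetical, and assigning indices in order of first appearance is just one assumed policy:

```python
def index_categories(values):
    """Map each distinct string to a double-valued index.

    Indices are assigned in order of first appearance (an assumption;
    a framework could equally sort by frequency or alphabetically).
    """
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = float(len(mapping))
        encoded.append(mapping[v])
    return mapping, encoded


# The user supplies strings; the doubles are an internal detail.
mapping, encoded = index_categories(["male", "female", "female", "male"])
```

The user-facing column stays string-valued; only the internal representation changes.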
So why can't a vector-valued column contain a string -- is the issue that
internally we can't have a single array-valued column with different data
types? That makes sense. Vector-valued columns are usually used for things like
a large number of indicator variables, or related counts, which are all of the
same data type, and not typically to encode complex nested schemas containing
different data types. But if you really wanted a vector-valued column
containing continuous age and categorical gender, in that case yes it seems
like the data representation limitations would demand they're of the same data
type, and must be doubles, and that's fine. But does that mean you can't ever have a
non-vector string-valued categorical feature?
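
A rough sketch of the age-plus-gender case, again in plain Python rather than Spark: the vector holds only doubles, and separate per-slot metadata says how to interpret each one. The `attributes` layout and the helper names are hypothetical, not the `Attribute` API in this PR:

```python
# Hypothetical per-slot metadata describing how to interpret each double.
attributes = [
    {"name": "age", "kind": "continuous"},
    {"name": "gender", "kind": "categorical", "values": ["male", "female"]},
]


def encode_row(row):
    """Encode a mixed-type record as a homogeneous list of doubles."""
    out = []
    for attr in attributes:
        raw = row[attr["name"]]
        if attr["kind"] == "categorical":
            # Store the category's position in the declared value list.
            out.append(float(attr["values"].index(raw)))
        else:
            out.append(float(raw))
    return out


def decode_row(vector):
    """Recover the original interpretation of each slot from the metadata."""
    row = {}
    for attr, x in zip(attributes, vector):
        if attr["kind"] == "categorical":
            row[attr["name"]] = attr["values"][int(x)]
        else:
            row[attr["name"]] = x
    return row
```

The encoding (double) and the interpretation (continuous vs. categorical) stay separate here, which is the orthogonality I was getting at.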
I agree that discrete and ordinal don't come up. I don't think they're
required as types, but may allow optimizations. For example, there's no point
in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature.
They're all the same rule. That optimization doesn't exist yet. I can't
actually think of a realistic optimization that would depend on knowing a value
is ordinal (1, 2, 3, ...), so I'd maybe drop that.
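
To make the discrete-feature point concrete, here is a small sketch in plain Python, assuming "discrete" means integer-valued. The helper `distinct_rules` is hypothetical and nothing like it exists in the tree code yet:

```python
import math


def distinct_rules(thresholds):
    """Collapse '>= t' thresholds that are equivalent on an integer-valued feature.

    For integer x, 'x >= t' holds exactly when 'x >= ceil(t)', so
    thresholds sharing a ceiling define the same decision rule.
    """
    return sorted({math.ceil(t) for t in thresholds})


def rule_outputs(threshold, xs):
    """Evaluate the rule 'x >= threshold' on each value."""
    return [x >= threshold for x in xs]


# 3.4, 3.5 and 3.6 all collapse to the single rule 'x >= 4'.
candidates = distinct_rules([3.4, 3.5, 3.6, 4.2])
```

Only one threshold per integer boundary needs to be evaluated, which is where the saving would come from.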
OK I will get to work on changes discussed so far.