GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75225746
  
    Don't you always have to look at the data to determine how many unique
values a column has, regardless of type? String and int are encodings, but
attribute types like categorical and continuous are interpretations. Those seem
orthogonal to me, and I thought `Attribute` was only metadata representing the
attribute type, whereas the RDD schema already knows the actual column data
types. That is, there's no "string" attribute feature type, right?
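
    To illustrate the orthogonality, here is a minimal sketch with
hypothetical types (not the PR's actual API): the attribute type is pure
metadata, independent of whether the column happens to be stored as strings
or ints.

```scala
// Hypothetical types, not the PR's actual API: the attribute type is
// metadata, orthogonal to how the column is stored.
sealed trait AttributeType
case object Continuous extends AttributeType
// A categorical attribute may record its distinct values once discovered,
// but says nothing about whether the column holds strings or ints.
case class Categorical(values: Option[Seq[String]] = None) extends AttributeType

case class ColumnMeta(name: String, attrType: AttributeType)

object AttributeSketch extends App {
  // Both columns could be backed by strings or ints in the data itself;
  // the schema, not this metadata, knows the storage type.
  val age    = ColumnMeta("age", Continuous)
  val gender = ColumnMeta("gender", Categorical(Some(Seq("f", "m"))))
  println(Seq(age, gender).mkString("\n"))
}
```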
    
    From a user perspective, I'd be surprised if I had to encode categorical
string values myself; it seems like something a framework can easily do for me.
There's nothing inherently strange about providing a string-valued column that
is (necessarily) categorical, and in fact they often are strings in input. But
there's no reason the framework couldn't encode the categorical values
internally in a different way if needed. Are you just referring to the latter?
Sure, anything can be done within the framework.
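
    As a minimal sketch of the kind of encoding the framework could do for the
user (a hypothetical helper, not an existing API here): map distinct string
categories to integer indices internally.

```scala
// Hypothetical helper: index string categories so the user never has to
// encode them by hand.
object StringEncoderSketch extends App {
  def indexCategories(column: Seq[String]): (Map[String, Int], Seq[Int]) = {
    // One pass to find the distinct values, then encode each as its index.
    val index = column.distinct.zipWithIndex.toMap
    (index, column.map(index))
  }

  val (index, encoded) = indexCategories(Seq("red", "blue", "red", "green"))
  println(index)   // Map(red -> 0, blue -> 1, green -> 2)
  println(encoded) // List(0, 1, 0, 2)
}
```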
    
    So why can't a vector-valued column contain a string -- is the issue that
internally we can't have a single array-valued column with different data
types? That makes sense. Vector-valued columns are usually used for things like
a large number of indicator variables, or related counts, which are all of the
same data type, and not typically to encode complex nested schemas containing
different data types. But if you really wanted a vector-valued column
containing continuous age and categorical gender, then yes, it seems like the
data representation limitations would demand they're of the same data type, and
must be doubles, and that's fine. But does that mean you can't ever have a
non-vector string-valued categorical feature?
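
    A minimal sketch of that limitation (hypothetical names): a vector-valued
column must be homogeneous, so mixed features all become doubles, with the
categorical value carried as an encoded index.

```scala
// Continuous age plus categorical gender, forced into one Array[Double]
// because the vector column cannot mix data types.
object VectorColumnSketch extends App {
  val genderIndex = Map("f" -> 0.0, "m" -> 1.0)
  val row: Array[Double] = Array(34.0, genderIndex("f"))
  println(row.mkString("[", ", ", "]")) // [34.0, 0.0]
}
```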
    
    I agree that discrete and ordinal don't come up. I don't think they're
required as types, but they may allow optimizations. For example, there's no
point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete
feature; they're all the same rule. That optimization doesn't exist yet. I
can't actually think of a realistic optimization that would depend on knowing a
value is ordinal (1, 2, 3, ...), so I'd maybe drop that.
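
    To make the discrete-feature optimization concrete (a sketch only; again,
this isn't implemented in the PR): thresholds falling between the same pair of
observed values are all the same rule, so one candidate per gap between
distinct sorted values suffices.

```scala
// Sketch of the split-candidate reduction for a discrete feature.
object SplitCandidatesSketch extends App {
  def candidateThresholds(values: Seq[Double]): Seq[Double] = {
    val distinctSorted = values.distinct.sorted
    // Midpoints between consecutive distinct values; for values {3, 4},
    // the rules >= 3.4, >= 3.5, >= 3.6 all collapse to >= 3.5.
    distinctSorted.sliding(2).collect { case Seq(a, b) => (a + b) / 2 }.toSeq
  }

  println(candidateThresholds(Seq(3.0, 4.0, 4.0, 7.0))) // List(3.5, 5.5)
}
```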
    
    OK, I will get to work on the changes discussed so far.

