GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75366539
  
    @mengxr Understood about being told the number of distinct values for a 
column by the caller/schema. I thought you were saying this was a difference 
between string / integer columns. Yes, the point of this value is that it can 
be fed in as metadata.
    
    I think we are saying the same thing regarding strings. Yes, there's no 
point in every algorithm handling string types. It's fine if they only consume 
numeric types, and some separate optional transformation stage re-encodes 
strings if needed. That's a transformation stage the framework could provide, 
for users to drop in. Algorithms could indeed refuse to operate on string types.
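
    As a rough illustration of that split, here is a minimal Python sketch, not 
anything from this PR: a standalone re-encoding stage that maps strings to 
integer indices, plus an algorithm-side guard that refuses string input. The 
names `index_strings` and `require_numeric` are hypothetical, and assigning 
indices in order of first appearance is just one possible choice.

```python
from typing import Dict, List, Sequence, Tuple


def index_strings(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Hypothetical re-encoding stage: map each distinct string to an integer.

    Indices are assigned in order of first appearance (an assumption made
    for this sketch; ordering by frequency would work just as well).
    """
    mapping: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping


def require_numeric(column: Sequence[object]) -> None:
    """Hypothetical guard an algorithm could run before consuming a column."""
    if any(isinstance(value, str) for value in column):
        raise TypeError("string-valued column; re-encode it before training")


colors = ["red", "green", "red", "blue"]
encoded, mapping = index_strings(colors)
require_numeric(encoded)  # passes: the column is numeric after re-encoding
```

    The string-valued column still exists before the stage runs; it simply 
never reaches the algorithm in that form.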
    
    To me, that means that the idea of a string-valued categorical column still 
has a place in the representation since it exists at some stage of a pipeline. 
It's just that such a thing would never reach an algorithm as-is. Is that 
aligned with what you guys think?
    
    Let me get down to the code to see if this actually matters to the 
metadata. Right now nothing about this PR has to do with the underlying data 
type. As long as you aren't saying "string" is a type mutually exclusive with 
"categorical" then I think we're saying the same thing.
    
    Yes, I think my tree example wasn't a good one; never mind. The algorithm is 
already going to choose only actual data values as split points. Hm. It might 
matter to algorithms that want to restrict themselves to discrete inputs, like 
multinomial naive Bayes? Even there, I don't think it's an example of requiring 
the distinction. I'm neutral about adding 'discrete' as a type, I suppose.
    
    Interesting point: are numeric categorical values always assumed to be in 
{0, 1, 2, ..., n-1}? That strikes me as something that is often true, but not 
always. A particular algorithm may require it, and may provide an optional 
transformation to put a feature into this form. It doesn't seem like a 
conceptually different type so much as an additional restriction on a 
categorical feature's representation. That is, is this maybe an additional bit 
of metadata? Hm.
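
    One way to picture the "additional bit of metadata" reading is the Python 
sketch below. It is only an assumption about how this could look, not the 
design in this PR: `CategoricalInfo`, `describe`, and `to_zero_based` are 
hypothetical names. The number of values can be supplied by the caller or 
inferred from the data, and a separate flag records whether the values already 
lie in {0, 1, ..., n-1}.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CategoricalInfo:
    """Hypothetical metadata for one categorical feature."""
    num_values: int   # n, the number of distinct categories
    zero_based: bool  # True if every value lies in {0, 1, ..., n-1}


def describe(values: Sequence[int],
             num_values: Optional[int] = None) -> CategoricalInfo:
    """Build metadata; n comes from the caller if given, else from the data."""
    distinct = set(values)
    n = len(distinct) if num_values is None else num_values
    return CategoricalInfo(n, all(0 <= v < n for v in distinct))


def to_zero_based(values: Sequence[int]) -> Tuple[List[int], Dict[int, int]]:
    """Optional transformation: remap values onto 0..n-1 in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


raw = [10, 20, 10]
info_before = describe(raw)            # zero_based is False here
remapped, mapping = to_zero_based(raw)
info_after = describe(remapped)        # zero_based is True after remapping
```

    In this sketch the feature stays categorical before and after the 
transformation; only the flag changes, which is why it reads as a restriction 
on the representation rather than a separate type.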
    
    Let me make some code changes to reflect the discussion so far.

