Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75458264
  
    > To me, that means that the idea of a string-valued categorical column 
still has a place in the representation since it exists at some stage of a 
pipeline. It's just that such a thing would never reach an algorithm as-is. Is 
that aligned with what you guys think?
    
    I agree: ML types should be applicable to any SQL type, as long as it makes 
sense.
    
    > As long as you aren't saying "string" is a type mutually exclusive with 
"categorical" then I think we're saying the same thing.
    
    I think we are saying the same thing.  (There will be some mutually 
exclusive types, such as Strings not being continuous.)
    
    > Interesting point: are numeric categorical values always assumed to be in 
{0, 1, 2, ..., n-1}? 
    
    Algorithms which want 0-based indices for categories (for efficient vector 
indexing) could handle the re-indexing themselves, but it would be nice to 
encode it in metadata for the benefit of ensemble algorithms (where you would 
only want to do the re-indexing once).
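
    A minimal sketch of that idea, assuming plain Python rather than Spark: the category-to-index mapping is computed once, stored as column metadata, and reused by every consumer. `CategoricalMetadata`, `build_index`, and `reindex` are hypothetical names for illustration, not part of any Spark API.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Sequence


@dataclass
class CategoricalMetadata:
    """Hypothetical column metadata: maps each raw category to a 0-based index."""
    index_of: Dict[Hashable, int] = field(default_factory=dict)

    @property
    def num_categories(self) -> int:
        return len(self.index_of)


def build_index(values: Sequence[Hashable]) -> CategoricalMetadata:
    """Assign indices 0..n-1 to the distinct values.

    Assumes the values are mutually comparable so the ordering is deterministic.
    """
    distinct = sorted(set(values))
    return CategoricalMetadata({v: i for i, v in enumerate(distinct)})


def reindex(values: Sequence[Hashable], meta: CategoricalMetadata) -> List[int]:
    """Translate raw categories into the 0-based indices recorded in the metadata."""
    return [meta.index_of[v] for v in values]


# The mapping is built once and then shared, e.g. by every member of an ensemble.
raw = [10, 30, 10, 20, 30]
meta = build_index(raw)
indexed = reindex(raw, meta)
```

    Each algorithm in an ensemble could then read `meta` instead of rebuilding the mapping, which is the "re-index only once" benefit described above.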
    
    > @jkbradley The issue with algorithms handling munging and indexing is the 
increased complexity. 
    
    @mengxr  It won't really increase the complexity of the code much, since the 
same code could be re-used for all algorithms (with a few options for the 
feature types the algorithm can handle).  The main issue is the API:
    * Do users want to be able to call an algorithm on any dataset they load, 
without thinking about the feature types?
      * If so, does the implicit featurization belong in the algorithm (which 
knows what types it can take) or in a featurizer PipelineStage before the 
algorithm (where the user would have to specify feature types based on the 
algorithm being used)?
    * Or do we want to force users to examine the features and select types by 
hand before running an algorithm?
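
    To make the API options concrete, here is a rough sketch in plain Python, not Spark code. `Featurizer`, `ToyAlgorithm`, `infer_type`, and the type labels are all hypothetical. The same shared featurization code is either called implicitly inside the algorithm or applied by the user as a separate stage beforehand.

```python
from typing import List, Sequence, Set

CATEGORICAL, CONTINUOUS = "categorical", "continuous"


def infer_type(column: Sequence) -> str:
    """Naive inference (an assumption): all-string columns are categorical."""
    return CATEGORICAL if all(isinstance(v, str) for v in column) else CONTINUOUS


class Featurizer:
    """Shared helper; `supported` lists the feature types the algorithm accepts."""

    def __init__(self, supported: Set[str]):
        self.supported = supported

    def transform(self, column: Sequence) -> List[float]:
        kind = infer_type(column)
        if kind not in self.supported:
            raise TypeError(f"algorithm does not support {kind} features")
        if kind == CATEGORICAL:
            # Map each distinct category to a 0-based index.
            index = {v: i for i, v in enumerate(sorted(set(column)))}
            return [float(index[v]) for v in column]
        return [float(v) for v in column]


class ToyAlgorithm:
    """Stand-in for a learning algorithm; it just averages its features."""

    SUPPORTED = {CATEGORICAL, CONTINUOUS}

    def fit(self, column: Sequence) -> float:
        # Implicit featurization: the algorithm knows which types it can take.
        return self.fit_featurized(Featurizer(self.SUPPORTED).transform(column))

    def fit_featurized(self, features: Sequence[float]) -> float:
        # Explicit path: the caller has already chosen and applied feature types.
        return sum(features) / len(features)


raw = ["b", "a", "b"]
implicit = ToyAlgorithm().fit(raw)
explicit = ToyAlgorithm().fit_featurized(
    Featurizer({CATEGORICAL, CONTINUOUS}).transform(raw)
)
```

    Both paths give the same result here; the difference is only who has to know which feature types the algorithm can handle.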
    
    I've argued for algorithms handling featurization before, but I can see 
the case for forcing users to know what they are doing.  This discussion may not 
belong in this PR anyway, since this functionality could be added later on.

