Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3000#issuecomment-62630207
@sryza Hi, yes, I didn't realize that they shared some functionality. It
would be great to coordinate. I think these two types of feature transformation
are quite different, but there is some shared underlying functionality.
Feature operations:
* Decide which features should be categorical (this PR)
* Relabel categorical feature values based on an index (this PR)
* Create new features by expanding a categorical feature (your PR)
* Collect statistics about dataset columns (both PRs)
The first 3 operations seem fairly distinct to me. But the last one (which
does not really need to be exposed to users) could definitely be shared.
We both need to know how many distinct values there are in a column, with
some extra options. (You need to specify a subset of columns, and I need to
limit the number of distinct values at some point.) Perhaps we could combine
these into some sort of stats collector (maybe private[mllib] for now?) which
we can both use. I'd be happy to do that, but let me know if you'd prefer to.
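To make the idea concrete, here is a minimal local sketch of what such a shared stats collector might look like. All names here (`ColumnStats`, `distinctValues`, `maxDistinct`) are hypothetical and chosen for illustration only; a real version would presumably live in `private[mllib]`, operate on an `RDD[Vector]` with an aggregate/merge step per partition, and not on a local `Seq` as below.

```scala
// Hypothetical sketch of a shared column-stats collector (names are
// illustrative, not from either PR). It counts distinct values per column,
// optionally restricted to a subset of columns, and stops tracking a
// column once it exceeds maxDistinct distinct values.
object ColumnStats {

  /** Per-column result: Some(values) if under the cap, None once exceeded. */
  type DistinctCounts = Map[Int, Option[Set[Double]]]

  def distinctValues(
      rows: Seq[Array[Double]],
      columns: Option[Set[Int]] = None,  // subset of columns (None = all)
      maxDistinct: Int = Int.MaxValue    // cap on distinct values tracked
  ): DistinctCounts = {
    if (rows.isEmpty) return Map.empty
    val tracked = columns.getOrElse(rows.head.indices.toSet)
    // Accumulator: column index -> distinct values seen so far,
    // or None once the cap is exceeded for that column.
    val acc = scala.collection.mutable.Map[Int, Option[Set[Double]]](
      tracked.toSeq.map(c => c -> Option(Set.empty[Double])): _*)
    for (row <- rows; c <- tracked) {
      acc(c) = acc(c).flatMap { seen =>
        val updated = seen + row(c)
        if (updated.size > maxDistinct) None else Some(updated)
      }
    }
    acc.toMap
  }
}
```

This shape would cover both use cases: one caller passes `columns = Some(...)` to restrict the scan, the other passes a `maxDistinct` cap so columns with too many values are dropped early instead of accumulating unbounded sets.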