Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3000#issuecomment-62630207
@sryza Hi, yes, I didn't realize that they shared some functionality. It
would be great to coordinate. I think these two types of feature transformation
are quite different, but there is some shared underlying functionality.
Feature operations:
* Decide which features should be categorical (this PR)
* Relabel categorical feature values based on an index (this PR)
* Create new features by expanding a categorical feature (your PR)
* Collect statistics about dataset columns (both PRs)
The first 3 operations seem fairly distinct to me. But the last one (which
does not really need to be exposed to users) could definitely be shared.
We both need to know how many distinct values there are in a column, with
some extra options. (You need to specify a subset of columns, and I need to
limit the number of distinct values at some point.) Perhaps we could combine
these into some sort of stats collector (maybe private[mllib] for now?) which
we can both use. I'd be happy to do that, but let me know if you'd prefer to.
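To make the idea concrete, here is a minimal local sketch of what such a shared stats collector might look like. All names here (`ColumnStats`, `distinctValues`, `maxDistinct`) are hypothetical and chosen for illustration only; a real version would presumably live in `private[mllib]`, operate on an `RDD[Vector]` with an aggregate/merge step per partition, and not on a local `Seq` as below.

```scala
// Hypothetical sketch of a shared column-stats collector (names are
// illustrative, not from either PR). It counts distinct values per column,
// optionally restricted to a subset of columns, and stops tracking a
// column once it exceeds maxDistinct distinct values.
object ColumnStats {

  /** Per-column result: Some(values) if under the cap, None once exceeded. */
  type DistinctCounts = Map[Int, Option[Set[Double]]]

  def distinctValues(
      rows: Seq[Array[Double]],
      columns: Option[Set[Int]] = None,  // subset of columns (None = all)
      maxDistinct: Int = Int.MaxValue    // cap on distinct values tracked
  ): DistinctCounts = {
    if (rows.isEmpty) return Map.empty
    val tracked = columns.getOrElse(rows.head.indices.toSet)
    // Accumulator: column index -> distinct values seen so far,
    // or None once the cap is exceeded for that column.
    val acc = scala.collection.mutable.Map[Int, Option[Set[Double]]](
      tracked.toSeq.map(c => c -> Option(Set.empty[Double])): _*)
    for (row <- rows; c <- tracked) {
      acc(c) = acc(c).flatMap { seen =>
        val updated = seen + row(c)
        if (updated.size > maxDistinct) None else Some(updated)
      }
    }
    acc.toMap
  }
}
```

This shape would cover both use cases: one caller passes `columns = Some(...)` to restrict the scan, the other passes a `maxDistinct` cap so columns with too many values are dropped early instead of accumulating unbounded sets.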