GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75458264
> To me, that means that the idea of a string-valued categorical column
> still has a place in the representation since it exists at some stage of a
> pipeline. It's just that such a thing would never reach an algorithm as-is. Is
> that aligned with what you guys think?
I agree: ML types should be applicable to any SQL type, as long as it makes
sense.
> As long as you aren't saying "string" is a type mutually exclusive with
> "categorical" then I think we're saying the same thing.
I think we are saying the same thing. (Some combinations will still be
mutually exclusive; for example, a String column cannot be continuous.)
> Interesting point: are numeric categorical values always assumed to be in
> {0, 1, 2, ..., n-1}?
Algorithms which want 0-based indices for categories (for efficient vector
indexing) could handle the re-indexing themselves, but it would be nice to
encode it in metadata for the benefit of ensemble algorithms (where you would
only want to do the re-indexing once).
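To make the "re-index once, record it in metadata" idea concrete, here is a minimal sketch in plain Python. It is not Spark's API: `build_category_index`, `reindex`, and the `metadata` dictionary layout are hypothetical names chosen for illustration. The point is only that the mapping to {0, ..., n-1} is computed a single time and then shared, so each member of an ensemble can reuse it instead of re-indexing.

```python
from typing import Dict, Hashable, Iterable, List


def build_category_index(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct category to an index in 0..n-1 (order of first appearance)."""
    index: Dict[Hashable, int] = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


def reindex(values: Iterable[Hashable], index: Dict[Hashable, int]) -> List[int]:
    """Replace raw category values with their 0-based indices."""
    return [index[value] for value in values]


# Raw categorical column whose values are numeric but not 0-based.
raw = [10, 30, 10, 20, 30]

# Hypothetical column metadata: computed once, stored alongside the column.
metadata = {"type": "categorical", "index": build_category_index(raw)}

# Every ensemble member reads the same stored mapping instead of rebuilding it.
indexed = reindex(raw, metadata["index"])
num_categories = len(metadata["index"])
```

With the mapping stored in metadata, an algorithm that needs dense vector indexing only has to check that the mapping is present before deciding whether to re-index.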
> @jkbradley The issue with algorithms handling munging and indexing is the
> increased complexity.
@mengxr It won't really increase complexity of the code much since the
same code could be re-used for all algorithms (with a few options for the
feature types the algorithm can handle). The main issue is the API:
* Do users want to be able to call an algorithm on any dataset they load,
without thinking about the feature types?
* If so, does the implicit featurization belong in the algorithm (which
knows what types it can take) or in a featurizer PipelineStage before the
algorithm (where the user would have to specify feature types based on the
algorithm being used)?
* Or do we want to force users to examine the features and select types by
hand before running an algorithm?
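As a rough illustration of the first two options, the sketch below shows one shared featurization helper that takes the set of feature types an algorithm can handle. Everything here is hypothetical (`featurize`, the `supported` set, and the two learner classes are not part of Spark), and the one-hot fallback for algorithms without categorical support is an assumption made for the example, not something proposed in this thread.

```python
from typing import Dict, Hashable, List, Set

CATEGORICAL = "categorical"
CONTINUOUS = "continuous"


def featurize(column: List[Hashable], ml_type: str, supported: Set[str]) -> List[List[float]]:
    """Turn one input column into feature rows the algorithm can consume."""
    if ml_type == CONTINUOUS:
        return [[float(x)] for x in column]
    if ml_type != CATEGORICAL:
        raise ValueError(f"unknown ML type: {ml_type}")

    # Index categories as 0..n-1 in order of first appearance.
    index: Dict[Hashable, int] = {}
    for value in column:
        index.setdefault(value, len(index))

    if CATEGORICAL in supported:
        # The algorithm handles categories natively: pass the index through.
        return [[float(index[value])] for value in column]

    # Assumed fallback: expand to one-hot columns for continuous-only algorithms.
    n = len(index)
    return [[1.0 if index[value] == i else 0.0 for i in range(n)] for value in column]


class HypotheticalTreeLearner:
    supported = {CATEGORICAL, CONTINUOUS}

    def prepare(self, column: List[Hashable], ml_type: str) -> List[List[float]]:
        # Implicit featurization inside the algorithm, which knows its own types.
        return featurize(column, ml_type, self.supported)


class HypotheticalLinearLearner:
    supported = {CONTINUOUS}

    def prepare(self, column: List[Hashable], ml_type: str) -> List[List[float]]:
        return featurize(column, ml_type, self.supported)


colors = ["red", "blue", "red"]
tree_rows = HypotheticalTreeLearner().prepare(colors, CATEGORICAL)
linear_rows = HypotheticalLinearLearner().prepare(colors, CATEGORICAL)
```

The same `featurize` helper could instead be called from a separate featurizer PipelineStage; the difference is that the user, rather than the algorithm, would then have to supply `supported`.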
I've argued for algorithms handling featurization before, but I can see the
case for forcing users to know what they are doing. This discussion may not
belong in this PR anyway, since this functionality could be added later on.