GitHub user srowen commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75366539
@mengxr Understood about being told the number of distinct values for a
column by the caller/schema. I thought you were saying this was a difference
between string / integer columns. Yes, the point of this value is that it can
be fed in as metadata.
I think we are saying the same thing regarding strings. Yes, there's no
point in every algorithm handling string types. It's fine if they only consume
numeric types, and some separate optional transformation stage re-encodes
strings if needed. That's a transformation stage the framework could provide,
for users to drop in. Algorithms could indeed refuse to operate on string types.
To me, that means that the idea of a string-valued categorical column still
has a place in the representation since it exists at some stage of a pipeline.
It's just that such a thing would never reach an algorithm as-is. Is that
aligned with what you guys think?
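To make the idea concrete, here is a minimal sketch in plain Python of what I mean by a drop-in re-encoding stage sitting in front of an algorithm that refuses strings. Nothing here is Spark API: `fit_string_index`, `encode`, and `require_numeric` are hypothetical names made up for illustration, and ordering indices by first appearance is just one arbitrary choice.

```python
from typing import Dict, List, Sequence


def fit_string_index(values: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct string an index, in order of first appearance."""
    index: Dict[str, int] = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index


def encode(values: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Re-encode a string-valued categorical column as integer indices."""
    return [index[v] for v in values]


def require_numeric(column: Sequence[object]) -> None:
    """Stand-in for an algorithm's input check: reject non-numeric values."""
    for x in column:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise TypeError(f"numeric input required, got {type(x).__name__}")


colors = ["red", "green", "red", "blue"]
index = fit_string_index(colors)
encoded = encode(colors, index)
require_numeric(encoded)  # passes; require_numeric(colors) would raise
```

The string column still exists, with its own categorical metadata, before the encoding step; only the encoded column ever reaches the algorithm.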
Let me get down to the code to see if this actually matters to the
metadata. Right now nothing about this PR has to do with the underlying data
type. As long as you aren't saying "string" is a type mutually exclusive with
"categorical", then I think we're saying the same thing.
Yes, I think my tree example wasn't a good one; never mind. The algorithm is
already going to choose only actual data values as split points. Hm. It might
matter to algorithms that want to restrict themselves to discrete inputs, like
multinomial naive Bayes? Even there, I don't think it's an example of requiring
the distinction. I'm neutral about adding 'discrete' as a type, I suppose.
Interesting point: are numeric categorical values always assumed to be in
{0, 1, 2, ..., n-1}? That strikes me as something that is often true, but not
always. A particular algorithm may require it, and may provide an optional
transformation to put a feature into this form. It doesn't seem like a
conceptually different type so much as an additional restriction on a
categorical feature's representation. That is, is this maybe an additional bit
of metadata? Hm.
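As a rough illustration of "restriction as metadata", the sketch below (plain Python, not code from this PR) records the distinct-value count alongside a flag saying whether the values are exactly {0, ..., n-1}, plus an optional re-indexing step for algorithms that require that form. `CategoricalMeta`, `describe`, and `reindex` are hypothetical names, and mapping values in sorted order is an assumption, not a requirement.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class CategoricalMeta:
    num_values: int      # number of distinct values in the column
    is_zero_based: bool  # True when the values are exactly {0, ..., n-1}


def describe(values: Sequence[int]) -> CategoricalMeta:
    """Compute metadata; in practice num_values could come from the caller."""
    distinct = set(values)
    n = len(distinct)
    return CategoricalMeta(num_values=n, is_zero_based=(distinct == set(range(n))))


def reindex(values: Sequence[int]) -> Tuple[List[int], Dict[int, int]]:
    """Optional transformation: map distinct values onto 0..n-1 in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


raw = [10, 20, 10, 40]
meta_before = describe(raw)      # three distinct values, not zero-based
reindexed, mapping = reindex(raw)
meta_after = describe(reindexed)  # same count, now zero-based
```

The column is categorical both before and after; only the flag changes, which is why this looks to me like extra metadata rather than a separate type.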
Let me make some code changes to reflect the discussion so far.