GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/4460#issuecomment-75304366
> Don't you always have to look at the data to determine how many unique
values a column has, regardless of type?
Not if we already have ML attributes saved together with the data or defined
by users.
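For illustration, a rough sketch of how such an attribute could ride along
with the data as column metadata (the key names here are made up, not
necessarily the layout proposed in this PR):

```scala
import org.apache.spark.sql.types._

// Hypothetical: mark "color" as categorical with 3 known values. With this
// stored alongside the data, no pass over the data is needed to count them.
val colorMeta = new MetadataBuilder()
  .putString("type", "nominal")   // assumed key names, for illustration only
  .putLong("numValues", 3L)
  .build()

val schema = StructType(Seq(
  StructField("color", DoubleType, nullable = false, metadata = colorMeta),
  StructField("height", DoubleType, nullable = false)))
```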
> String and int are encodings, but attribute types like categorical and
continuous are interpretations. Those seem orthogonal to me and I thought
Attribute was only metadata representing the attribute type, whereas the RDD
schema already knows the actual column data types.
Conceptually, this is true. But adding restrictions would simplify our
implementation. The restriction I proposed is that data stored in columns with
ML attributes must be plain values (Float/Double/Int/Long/Boolean/Vector), so
algorithms and transformers don't need to handle special types. Consider a
vector assembler that merges multiple columns into a vector column. If it had
to handle string columns, it would need to call some indexer to turn strings
into indices before merging them, and that piece of code would probably appear
in every algorithm and its unit tests. If we force users to turn string columns
into numeric ones first, the implementation of the rest of the pipeline can be
simplified.
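As a minimal sketch (not the actual assembler, just the idea under that
restriction), the merging logic stays a flat pattern match:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Sketch of an assembler core under the proposed restriction: every cell is
// already a number, boolean, or Vector, so no indexing logic is needed here.
def assemble(cells: Seq[Any]): Vector = {
  val values = cells.flatMap {
    case d: Double  => Seq(d)
    case f: Float   => Seq(f.toDouble)
    case i: Int     => Seq(i.toDouble)
    case l: Long    => Seq(l.toDouble)
    case b: Boolean => Seq(if (b) 1.0 else 0.0)
    case v: Vector  => v.toArray.toSeq
    // deliberately no string case: strings must be indexed before assembly
  }
  Vectors.dense(values.toArray)
}
```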
> From a user perspective, I'd be surprised if I had to encode categorical
string values as it seems like something a framework can easily do for me.
There's nothing inherently strange about providing a string-valued column that
is (necessarily) categorical, and in fact they often are strings in input. But
there's no reason it couldn't encode the categorical values internally in a
different way if needed. Are you just referring to the latter?

Sure, anything can be done within the framework.
scikit-learn has a clear separation of string values and numeric values.
All string values must be encoded into categorical columns through transformers
before calling ML algorithms, and all ML algorithms take a matrix `X` and a
vector `y`. That didn't surprise users much (hopefully). In MLlib, we will
provide transformers that turn strings into categorical columns in various ways.
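As a toy illustration of what such a transformer does (this is not MLlib code,
just the idea, similar to scikit-learn's `LabelEncoder`):

```scala
// Toy string indexer: map each distinct string label to a double-valued
// categorical index.
def fitIndex(values: Seq[String]): Map[String, Double] =
  values.distinct.sorted.zipWithIndex.map { case (s, i) => s -> i.toDouble }.toMap

val colors  = Seq("red", "green", "blue", "green")
val index   = fitIndex(colors)    // Map(blue -> 0.0, green -> 1.0, red -> 2.0)
val encoded = colors.map(index)   // Seq(2.0, 1.0, 0.0, 1.0)
```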
> I agree that discrete and ordinal don't come up. I don't think they're
required as types, but may allow optimizations. For example, there's no point
in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature.
They're all the same rule. That optimization doesn't exist yet. I can't
actually think of a realistic optimization that would depend on knowing a value
is ordinal (1, 2, 3, ...). I'd drop that, maybe.
For trees, if a feature is integer-valued and there is a split `> 3.4`, then
there won't be other splits between 3 and 4 because all points are already
separated. It looks okay to me that we have a split `> 3.4` while all values
are integers. We can definitely add this attribute back if it becomes
necessary.
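A quick check of that equivalence, just for illustration:

```scala
// For an integer-valued feature, any threshold strictly between 3 and 4
// induces the same partition, so rules like > 3.4, > 3.5, > 3.6 are one rule.
val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
val partitions = Seq(3.4, 3.5, 3.6).map(t => xs.partition(_ > t))
assert(partitions.distinct.size == 1) // each separates {1, 2, 3} from {4, 5}
```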