Github user dputler commented on the pull request:
https://github.com/apache/spark/pull/7987#issuecomment-132909037
I'm not clear as to how the column ordering is determined. Looking at the
tests, in the case of a categorical interaction it appears to be based on
the order in which unique category values are encountered for a categorical
variable. Specifically, for the numeric/categorical interaction, the last
category encountered ("baz") provides the first values of the interaction,
and the first category encountered ("foo") provides the last values of
the interaction. In contrast, for the interaction between two categorical
variables, the first category of the second underlying categorical variable
(the value zq) is primary in the column ordering (with zq-bar being the
first column). So encounter order is used again, but it runs in the
opposite direction for the two variables. This structure will actually work
fine for model training; however, things get more complicated when predicting
on new data with this model. The approach is basically the same one that
MS/Revolution uses in their Revo ScaleR package (i.e., the order of the
categories depends on when they are first encountered in the data). However,
this turns out to greatly complicate predicting on new data with a Revo ScaleR
model in practice.
Open source R works by first determining all the category labels for each
categorical variable, alphabetically sorting the unique labels for each
categorical variable, and then basing the new feature order on the alphabetical
sort of the category labels, so the order in which a category label is
encountered does not matter. This turns out to make predicting on new data with
an existing model much easier. The cost is that the data needs to be passed
over twice, with the first pass determining the set of unique category labels.
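
To make the difference concrete, here is a minimal Python sketch of the two ordering strategies described above. It is only an illustration: it is not the code in this pull request, nor R's or Revo ScaleR's actual implementation. The helper names (first_seen_levels, sorted_levels, one_hot) and the sample data are made up for the example.

```python
# Illustrative sketch only; helper names and data are hypothetical.

def first_seen_levels(values):
    """Category levels in order of first appearance (one pass over the data)."""
    seen = {}
    for v in values:
        seen.setdefault(v, len(seen))
    return list(seen)  # dicts preserve insertion order


def sorted_levels(values):
    """Category levels in alphabetical order (needs all labels collected first)."""
    return sorted(set(values))


def one_hot(values, levels):
    """Encode each value as an indicator row whose columns follow `levels`."""
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows


# Same set of labels, encountered in a different order in each data set.
train = ["foo", "bar", "baz", "foo"]
new = ["baz", "foo", "bar"]

# First-encounter ordering: the column layout depends on row order, so the
# layout derived from new data differs from the one used in training.
first_seen_train = first_seen_levels(train)
first_seen_new = first_seen_levels(new)

# Alphabetical ordering: the layout is the same for both data sets.
sorted_train = sorted_levels(train)
sorted_new = sorted_levels(new)

encoded_new = one_hot(new, sorted_train)
```

With first-encounter ordering, the training-time level order has to be stored and reused when encoding new data. With alphabetical ordering, the layout can be recomputed from the labels alone, provided the new data contains the same set of labels.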