GitHub user dputler commented on the pull request:

    https://github.com/apache/spark/pull/7987#issuecomment-132909037
  
    I'm not clear as to how the ordering of the interaction columns is 
determined. Looking at the tests, in the case of a categorical interaction it 
appears to be based on the order in which unique category values are 
encountered for a categorical variable. Specifically, for the 
numeric/categorical interaction, the last category encountered ("baz") 
provides the first values of the interaction, and the first category 
encountered ("foo") provides the last values. In contrast, for the interaction 
between two categorical variables, the first category of the second underlying 
categorical variable (the value zq) is primary in the column ordering (with 
zq-bar being the first column). So encounter order is used again, but it runs 
in opposite directions for the two variables.

This structure will work fine for model training; however, things get more 
complicated when predicting new data with the model. The approach is basically 
the same one MS/Revolution uses in their Revo ScaleR package (i.e., the order 
of the categories depends on when they are first encountered in the data), and 
in practice it greatly complicates predicting new data with a Revo ScaleR 
model.

Open source R works by first determining all the category labels for each 
categorical variable, sorting the unique labels for each variable 
alphabetically, and then basing the feature order on that alphabetical sort, 
so the order in which a category label is encountered does not matter. This 
makes predicting new data with an existing model much easier. The cost is that 
the data needs to be passed over twice, with the first pass determining the 
set of unique category labels.
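
To make the difference concrete, here is a minimal sketch in plain Python. It 
is not the code in this pull request, nor how R or Revo ScaleR implement 
things. The helper names and the row data are hypothetical; only the labels 
"foo", "bar" and "baz" come from the tests. It shows that the encounter-order 
level list changes with row order, while the sorted level list does not.

```python
def encounter_order_levels(values):
    """Levels in the order they are first seen (single pass over the data)."""
    seen = {}
    for v in values:
        seen.setdefault(v, len(seen))
    return list(seen)  # dicts preserve insertion order (Python 3.7+)


def sorted_levels(values):
    """Levels sorted alphabetically (needs a pass to collect the unique labels)."""
    return sorted(set(values))


def one_hot(values, levels):
    """Encode each value as an indicator row, one column per level."""
    index = {lvl: i for i, lvl in enumerate(levels)}
    return [[1.0 if index[v] == j else 0.0 for j in range(len(levels))]
            for v in values]


# Hypothetical training and scoring data with the same labels, different row order.
train = ["foo", "bar", "baz", "foo"]
new = ["baz", "foo", "bar"]

print(encounter_order_levels(train))  # ['foo', 'bar', 'baz']
print(encounter_order_levels(new))    # ['baz', 'foo', 'bar']
print(sorted_levels(train))           # ['bar', 'baz', 'foo']
print(sorted_levels(new))             # ['bar', 'baz', 'foo']
```

With encounter order, the column layout has to be stored with the model and 
reapplied at prediction time. With sorted labels, the layout can be recovered 
from the label set alone, at the cost of the extra pass.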

