ozancicek commented on a change in pull request #24939: [SPARK-18569][ML][R] Support RFormula arithmetic, I() and spark functions
URL: https://github.com/apache/spark/pull/24939#discussion_r303856796
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
 ##########
 @@ -614,3 +652,80 @@ private object VectorAttributeRewriter extends MLReadable[VectorAttributeRewrite
     }
   }
 }
+
+/**
+ * Utility transformer for adding expressions to dataframe using `expr` spark function
+ *
+ * @param exprsToSelect set of string expressions to be added as a column to the dataframe.
+ *                      The name of the columns will be identical to the expression
+ */
+private class ExprSelector(
 
 Review comment:
   As you suspect, the extra hidden stage isn't essential here. I only added
   it to reduce coupling between the `RFormula` and `RFormulaModel` classes;
   the same thing can be done without it.
   
   Roughly, this is what the `RFormula` and `RFormulaModel` classes do:
   ```scala
   class RFormula(..., formula) {
     def fit(df) = {
       val parsedFormula = parse(formula)
       var stages = ArrayBuffer()

       val featureColumns = parsedFormula.terms.map {
         ...
         stages += OneHotEncoder()
         ...
       }
       stages += VectorAssembler(featureColumns)
       val pipeline = Pipeline(stages)
       RFormulaModel(parsedFormula, pipeline)
     }
   }

   class RFormulaModel(parsedFormula, pipeline) {
     def transform(df) = {
       val withFeatures = pipeline.transform(df)
       transformLabel(withFeatures)
     }
   }
   ```
   For `VectorAssembler` to assemble arithmetic expressions into the feature
   column, the dataframe transformed by `RFormulaModel` needs to contain those
   expressions as columns. One option is to add these transformations inside
   the `RFormulaModel.transform` method; another is to add a pipeline stage
   inside the `RFormula.fit` method. Since all feature-column transformations
   are already set up in `RFormula.fit`, I chose to add a pipeline stage there.
   
   But indeed, a transformer that only executes a couple of `expr` functions
   could be too much. If you think the extra stage is unnecessary, let me know
   and I'll move its transformations to the `RFormulaModel` class.
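
   For reference, the column-adding step itself is small either way. Below is
   a minimal sketch of it, not the code in this PR: `ExprColumnsSketch` and
   `addExprColumns` are hypothetical names, and the sample data and expressions
   are made up. It only relies on `functions.expr` and `DataFrame.withColumn`,
   and it names each new column after its expression string, as the
   `ExprSelector` doc comment describes.

   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}
   import org.apache.spark.sql.functions.expr

   object ExprColumnsSketch {
     // Hypothetical helper: adds one column per expression string,
     // using the expression text itself as the column name.
     def addExprColumns(df: DataFrame, exprsToSelect: Seq[String]): DataFrame =
       exprsToSelect.foldLeft(df) { (acc, e) => acc.withColumn(e, expr(e)) }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("expr-columns-sketch")
         .getOrCreate()
       import spark.implicits._

       // Made-up input with two numeric columns.
       val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("a", "b")

       // Adds the columns "a + b" and "log(a)" next to "a" and "b".
       val out = addExprColumns(df, Seq("a + b", "log(a)"))
       out.show()

       spark.stop()
     }
   }
   ```

   The same helper could be called from a hidden pipeline stage or directly
   from `RFormulaModel.transform`, so the choice between the two is mostly
   about where the coupling lives.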
    
