[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

hhbyyh Thu, 16 Mar 2017 11:06:07 -0700

Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17316#discussion_r106490851
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -871,6 +872,164 @@ def idf(self):
     
     
     @inherit_doc
    +class Imputer(JavaEstimator, HasInputCols, JavaMLReadable, JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Imputation estimator for completing missing values, either using the 
mean or the median
    +    of the column in which the missing values are located. The input 
column should be of
    +    DoubleType or FloatType. Currently Imputer does not support 
categorical features and
    +    possibly creates incorrect values for a categorical feature.
    +
    +    Note that the mean/median value is computed after filtering out 
missing values.
    +    All Null values in the input column are treated as missing, and so are 
also imputed. For
    +    computing median, :py:meth:`approxQuantile` is used with a relative 
error of 0.001.
    +
    +    >>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, 
float("nan")), (float("nan"), 3.0),
    +    ...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
    +    >>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", 
"out_b"])
    +    >>> model = imputer.fit(df)
    +    >>> model.surrogateDF.show()
    +    +---+---+
    +    |  a|  b|
    +    +---+---+
    +    |3.0|4.0|
    +    +---+---+
    +    ...
    +    >>> model.transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  1.0|  4.0|
    +    |2.0|NaN|  2.0|  4.0|
    +    |NaN|3.0|  3.0|  3.0|
    +    ...
    +    >>> 
imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
    +    +---+---+-----+-----+
    +    |  a|  b|out_a|out_b|
    +    +---+---+-----+-----+
    +    |1.0|NaN|  4.0|  NaN|
    +    ...
    +    >>> imputerPath = temp_path + "/imputer"
    +    >>> imputer.save(imputerPath)
    +    >>> loadedImputer = Imputer.load(imputerPath)
    +    >>> loadedImputer.getStrategy() == imputer.getStrategy()
    +    True
    +    >>> loadedImputer.getMissingValue()
    +    1.0
    +    >>> modelPath = temp_path + "/imputer-model"
    +    >>> model.save(modelPath)
    +    >>> loadedModel = ImputerModel.load(modelPath)
    +    >>> loadedModel.transform(df).head().out_a == 
model.transform(df).head().out_a
    +    True
    +
    +    .. versionadded:: 2.2.0
    +    """
    +
    +    outputCols = Param(Params._dummy(), "outputCols",
    +                       "output column names.", 
typeConverter=TypeConverters.toListString)
    +
    +    strategy = Param(Params._dummy(), "strategy",
    +                     "strategy for imputation. If mean, then replace 
missing values using the mean "
    +                     "value of the feature. If median, then replace 
missing values using the "
    +                     "median value of the feature.",
    +                     typeConverter=TypeConverters.toString)
    +
    +    missingValue = Param(Params._dummy(), "missingValue",
    +                         "The placeholder for the missing values. All 
occurrences of missingValue "
    +                         "will be imputed.", 
typeConverter=TypeConverters.toFloat)
    +
    +    @keyword_only
    +    def __init__(self, strategy="mean", missingValue=float("nan"), 
inputCols=None,
    +                 outputCols=None):
    +        """
    +        __init__(self, strategy="mean", missingValue=float("nan"), 
inputCols=None, \
    +                 outputCols=None):
    +        """
    +        super(Imputer, self).__init__()
    +        self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.Imputer", self.uid)
    +        self._setDefault(strategy="mean", missingValue=float("nan"))
    +        kwargs = self._input_kwargs
    +        self.setParams(**kwargs)
    +
    +    @keyword_only
    +    @since("2.2.0")
    +    def setParams(self, strategy="mean", missingValue=float("nan"), 
inputCols=None,
    +                  outputCols=None):
    +        """
    +        setParams(self, strategy="mean", missingValue=float("nan"), 
inputCols=None, \
    +                  outputCols=None)
    +        Sets params for this Imputer.
    +        """
    +        kwargs = self._input_kwargs
    +        return self._set(**kwargs)
    +
    +    @since("2.2.0")
    +    def setOutputCols(self, value):
    +        """
    +        Sets the value of :py:attr:`outputCols`.
    +        """
    +        return self._set(outputCols=value)
    +
    +    @since("2.2.0")
    +    def getOutputCols(self):
    +        """
    +        Gets the value of :py:attr:`outputCols` or its default value.
    +        """
    +        return self.getOrDefault(self.outputCols)
    --- End diff --
    
    This reminds me we should add 
    ```
        require(get(inputCols).isDefined, "Input cols must be defined first.")
        require(get(outputCol).isDefined, "Output col must be defined first.")
    ```
    in transformschema



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #17316: [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark

Reply via email to