HyukjinKwon commented on code in PR #39628:
URL: https://github.com/apache/spark/pull/39628#discussion_r1082211081
##########
python/pyspark/ml/functions.py:
##########
@@ -647,37 +386,369 @@ def predict_columnar(x1: np.ndarray, x2: np.ndarray) -> Mapping[str, np.ndarray]
         Function which is responsible for loading a model and returning a
         :py:class:`PredictBatchFunction` which takes one or more numpy arrays as input and returns
         one of the following:
-        - a numpy array (for a single output)
-        - a dictionary of named numpy arrays (for multiple outputs)
-        - a row-oriented list of dictionaries (for multiple outputs).
+
+        * a numpy array (for a single output)
+        * a dictionary of named numpy arrays (for multiple outputs)
+        * a row-oriented list of dictionaries (for multiple outputs).
+
         For a dictionary of named numpy arrays, the arrays can only be one or two dimensional, since
-        higher dimension arrays are not supported. For a row-oriented list of dictionaries, each
+        higher dimensional arrays are not supported. For a row-oriented list of dictionaries, each
         element in the dictionary must be either a scalar or one-dimensional array.
-    return_type : :class:`pspark.sql.types.DataType` or str.
+    return_type : :py:class:`pyspark.sql.types.DataType` or str.
         Spark SQL datatype for the expected output:
-        - Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
-        - ArrayType --> 2-dim numpy array.
-        - StructType --> dict with keys matching struct fields.
-        - StructType --> list of dict with keys matching struct fields, for models like the
-          [Huggingface pipeline for sentiment analysis](https://huggingface.co/docs/transformers/quicktour#pipeline-usage]  # noqa: E501
+
+        * Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
+        * ArrayType --> 2-dim numpy array.
+        * StructType --> dict with keys matching struct fields.
+        * StructType --> list of dict with keys matching struct fields, for models like the
+          `Huggingface pipeline for sentiment analysis
+          <https://huggingface.co/docs/transformers/quicktour#pipeline-usage>`_.
+
     batch_size : int
-        Batch size to use for inference, note that this is typically a limitation of the model
-        and/or the hardware resources and is usually smaller than the Spark partition size.
-    input_tensor_shapes: List[List[int] | None] | Mapping[int, List[int]] | None
-        Optional input tensor shapes for models with tensor inputs. This can be a list of shapes,
+        Batch size to use for inference. This is typically a limitation of the model
+        and/or available hardware resources and is usually smaller than the Spark partition size.
+    input_tensor_shapes : List[List[int] | None] | Mapping[int, List[int]] | None, optional
+        Input tensor shapes for models with tensor inputs. This can be a list of shapes,
         where each shape is a list of integers or None (for scalar inputs). Alternatively, this
         can be represented by a "sparse" dictionary, where the keys are the integer indices of the
         inputs, and the values are the shapes. Each tensor input value in the Spark DataFrame must
         be represented as a single column containing a flattened 1-D array. The provided
-        input_tensor_shapes will be used to reshape the flattened array into expected tensor shape.
-        For the list form, the order of the tensor shapes must match the order of the selected
-        DataFrame columns. The batch dimension (typically -1 or None in the first dimension) should
-        not be included, since it will be determined by the batch_size argument. Tabular datasets
-        with scalar-valued columns should not provide this argument.
+        `input_tensor_shapes` will be used to reshape the flattened array into the expected tensor
+        shape. For the list form, the order of the tensor shapes must match the order of the
+        selected DataFrame columns. The batch dimension (typically -1 or None in the first
+        dimension) should not be included, since it will be determined by the batch_size argument.
+        Tabular datasets with scalar-valued columns should not provide this argument.
     Returns
     -------
-        A pandas_udf for predicting a batch.
+    :py:class:`UserDefinedFunctionLike`
+        A Pandas UDF for model inference on a Spark DataFrame.
+
+    Examples
+    --------
+    For a pre-trained TensorFlow MNIST model with two-dimensional input images represented as a
+    flattened tensor value stored in a single Spark DataFrame column of type `array<float>`.::
+
+        from pyspark.ml.functions import predict_batch_udf
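The docstring example above is cut off at the hunk boundary. For orientation, a minimal sketch of a complete invocation for the MNIST case it describes might look like the following; the model path, the `df` DataFrame, and the "data" column name are illustrative assumptions, not part of the PR:

    import numpy as np
    from pyspark.ml.functions import predict_batch_udf
    from pyspark.sql.types import ArrayType, FloatType

    def make_mnist_fn():
        # Runs once per Python worker, so the model is loaded once rather
        # than per batch ("/tmp/mnist_model" is a placeholder path).
        import tensorflow as tf
        model = tf.keras.models.load_model("/tmp/mnist_model")

        def predict(inputs: np.ndarray) -> np.ndarray:
            # `inputs` arrives as shape (batch_size, 784): the flattened
            # array<float> column, reshaped per input_tensor_shapes.
            return model.predict(inputs)

        return predict

    mnist_udf = predict_batch_udf(
        make_mnist_fn,
        return_type=ArrayType(FloatType()),  # 2-dim numpy output --> ArrayType
        batch_size=100,
        input_tensor_shapes=[[784]],  # batch dimension omitted, per the docstring
    )

    # `df` is a hypothetical DataFrame with a "data" column of type array<float>.
    preds = df.withColumn("preds", mnist_udf("data"))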
Review Comment:
   If this example cannot be run, it would be better to mark it with `.. code-block:: python`.
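For reference, the suggested directive form would look roughly like this in the docstring (snippet body abbreviated to the line quoted above):

    .. code-block:: python

        from pyspark.ml.functions import predict_batch_udf

Unlike the bare `::` literal block, the explicit directive renders the snippet with Python highlighting without implying it can be executed as-is.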