Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/19892#discussion_r161683714
--- Diff: python/pyspark/ml/feature.py ---
@@ -347,6 +353,28 @@ class Bucketizer(JavaTransformer, HasInputCol, HasOutputCol, HasHandleInvalid,
>>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
>>> len(bucketed)
4
+ >>> bucketizer2 = Bucketizer(splitsArray=
+ ... [[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.5, float("inf")]],
+ ... inputCols=["values", "numbers"], outputCols=["buckets1", "buckets2"])
+ >>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df).collect()
+ >>> len(bucketed2)
+ 6
+ >>> bucketed2[0].buckets1
--- End diff --
Perhaps it would be cleaner to do a `df.show()` here rather than collecting
the rows and checking individual fields? Likewise, the `bucketed` part of the
doctest above could be changed in the same way.
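
To make the expected bucket indices concrete, here is a rough pure-Python sketch of how values map to buckets for the two `splitsArray` entries in the diff. The helper `bucket_index` and the sample rows are hypothetical and used only for illustration. This is not Spark's implementation, and it assumes that with `handleInvalid="keep"` NaN values go into one extra bucket after the regular ones.

```python
import math
from bisect import bisect_right

# The two split arrays from the doctest in the diff.
splits1 = [-float("inf"), 0.5, 1.4, float("inf")]
splits2 = [-float("inf"), 0.5, float("inf")]


def bucket_index(value, splits, handle_invalid="keep"):
    """Hypothetical stand-in for bucket assignment (not Spark's code).

    Buckets are [splits[i], splits[i + 1]), except the last bucket,
    which also includes its upper bound.
    """
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if handle_invalid == "keep":
            # Assumption: NaN is kept in one additional bucket.
            return float(n_buckets)
        if handle_invalid == "skip":
            # Assumption: the caller drops rows that map to None.
            return None
        raise ValueError("NaN value encountered")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value %r is outside the splits" % (value,))
    if value == splits[-1]:
        return float(n_buckets - 1)
    return float(bisect_right(splits, value) - 1)


if __name__ == "__main__":
    # Hypothetical sample rows for the "values" and "numbers" columns.
    rows = [(0.1, 0.0), (1.2, 1.3), (1.5, float("nan"))]
    for values, numbers in rows:
        print(values, numbers,
              bucket_index(values, splits1),
              bucket_index(numbers, splits2))
```

A `show()`-based doctest would display this kind of per-row mapping directly in the docstring, which is easier to read than a row count followed by single-field lookups.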
---