Lukas Thaler created SPARK-31299:
------------------------------------

             Summary: Pyspark.ml.clustering illegalArgumentException with 
dataframe created from rows
                 Key: SPARK-31299
                 URL: https://issues.apache.org/jira/browse/SPARK-31299
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 2.4.0
            Reporter: Lukas Thaler


I hope this is the right place and way to report a bug in (at least) the 
PySpark API:

BisectingKMeans is used in the following example only for illustration; the 
error occurs with all clustering algorithms:
{code:java}
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans

data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0])),
])

kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model = kmeans.fit(data)
{code}
The fit() call in the last line fails with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: requirement failed: Column test_features 
must be of type equal to one of the following types: 
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, 
array<double>, array<float>] but was actually of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
{code}
As can be seen, the data type reported to be passed to the function is the 
first data type in the list of allowed data types, yet the call ends in an 
error because of it.
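
One possible reading of this message, which this report does not confirm, is that the check compares full type objects while the message prints only each type's storage layout. The example imports DenseVector from pyspark.mllib.linalg but passes it to an estimator from pyspark.ml, so two distinct vector types with the same layout may be involved. The sketch below is a plain-Python model of that situation, not Spark's actual code; `FakeUDT`, `check_column_type`, and the "ml"/"mllib" labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeUDT:
    """Hypothetical stand-in for a vector user-defined type (not Spark code)."""
    package: str      # where the type is defined, e.g. "ml" or "mllib"
    sql_layout: str   # the struct layout that gets printed in messages

    def catalog_string(self) -> str:
        # Only the storage layout is printed; the package is not shown.
        return self.sql_layout


LAYOUT = "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"
ML_VECTOR = FakeUDT("ml", LAYOUT)        # assumed: what the estimator accepts
MLLIB_VECTOR = FakeUDT("mllib", LAYOUT)  # assumed: what the column holds


def check_column_type(col_name, actual, accepted):
    """Compare by full equality, but describe types by printed layout only."""
    if actual not in accepted:
        names = ", ".join(t.catalog_string() for t in accepted)
        raise ValueError(
            f"Column {col_name} must be of type equal to one of the following "
            f"types: [{names}] but was actually of type "
            f"{actual.catalog_string()}."
        )


try:
    check_column_type("test_features", MLLIB_VECTOR, [ML_VECTOR])
    message = None
except ValueError as err:
    message = str(err)

# The printed message names the same layout as both expected and actual.
print(message)
```

If this reading applies, importing DenseVector from pyspark.ml.linalg instead of pyspark.mllib.linalg would be the first thing to try in the example above.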

See my [StackOverflow 
question|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml]
 for more context.



