[
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Parth Gandhi resolved SPARK-26947.
----------------------------------
Resolution: Invalid
> Pyspark KMeans Clustering job fails on large values of k
> --------------------------------------------------------
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib, PySpark
> Affects Versions: 2.4.0
> Reporter: Parth Gandhi
> Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's PySpark job running KMeans clustering
> was failing for large values of k. I was able to reproduce the same issue
> with a dummy dataset. I have attached the code as well as the data to the JIRA.
> The Java stack trace is printed below:
>
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
> at java.lang.StringBuilder.append(StringBuilder.java:202)
> at py4j.Protocol.getOutputCommand(Protocol.java:328)
> at py4j.commands.CallCommand.execute(CallCommand.java:81)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
> response = connection.send_command(command)
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
> File "clustering_app.py", line 154, in <module>
> main(args)
> File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, args.num_clusters_list)
> File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, k, max_iter)
> File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", line 337, in clusterCenters
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", line 55, in _call_java
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", line 109, in _java2py
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
> File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster \
> --conf spark.executor.memory=20g --conf spark.driver.memory=20g \
> --conf spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g \
> --conf spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g \
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x \
> --output_path hdfs:///user/username --num_clusters_list 10000
> {code}
> The input dataset is approximately 90 MB in size, and the heap memory
> assigned to both the driver and the executor is close to 20 GB. The failure
> occurs only for large values of k.
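
The Java trace above runs out of heap inside StringBuilder.append while Py4J builds its reply to the MLSerDe.dumps call, so a rough size estimate of that reply may help explain why only large k fails. The sketch below is a back-of-the-envelope calculation, not a diagnosis from the report. It assumes that Py4J returns byte arrays to Python as base64 text, that the JVM stores strings at 2 bytes per character (as Java 8 does), and that the centers are dense 8-byte doubles with pickle overhead ignored. The function name and the feature dimension used in the example are hypothetical; the report gives only k = 10000.

```python
JAVA_MAX_ARRAY_LENGTH = 2**31 - 1  # Java arrays are indexed by a signed 32-bit int


def estimate_reply_size(k, dim, bytes_per_value=8, bytes_per_char=2):
    """Estimate the size of the reply string holding k cluster centers.

    Hypothetical helper: ignores pickle framing, so real payloads are larger.
    """
    raw_bytes = k * dim * bytes_per_value        # dense doubles for k centers
    base64_chars = 4 * ((raw_bytes + 2) // 3)    # base64: 4 chars per 3 bytes, padded
    string_bytes = base64_chars * bytes_per_char # heap for the character data alone
    return {
        "raw_bytes": raw_bytes,
        "base64_chars": base64_chars,
        "string_bytes": string_bytes,
        "exceeds_java_array_limit": base64_chars > JAVA_MAX_ARRAY_LENGTH,
    }


if __name__ == "__main__":
    # k comes from the report; dim=100_000 is an assumed value for illustration.
    estimate = estimate_reply_size(k=10_000, dim=100_000)
    for name, value in estimate.items():
        print(name, value)
```

If the estimate approaches the driver heap or the Java array length limit, a larger spark.driver.memory alone may not help, because the whole reply is assembled as a single string before it is sent to Python.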