[
https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anders Rydbirk updated SPARK-36553:
-----------------------------------
Description:
We are running KMeans on approximately 350M rows of x, y, z coordinates using
the following configuration:
{code:java}
KMeans(
featuresCol='features',
predictionCol='centroid_id',
k=50000,
initMode='k-means||',
initSteps=2,
tol=0.00005,
maxIter=20,
seed=SEED,
distanceMeasure='euclidean'
)
{code}
When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are
consistently getting errors unless we reduce K.
Stacktrace:
{code:java}
An error occurred while calling o167.fit.An error occurred while calling
o167.fit.: java.lang.NegativeArraySizeException: -897458648 at
scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at
scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at
scala.Array$.ofDim(Array.scala:221) at
org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
at
org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) at
org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213) at
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
Source) at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at
py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.base/java.lang.Thread.run(Unknown Source)
{code}
The issue is introduced by
[#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
which significantly reduces the maximum value of K. Snippit of line that
throws error from [DistanceMeasure.scala:|#L52]]
{code:java}
val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
{code}
*What we have tried:*
* Reducing iterations
* Reducing input volume
* Reducing K
Only reducing K have yielded success.
*What we don't understand*:
Given the line of code above, we do not understand why we would get an integer
overflow:
For K=50,000, packedValues should be allocated with the size of 1,250,025,000 <
(2^31) and not result in a negative array size.
Please let me know if more information is needed, this is my first time raising
a bug for a OS.
was:
We are running KMeans on approximately 350M rows of x, y, z coordinates using
the following configuration:
{code:java}
KMeans(
featuresCol='features',
predictionCol='centroid_id',
k=50000,
initMode='k-means||',
initSteps=2,
tol=0.00005,
maxIter=20,
seed=SEED,
distanceMeasure='euclidean'
)
{code}
When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are
consistently getting errors unless we reduce K.
Stacktrace:
{code:java}
An error occurred while calling o167.fit.An error occurred while calling
o167.fit.: java.lang.NegativeArraySizeException: -897458648 at
scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at
scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at
scala.Array$.ofDim(Array.scala:221) at
org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
at
org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) at
org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at
org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213) at
org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
Source) at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at
py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.base/java.lang.Thread.run(Unknown Source)
{code}
The issue is introduced by
[#27758|[https://github.com/apache/spark/pull/27758/files#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
which significantly reduces the maximum value of K:
[DistanceMeasure.scala|[https://github.com/zhengruifeng/spark/blob/d31d488e0e48a82fd5b43c406f07b8c7d27dd53c/mllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala#L52]]
{code:java}
val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
{code}
*What we have tried:*
* Reducing iterations
* Reducing input volume
* Reducing K
Only reducing K have yielded success.
*What we don't understand*:
**Given the line of code above, we do not understand why we would get an
integer overflow:
For K=50,000, packedValues should be allocated with the size of 1,250,025,000 <
(2^31) and not result in a negative array size.
Please let me know if more information is needed, this is my first time raising
a bug for a OS.
> KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758
> was introduced
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-36553
> URL: https://issues.apache.org/jira/browse/SPARK-36553
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib, PySpark
> Affects Versions: 3.1.1
> Reporter: Anders Rydbirk
> Priority: Major
>
> We are running KMeans on approximately 350M rows of x, y, z coordinates using
> the following configuration:
> {code:java}
> KMeans(
> featuresCol='features',
> predictionCol='centroid_id',
> k=50000,
> initMode='k-means||',
> initSteps=2,
> tol=0.00005,
> maxIter=20,
> seed=SEED,
> distanceMeasure='euclidean'
> )
> {code}
> When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are
> consistently getting errors unless we reduce K.
> Stacktrace:
>
> {code:java}
> An error occurred while calling o167.fit.An error occurred while calling
> o167.fit.: java.lang.NegativeArraySizeException: -897458648 at
> scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at
> scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at
> scala.Array$.ofDim(Array.scala:221) at
> org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
> at
> org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
> at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
> at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
> at scala.util.Try$.apply(Try.scala:213) at
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
> at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method) at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
> Source) at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
> Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at
> py4j.Gateway.invoke(Gateway.java:282) at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at
> py4j.commands.CallCommand.execute(CallCommand.java:79) at
> py4j.GatewayConnection.run(GatewayConnection.java:238) at
> java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> The issue is introduced by
> [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
> which significantly reduces the maximum value of K. Snippit of line that
> throws error from [DistanceMeasure.scala:|#L52]]
> {code:java}
> val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
> {code}
>
> *What we have tried:*
> * Reducing iterations
> * Reducing input volume
> * Reducing K
> Only reducing K have yielded success.
>
> *What we don't understand*:
> Given the line of code above, we do not understand why we would get an
> integer overflow:
> For K=50,000, packedValues should be allocated with the size of 1,250,025,000
> < (2^31) and not result in a negative array size.
> Please let me know if more information is needed, this is my first time
> raising a bug for a OS.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]