Barry Becker created SPARK-20071:
------------------------------------
Summary: StringIndexer overflows Kryo serialization buffer when
run on column with many long distinct values
Key: SPARK-20071
URL: https://issues.apache.org/jira/browse/SPARK-20071
Project: Spark
Issue Type: Bug
Components: ML
Affects Versions: 2.1.0
Reporter: Barry Becker
Priority: Minor
I marked this as minor because there are workarounds.
I have a 2 million row dataset with a string column that is mostly unique and
contains many very long values.
Most of the values are between 1,000 and 40,0000 characters long.
I am using Kryoserializer and increased the spark.kryoserializer.buffer.max to
256m.
If I try to run StringIndexer.fit on this column, I will get an OutOfMemory
exception or more likely a Buffer overflow error like
{code}
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow.
Available: 0, required: 23.
To avoid this, increase spark.kryoserializer.buffer.max
value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{code}
This result is not that surprising given that we are trying to index a column
like this, however, I can think of some suggestions that would help avoid the
error and maybe help performance.
These possible enhancements to StringIndexer might be hard, but I thought I
would suggest them anyway, just in case they are not.
1) Add param for Top N values. I know that StringIndexer gives lower indices to
the more commonly occurring values. It would be great if one could specify that
I only want to index the top N values and long everything else into a special
"Other" value.
2) Add param for label length limit. Only consider the first L characters of
labels when doing the indexing.
Either of these enhancements would work, but I suppose they can also be
implemented with additional work as steps preceding the indexer in the
pipeline. Perhaps topByKey could be used to replace the column with one that
has the top values and "Other" as suggesed in 1).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]