[
https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-20071:
------------------------------
Issue Type: Improvement (was: Bug)
Not a bug, right?
You can effect some of this yourself with preprocessing. Yes it's manual but it
means you have full flexibility. Putting it in StringIndexer may not actually
simplify things.
> StringIndexer overflows Kryo serialization buffer when run on column with
> many long distinct values
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-20071
> URL: https://issues.apache.org/jira/browse/SPARK-20071
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Barry Becker
> Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and
> contains many very long values.
> Most of the values are between 1,000 and 40,0000 characters long.
> I am using Kryoserializer and increased the spark.kryoserializer.buffer.max
> to 256m.
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory
> exception or more likely a Buffer overflow error like
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow.
> Available: 0, required: 23.
> To avoid this, increase spark.kryoserializer.buffer.max
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column
> like this, however, I can think of some suggestions that would help avoid the
> error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I
> would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices
> to the more commonly occurring values. It would be great if one could specify
> that I only want to index the top N values and long everything else into a
> special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of
> labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be
> implemented with additional work as steps preceding the indexer in the
> pipeline. Perhaps topByKey could be used to replace the column with one that
> has the top values and "Other" as suggesed in 1).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]