[jira] [Updated] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

Sean Owen (JIRA) Thu, 23 Mar 2017 07:21:33 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Owen updated SPARK-20071:
------------------------------
    Issue Type: Improvement  (was: Bug)

Not a bug, right?
You can effect some of this yourself with preprocessing. Yes it's manual but it 
means you have full flexibility. Putting it in StringIndexer may not actually 
simplify things.

> StringIndexer overflows Kryo serialization buffer when run on column with 
> many long distinct values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20071
>                 URL: https://issues.apache.org/jira/browse/SPARK-20071
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>            Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and 
> contains many very long values.
> Most of the values are between 1,000 and 40,0000 characters long.
> I am using Kryoserializer and increased the  spark.kryoserializer.buffer.max 
> to 256m. 
> If I try to run StringIndexer.fit on this column, I will get an OutOfMemory 
> exception or more likely a Buffer overflow error like 
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
> Available: 0, required: 23. 
> To avoid this, increase spark.kryoserializer.buffer.max 
> value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>  
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column 
> like this, however, I can think of some suggestions that would help avoid the 
> error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I 
> would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices 
> to the more commonly occurring values. It would be great if one could specify 
> that I only want to index the top N values and long everything else into a 
> special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of 
> labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be 
> implemented with additional work as steps preceding the indexer in the 
> pipeline. Perhaps topByKey could be used to replace the column with one that 
> has the top values and "Other" as suggesed in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-20071) StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values

Reply via email to