Barry Becker created SPARK-20071: ------------------------------------ Summary: StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values Key: SPARK-20071 URL: https://issues.apache.org/jira/browse/SPARK-20071 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Barry Becker Priority: Minor
I marked this as minor because there are workarounds. I have a 2 million row dataset with a string column that is mostly unique and contains many very long values. Most of the values are between 1,000 and 40,0000 characters long. I am using Kryoserializer and increased the spark.kryoserializer.buffer.max to 256m. If I try to run StringIndexer.fit on this column, I will get an OutOfMemory exception or more likely a Buffer overflow error like {code} org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 23. To avoid this, increase spark.kryoserializer.buffer.max value.org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) {code} This result is not that surprising given that we are trying to index a column like this, however, I can think of some suggestions that would help avoid the error and maybe help performance. These possible enhancements to StringIndexer might be hard, but I thought I would suggest them anyway, just in case they are not. 1) Add param for Top N values. I know that StringIndexer gives lower indices to the more commonly occurring values. It would be great if one could specify that I only want to index the top N values and long everything else into a special "Other" value. 2) Add param for label length limit. Only consider the first L characters of labels when doing the indexing. Either of these enhancements would work, but I suppose they can also be implemented with additional work as steps preceding the indexer in the pipeline. Perhaps topByKey could be used to replace the column with one that has the top values and "Other" as suggesed in 1). -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org