[
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-20339.
-------------------------------
Resolution: Invalid
(No need to paste that much redundant code.)
If it's a question, it should go to [email protected].
For such a huge sequence of generated columns, you are probably much better off
constructing a Row directly in a single transformation instead of calling
withColumn hundreds of times. Otherwise, disable code generation.
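
A minimal sketch of the "one go" idea, in plain Java with no Spark dependency:
all replacements are compiled into a single pattern and applied to a value in
one pass. The class name ManufacturerNormalizer and its methods are
hypothetical, not part of Spark. Two assumptions differ from the reported code:
keys are matched as literal text rather than as regular expressions, and a
replacement is never re-scanned by later keys.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: applies every alternate-name replacement in one pass.
class ManufacturerNormalizer implements Serializable {
    private final Map<String, String> alternates;
    private final Pattern pattern; // null when there is nothing to replace

    ManufacturerNormalizer(Map<String, String> alternates) {
        this.alternates = new LinkedHashMap<>(alternates);
        if (this.alternates.isEmpty()) {
            this.pattern = null;
        } else {
            // Assumption: keys are literal strings, so each one is quoted.
            // Longer keys go first so they win over any shorter key that
            // happens to be a prefix of them.
            String alternation = this.alternates.keySet().stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
            this.pattern = Pattern.compile(alternation);
        }
    }

    // Returns the input with every known name replaced by its alternate.
    String normalize(String source) {
        if (source == null || pattern == null) {
            return source;
        }
        Matcher matcher = pattern.matcher(source);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = alternates.get(matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

A function like normalize could then be called once per row, for example from
a single map over the Dataset or from one registered UDF1<String, String>, so
the query plan holds one expression instead of one withColumn per name. That
wiring is not shown here and would need to be checked against the Spark version
in use.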
> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
> Key: SPARK-20339
> URL: https://issues.apache.org/jira/browse/SPARK-20339
> Project: Spark
> Issue Type: Question
> Components: Java API, Spark Core, SQL
> Affects Versions: 2.1.0
> Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator"
> grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark JIRA:
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program, which replaces each
> manufacturer name with its equivalent alternate name.
> The issues occur only when there is a huge number of alternate names to
> replace; with a small number of replacements it works without problems.
> dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this
> problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Iterate over all keys in the hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent =
> sqlContext.load("com.databricks.spark.csv", options);
>
>
> while(names.hasMoreElements()) {
> str = (String) names.nextElement();
>
> dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}