[jira] [Resolved] (SPARK-20339) Issue in regex_replace in Apache Spark Java

Sean Owen (JIRA) Sat, 15 Apr 2017 05:24:08 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Owen resolved SPARK-20339.
-------------------------------
    Resolution: Invalid

(No need to paste that much redundant code.)
If it's a question it should go to [email protected].
For such a huge sequence of generated columns you are probably much better off 
constructing a Row directly in a transformation in one go instead of calling 
withColumn hundreds of times. Or else disable code gen.
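
As a rough illustration of the "in one go" idea above, the sketch below applies 
every replacement inside one plain Java function, so that a single map 
transformation or UDF call could stand in for the long chain of withColumn 
calls. The class name ManufacturerNormalizer and the method normalize are 
hypothetical and do not come from the report. The replacements run 
sequentially with the keys treated as regular expressions, which is assumed to 
match what the loop in the reported program does.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Spark or of the reported program):
// applies all name replacements to one value in a single function call.
class ManufacturerNormalizer implements Serializable {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> replacements = new ArrayList<>();

    // Keys are compiled as regular expressions, as regexp_replace would treat
    // them, so a key such as "H.K. Porter" matches any character at each dot.
    ManufacturerNormalizer(Map<String, String> alternateNames) {
        for (Map.Entry<String, String> entry : alternateNames.entrySet()) {
            patterns.add(Pattern.compile(entry.getKey()));
            replacements.add(entry.getValue());
        }
    }

    // Runs every replacement in the map's iteration order; each step sees the
    // output of the previous one, like the chained withColumn calls.
    String normalize(String source) {
        if (source == null) {
            return null;
        }
        String result = source;
        for (int i = 0; i < patterns.size(); i++) {
            result = patterns.get(i).matcher(result).replaceAll(replacements.get(i));
        }
        return result;
    }
}
```

In a Spark job this function would presumably be wrapped in one UDF or called 
from one map over the rows, so the query plan carries a single expression 
rather than hundreds of nested ones. That wiring is not shown here and has not 
been tested against Spark 2.1.0.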

> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
>                 Key: SPARK-20339
>                 URL: https://issues.apache.org/jira/browse/SPARK-20339
>             Project: Spark
>          Issue Type: Question
>          Components: Java API, Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark JIRA:
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program, which tries to replace each 
> manufacturer name with its equivalent alternate name.
> The issues occur only when we have a huge number of alternate names to 
> replace; with a small number of replacements it works without problems.
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this problem.
> {code}
>                       Hashtable manufacturerNames = new Hashtable();
>                         Enumeration names;
>                         String str;
>                         double bal;
>                         manufacturerNames.put("Allen","Apex Tool Group");
>                         manufacturerNames.put("Armstrong","Apex Tool Group");
>                         manufacturerNames.put("Campbell","Apex Tool Group");
>                         manufacturerNames.put("Lubriplate","Apex Tool Group");
>                         manufacturerNames.put("Delta","Apex Tool Group");
>                         manufacturerNames.put("Gearwrench","Apex Tool Group");
>                         manufacturerNames.put("H.K. Porter","Apex Tool 
> Group");
>                         manufacturerNames.put("Jacobs","Apex Tool Group");
>                         manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
>                         manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");
>                         manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");   
>                         // Iterate over all manufacturer names in the hash table.
>                         names = manufacturerNames.keys();
>                         Dataset<Row> dataFileContent = 
> sqlContext.load("com.databricks.spark.csv", options);
>                       
>                         
>                         while(names.hasMoreElements()) {
>                                str = (String) names.nextElement();
>                                
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
>                         }        
>                         dataFileContent.show();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
