[jira] [Commented] (SPARK-20339) Issue in regex_replace in Apache Spark Java

Nischay (JIRA) Mon, 17 Apr 2017 06:27:08 -0700

    [ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971054#comment-15971054 ]


Nischay commented on SPARK-20339:
---------------------------------

Sure, I will not add redundant code in future, and I will also use [email protected]

"For such a huge sequence of generating columns you are probably much better 
off constructing a Row directly in a transformation": we do not understand 
this suggestion. Could you please explain it in detail?

We tried a UDF instead, but we get a "Task not serializable" exception:
 
UDF1 removeSpecialCharaters = new UDF1<String, String>() {
    public String call(final String types) throws Exception {
        while (names.hasMoreElements()) {
            String str = (String) names.nextElement();
            types.replaceAll(str, manufacturerNames.get(str).toString());
        }
        return types;
    }
};
sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters, DataTypes.StringType);
dataFileContent.createOrReplaceTempView("DataSetOfinterest");
Dataset<Row> newDF = sqlContext.sql("select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");
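
A possible explanation for the exception, offered as a sketch rather than a confirmed diagnosis: if names is the Enumeration returned by manufacturerNames.keys(), the anonymous UDF1 above captures it, and that Enumeration does not implement java.io.Serializable, so Spark cannot ship the function to the executors. Separately, String.replaceAll returns a new string, so the call inside the loop leaves types unchanged. The class below uses only the JDK. ManufacturerReplacer and isSerializable are hypothetical names for illustration, not Spark APIs, and the keys are treated as literal text rather than regular expressions, which is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark): it holds only serializable state,
// so an instance can be captured by a function that is sent to executors.
class ManufacturerReplacer implements Serializable {
    private static final long serialVersionUID = 1L;

    // A plain map copy instead of an Enumeration: it can be iterated on
    // every call and it is serializable.
    private final LinkedHashMap<String, String> replacements;

    ManufacturerReplacer(Map<String, String> replacements) {
        this.replacements = new LinkedHashMap<>(replacements);
    }

    // Applies every replacement in turn and returns the rewritten value.
    String apply(String value) {
        if (value == null) {
            return null;
        }
        String result = value;
        for (Map.Entry<String, String> e : replacements.entrySet()) {
            // Strings are immutable: keep the string returned by replace().
            // Assumption: keys are literal text, not regular expressions.
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    // Returns true if Java serialization of the object succeeds.
    static boolean isSerializable(Object candidate) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }
}
```

With Spark on the classpath, an instance held in a final local variable could presumably be wrapped in a UDF1<String, String> whose call method returns replacer.apply(types), so that only the serializable map travels with the task. Calling isSerializable on the Enumeration from manufacturerNames.keys() is a quick way to check whether it is the object that breaks serialization.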


> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
>                 Key: SPARK-20339
>                 URL: https://issues.apache.org/jira/browse/SPARK-20339
>             Project: Spark
>          Issue Type: Question
>          Components: Java API, Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark JIRA: 
> https://issues.apache.org/jira/browse/SPARK-18492
> These issues arise in the following program, which replaces each 
> manufacturer name with its equivalent alternate name. They occur only when 
> there is a large number of alternate names to replace; with a small number 
> of replacements the program works without problems.
> dataFileContent = dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
>
> while(names.hasMoreElements()) {
>     str = (String) names.nextElement();
>     dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}
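
The loop in the quoted program calls withColumn once per map entry, so about 100 replacements end up stacked on the same column; this deep nesting is a likely (not confirmed) contributor to both reported errors. One way to keep the per-value work in a single step is to fold all names into one alternation pattern and replace them in one pass. The sketch below uses only the JDK. SinglePassReplacer is a hypothetical name, and quoting the keys as literal text (so the dots in "H.K. Porter" match only dots) is an assumption that differs from regexp_replace, which treats them as regular expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Spark): replaces every known manufacturer
// name in a value with one scan instead of one pass per name.
class SinglePassReplacer {
    private final Map<String, String> replacements;
    private final Pattern pattern; // null when there is nothing to replace

    SinglePassReplacer(Map<String, String> replacements) {
        this.replacements = new LinkedHashMap<>(replacements);
        if (this.replacements.isEmpty()) {
            this.pattern = null;
        } else {
            // Longer keys first, so the longest name wins when keys overlap.
            // Pattern.quote makes each key match as literal text.
            String alternation = this.replacements.keySet().stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
            this.pattern = Pattern.compile(alternation);
        }
    }

    String apply(String value) {
        if (value == null || pattern == null) {
            return value;
        }
        Matcher matcher = pattern.matcher(value);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // The matched text is exactly one of the keys.
            String target = replacements.get(matcher.group());
            // quoteReplacement keeps '$' and '\' in the target literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(target));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Because already-replaced text is never rescanned, a value such as "Standard Safety" is expanded only once. In a Spark job, apply could be called from a single UDF or map transformation, so the query plan would contain one expression regardless of how many names the map holds.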



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

