[
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971054#comment-15971054
]
Nischay commented on SPARK-20339:
---------------------------------
Sure, I will not add redundant code in future, and I will use [email protected].
We do not understand the suggestion "For such a huge sequence of generating
columns you are probably much better off constructing a Row directly in a
transformation". Could you please explain it in detail?
We also tried a UDF, but it fails with a "Task not serializable" exception:
UDF1<String, String> removeSpecialCharaters = new UDF1<String, String>() {
    public String call(final String types) throws Exception {
        while (names.hasMoreElements()) {
            String str = (String) names.nextElement();
            types.replaceAll(str, manufacturerNames.get(str).toString());
        }
        return types;
    }
};
sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters,
        DataTypes.StringType);
dataFileContent.createOrReplaceTempView("DataSetOfinterest");
Dataset<Row> newDF = sqlContext.sql(
        "select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");
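For reference, two things in the UDF above look suspicious to us. First, the
anonymous class captures the Enumeration "names", which is not serializable
and can only be walked once, and that may be what triggers "Task not
serializable". Second, String.replaceAll returns a new string, so its result
is currently discarded. Below is a minimal sketch in plain Java (no Spark
needed to try it) of a serializable helper that applies all replacements in a
single pass. The class AliasReplacer and its method names are hypothetical,
not part of Spark, and it treats each key as literal text rather than as a
regex, which is an assumption about what is wanted here.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: replaces every known alias with its canonical name in one pass. */
final class AliasReplacer implements Serializable {
    private static final long serialVersionUID = 1L;

    private final HashMap<String, String> aliases; // alias -> canonical name
    private final Pattern pattern;                 // null when there are no aliases

    AliasReplacer(Map<String, String> aliases) {
        // Copy into a plain HashMap so only serializable state is kept.
        this.aliases = new HashMap<>(aliases);

        // Longest aliases first, so a longer alias wins over one that is its prefix.
        List<String> keys = new ArrayList<>(this.aliases.keySet());
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder alternation = new StringBuilder();
        for (String key : keys) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote the key so that "H.K. Porter" matches literally (assumption).
            alternation.append(Pattern.quote(key));
        }
        this.pattern = keys.isEmpty() ? null : Pattern.compile(alternation.toString());
    }

    String replace(String input) {
        if (input == null || pattern == null) {
            return input;
        }
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the canonical name literal.
            m.appendReplacement(out, Matcher.quoteReplacement(aliases.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Inside the UDF, call(...) would then simply return replacer.replace(types),
where replacer is a final local AliasReplacer built once on the driver from
manufacturerNames. If the UDF is declared inside an instance method of a class
that is not serializable, the anonymous class may still capture that enclosing
object, so declaring it in a static method is probably safer.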
> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
> Key: SPARK-20339
> URL: https://issues.apache.org/jira/browse/SPARK-20339
> Project: Spark
> Issue Type: Question
> Components: Java API, Spark Core, SQL
> Affects Versions: 2.1.0
> Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1.
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator"
> grows beyond 64 KB.
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark JIRA:
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program, in which we try to replace
> each manufacturer name with its equivalent alternate name.
> The issues occur only when there is a large number of alternate names to
> replace; with a small number of replacements it works without problems.
> dataFileContent = dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
> Kindly suggest an alternative method or a workaround for this problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Enumerate the names to be replaced.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent =
> sqlContext.load("com.databricks.spark.csv", options);
>
>
> while(names.hasMoreElements()) {
> str = (String) names.nextElement();
>
> dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}