brian wang created SPARK-24095:
----------------------------------
Summary: Spark Streaming performance drops drastically when
saving dataframes with withColumn
Key: SPARK-24095
URL: https://issues.apache.org/jira/browse/SPARK-24095
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: brian wang
We have a Spark Streaming application that streams data from Kafka and
ingests it into HDFS after a series of transformations. We use Spark SQL
for the transformations and store the data in HDFS at two stages. The
ingestion to Spark that we do at the second stage drastically reduces the
performance of the application.
There are close to 40 million transactions per hour in the incoming data. We
have observed a performance bottleneck in the write to HDFS.
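For scale, the hourly figure above works out to roughly 11,000 records per second. The following is a minimal sketch of that arithmetic. The 10-second batch interval is only an assumption for illustration, since the actual interval is not stated in this report.

```python
# Rough throughput arithmetic for the incoming load described above.
TRANSACTIONS_PER_HOUR = 40_000_000  # figure quoted in the report
BATCH_INTERVAL_SECONDS = 10         # assumption: not stated in the report


def records_per_second(per_hour):
    """Average rate the pipeline has to sustain."""
    return per_hour / 3600


def records_per_batch(per_hour, batch_interval_seconds):
    """Average number of records each micro-batch has to write."""
    return records_per_second(per_hour) * batch_interval_seconds


rate = records_per_second(TRANSACTIONS_PER_HOUR)
batch = records_per_batch(TRANSACTIONS_PER_HOUR, BATCH_INTERVAL_SECONDS)
print(f"{rate:,.0f} records/s, {batch:,.0f} records per batch")
```

Under that assumed interval, each save has to finish in well under 10 seconds on average for the application to keep up with the input.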
Can you please help us optimize the application performance?
This is a critical issue since it is holding up our deployment to the
production cluster and we are running behind schedule.
First Stage Save
test_Transformed_DOW.cache().withColumn("test_class_map", udf(test_class_map,
StringType())(array(test_class))).write.mode("append").option("header","true").csv("/hive/warehouse/test")
Second Stage Save
test_Data_Final=spark.sql("select test1,test2,test3...... when int(seats)>=2
then 1 when int(seats) < 2 then 0 end as seats from
test_Data_Unpivoted").write.format("parquet").mode("append").saveAsTable("test_Data_Output")
It is the first save stage that slows down our Spark application when it is
enabled. If we disable it, the application seems to catch up with the
incoming data flow.