brian wang created SPARK-24095:
----------------------------------

             Summary: Spark Streaming performance drastically drops when 
saving dataframes with withColumn
                 Key: SPARK-24095
                 URL: https://issues.apache.org/jira/browse/SPARK-24095
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: brian wang


We have a Spark Streaming application that streams data from Kafka and 
ingests it into HDFS after a series of transformations. We use Spark SQL 
for the transformations and write the data to HDFS at two stages. The HDFS 
write that we do at the second stage is drastically reducing the performance 
of the application.
There are close to 40 million transactions per hour in the incoming data, and 
we have observed a performance bottleneck in the write to HDFS.
Can you please help us optimize the application's performance?
This is a critical issue: it is holding up our deployment to the production 
cluster, and we are running behind schedule on the production rollout.

First Stage Save

test_Transformed_DOW.cache() \
    .withColumn("test_class_map",
                udf(test_class_map, StringType())(array(test_class))) \
    .write.mode("append").option("header", "true") \
    .csv("/hive/warehouse/test")

Second Stage Save

test_Data_Final = spark.sql("select test1, test2, test3, ...... "
        "when int(seats) >= 2 then 1 "
        "when int(seats) < 2 then 0 end as seats "
        "from test_Data_Unpivoted") \
    .write.format("parquet").mode("append").saveAsTable("test_Data_Output")

It is the first save stage that slows our Spark application when it is 
enabled. If we disable it, the application keeps up with the incoming data 
flow.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
