Hello list

We have a Spark application that performs a set of ETL steps: it reads messages
from a Kafka topic, categorizes them, and writes the contents out as
Parquet files on HDFS. After writing, we query the data from HDFS
using Presto's Hive integration. We are running into problems because the Parquet
files are frequently left truncated after the Spark driver is killed or crashes.

The meat of the (Scala) Spark job looks like this:
Spark
  .openSession()
  .initKafkaStream("our_topic")
  .filter(...)
  .map(...)
  .coalesce(1)
  .writeStream
  .trigger(ProcessingTime("1 hours"))
  .outputMode("append")
  .queryName("MyETL")
  .format("parquet")
  .option("path", path)
  .start()
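
Spelled out against the standard Kafka source, the pipeline is roughly the
following sketch. The broker addresses, paths, and checkpoint location are
placeholders rather than our real values, and the categorization steps are
still elided; the checkpointLocation is the one the file sink requires (either
per query or via spark.sql.streaming.checkpointLocation):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("MyETL")
  .getOrCreate()

val query = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
  .option("subscribe", "our_topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS message")      // raw Kafka payload
  // categorization steps (filter/map) elided, as above
  .coalesce(1)                                         // collapse to one partition per trigger
  .writeStream
  .trigger(Trigger.ProcessingTime("1 hour"))
  .outputMode("append")
  .queryName("MyETL")
  .format("parquet")
  .option("path", "/data/archive/our_topic")           // placeholder path
  .option("checkpointLocation", "/checkpoints/MyETL")  // placeholder path
  .start()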

Is it expected that Parquet files could be truncated during crashes? 

Sometimes the files are only 4 bytes long; other times they are longer but
still too short to be valid Parquet files. Presto detects the short files
and refuses to query the entire table. I had hoped that writing out the files
would be transactional, so that incomplete files would not be output.
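
For what it's worth, a valid Parquet file both starts and ends with the 4-byte
magic "PAR1", so the 4-byte files appear to hold nothing but the opening magic.
A check along these lines finds the bad files (just a sketch; the directory is
a placeholder):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Flag Parquet files that are too short or do not end with the "PAR1" magic.
val fs = FileSystem.get(new Configuration())
val magic = "PAR1".getBytes(StandardCharsets.US_ASCII)

fs.listStatus(new Path("/data/archive/our_topic"))     // placeholder directory
  .filter(s => s.isFile && s.getPath.getName.endsWith(".parquet"))
  .foreach { status =>
    val len = status.getLen
    val valid = len >= 12 && {                         // magic + footer length + magic
      val in = fs.open(status.getPath)
      try {
        in.seek(len - 4)                               // last four bytes should be "PAR1"
        val tail = new Array[Byte](4)
        in.readFully(tail)
        tail.sameElements(magic)
      } finally in.close()
    }
    if (!valid) println(s"Truncated: ${status.getPath} ($len bytes)")
  }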

We can fix crashes as they come up, but we will always need to kill the job
periodically to deploy new versions of the code. We want to run the
application as a long-lived process that continually reads from the
Kafka topic and writes out to HDFS for archival purposes.
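
Right now deploys simply kill the driver. If a cleaner stop would avoid the
partial files, something along these lines is what I had in mind, using the
query and spark handles from the sketch above (again only a sketch, not what
we currently run):

// Stop the streaming query before the JVM exits instead of killing it mid-write.
sys.addShutdownHook {
  query.stop()               // stop the streaming query
  spark.stop()               // then shut down the SparkSession
}

query.awaitTermination()     // keep the driver alive until the query stops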


Thanks,
Dave Cameron



