Hello list,

We have a Spark application that performs a set of ETLs: it reads messages from a Kafka topic, categorizes them, and writes the contents out as Parquet files on HDFS. After writing, we query the data from HDFS using Presto's Hive integration. We are having problems because the Parquet files are frequently truncated after the Spark driver is killed or crashes.

The meat of the (Scala) Spark jobs looks like this:

    Spark
      .openSession()
      .initKafkaStream("our_topic")
      .filter(...)
      .map(...)
      .coalesce(1)
      .writeStream
      .trigger(ProcessingTime("1 hours"))
      .outputMode("append")
      .queryName("MyETL")
      .format("parquet")
      .option("path", path)
      .start()

Is it expected that Parquet files could be truncated during crashes? Sometimes the files are only 4 bytes long; sometimes they are longer but still too short to be valid Parquet files. Presto detects the short files and then refuses to query the table at all. I had hoped the file writes would be transactional, so that incomplete files would never be output.

We can fix crashes as they come up, but we will always need to kill the job periodically to deploy new versions of the code. We want to run the application as a long-lived process that continually reads from the Kafka topic and writes out to HDFS for archival purposes.

Thanks,
Dave Cameron