GitHub user uncleGen opened a pull request:
https://github.com/apache/spark/pull/17395
[SPARK-20065][SS] Avoid to output empty parquet files
## Problem Description
Reported by Silvio Fiorito:
I've got a Kafka topic which I'm querying, running a windowed aggregation,
with a 30 second watermark, 10 second trigger, writing out to Parquet with
append output mode.
Every 10 second trigger generates a file, regardless of whether there was
any data for that trigger, or whether any records were actually finalized by
the watermark.
Is this expected behavior or should it not write out these empty files?
```
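import org.apache.spark.sql.functions.{date_format, window}
import org.apache.spark.sql.streaming.ProcessingTime
import spark.implicits._  // enables the $"col" syntax

val df = spark.readStream.format("kafka")....

// Count events per 10-second window, with a 30-second watermark on "timestamp"
val query = df
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()
  .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")

// aggChk and aggPath are the checkpoint and output directories
query
  .writeStream
  .format("parquet")
  .option("checkpointLocation", aggChk)
  .trigger(ProcessingTime("10 seconds"))
  .outputMode("append")
  .start(aggPath)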
```
As the query executes, do a file listing on "aggPath" and you'll see 339
byte files at a minimum until we arrive at the first watermark and the initial
batch is finalized. Even after that though, as there are empty batches it'll
keep generating empty files every trigger.
## What changes were proposed in this pull request?
Check whether each partition is empty, and skip empty partitions so that
no empty output files are written.
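
The idea can be illustrated outside of Spark with a minimal sketch. It is not the code in this patch: `EmptyPartitionSketch`, `writePartitions`, and the `part-NNNNN.txt` naming are hypothetical, and plain text files stand in for Parquet. It only shows the check itself, assuming each partition is exposed as an iterator of rows: test `hasNext` before opening an output file, and skip the partition if it has no rows.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a per-partition write task; not Spark's writer code.
object EmptyPartitionSketch {
  /** Writes one file per non-empty partition and returns the paths created. */
  def writePartitions(partitions: Seq[Iterator[String]], dir: Path): Seq[Path] = {
    val written = ArrayBuffer.empty[Path]
    partitions.zipWithIndex.foreach { case (rows, id) =>
      // Only create an output file if the partition has at least one row.
      if (rows.hasNext) {
        val out = dir.resolve(f"part-$id%05d.txt")
        Files.write(out, rows.mkString("\n").getBytes("UTF-8"))
        written += out
      }
    }
    written.toSeq
  }
}
```

With three partitions of which the middle one is empty, only two files would be created, so a trigger that produces no rows leaves the output directory unchanged.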
## How was this patch tested?
Jenkins
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/uncleGen/spark SPARK-20065
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17395.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17395
----
commit 86a7d2fa96e3134c1e64864eba81a3bebdedceea
Author: uncleGen <[email protected]>
Date: 2017-03-23T08:10:31Z
avoid to output empty parquet files
----