[GitHub] [spark] HeartSaVioR opened a new pull request #28422: [SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files

GitBox Thu, 30 Apr 2020 05:18:21 -0700


HeartSaVioR opened a new pull request #28422:
URL: https://github.com/apache/spark/pull/28422

### What changes were proposed in this pull request?

This PR introduces a new option, `inputRetention`, which lets users specify
a retention period for input files.

`maxAgeMs` acts as a `soft` limit: it is not applied under some conditions
(such as the first batch), and it is applied relative to the modified time of
input files. Because it is not applied consistently across the matrix of
configurations, Spark cannot purge entries based on it. (A streaming query
can be relaunched with changed configurations.)

`inputRetention` acts as a `hard` limit: Spark will not include files older
than the retention as input files, and it also tries to exclude file entries
older than the retention (this actually happens on compaction, as that is the
only phase that removes entries).
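
The hard limit described above can be sketched roughly as follows. This is a
minimal illustration in Python, not the code in this PR: the `FileEntry` type
and the functions `is_expired`, `select_input_files` and `compact_entries` are
hypothetical names, timestamps are assumed to be epoch milliseconds, and
keeping a file that sits exactly on the cutoff is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileEntry:
    path: str
    modified_ms: int  # file modification time, epoch milliseconds (assumed)


def is_expired(entry: FileEntry, now_ms: int, retention_ms: int) -> bool:
    # Hard limit: the cutoff is derived from the system timestamp (now_ms).
    return entry.modified_ms < now_ms - retention_ms


def select_input_files(listed: List[FileEntry], now_ms: int,
                       retention_ms: int) -> List[FileEntry]:
    # Files older than the retention are never picked up as input.
    return [e for e in listed if not is_expired(e, now_ms, retention_ms)]


def compact_entries(tracked: List[FileEntry], now_ms: int,
                    retention_ms: int) -> List[FileEntry]:
    # On compaction, outdated entries are dropped from the tracked set,
    # so the compacted metadata stops growing without bound.
    return [e for e in tracked if not is_expired(e, now_ms, retention_ms)]
```

Both steps share the same predicate, so a file that is too old to be read is
also too old to be kept in the metadata once the next compaction runs.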

`inputRetention` is relative to the system timestamp, unlike `maxAgeMs`,
which makes it easier for end users to reason about. This requires end users
to set the nodes' clocks correctly, but in most cases they would do so for
other reasons anyway. Note that this also filters out old files when the
query intends to replay from input files, so that case should be taken into
account as well.
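
To make the difference in reference point concrete, here is a small Python
sketch with hypothetical helper names. It assumes the soft limit's cutoff is
derived from the newest modification time among the input files (one reading
of "relative to the modified time of input files"), while the hard limit's
cutoff is derived from the current system time. The numbers are made up.

```python
DAY_MS = 24 * 60 * 60 * 1000


def soft_cutoff_ms(latest_modified_ms: int, max_age_ms: int) -> int:
    # Assumption: the soft limit is anchored to the newest file seen.
    return latest_modified_ms - max_age_ms


def hard_cutoff_ms(now_ms: int, retention_ms: int) -> int:
    # The hard limit is anchored to the system timestamp.
    return now_ms - retention_ms


# Replay scenario: all input files were written long before "now".
now_ms = 100 * DAY_MS
modified = [10 * DAY_MS, 11 * DAY_MS, 12 * DAY_MS]

soft = soft_cutoff_ms(max(modified), 7 * DAY_MS)
hard = hard_cutoff_ms(now_ms, 7 * DAY_MS)

kept_by_soft = [m for m in modified if m >= soft]
kept_by_hard = [m for m in modified if m >= hard]
```

In this scenario the cutoff anchored to the files keeps all three of them,
while the cutoff anchored to the system clock keeps none, which is the replay
caveat mentioned above.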

### Why are the changes needed?

Metadata growth in both the file stream source and the file stream sink has
been a pain to deal with. The file stream source tracks all processed input
files, so the size of the metadata grows continuously, and there is no way
to reduce its size or number of entries. In a compact batch, it reads all
previously tracked input file entries to write the new compact file, which
adds major latency.

The issue has even been reported on the user mailing list; see:
https://lists.apache.org/thread.html/r897771f5526d10d0b13da9177a6b7d2e378888b22823c839cceea457%40%3Cuser.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

This doesn't bring any change "by default", as the new configuration is
optional. (The default value is set to an unrealistically large one, which
effectively disables retention.)

This adds a new configuration; its behavior is described in the previous
sections.

### How was this patch tested?

New unit tests were added, each verifying two behaviors:

1) old files should not be included as input files if input retention is
specified
2) when compacting, outdated entries should be filtered out

I've also manually tested the two behaviors above.

