Bumping again - I hope to get some traction, because these PRs address
either long-standing problems or noticeable improvements (each PR includes
numbers or a UI graph to demonstrate the improvement).

Fixed long-standing problems:

* [SPARK-17604][SS] FileStreamSource: provide a new option to have
retention on input files [1]
* [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
on output files [2]

There's currently no logic to control the size of the metadata for the file
stream source and file stream sink, which affects end users who run
streaming queries with many input/output files over the long run. Both PRs
address metadata growing unboundedly over time. As its JIRA number
suggests, SPARK-17604 is a fairly old problem, and at least three related
issues have been reported against SPARK-27188.
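For context, a minimal sketch of why the metadata grows: the file stream
source/sink log is periodically compacted by rewriting all live entries into
a single compact file, so without any retention the compact file keeps every
entry ever written. The simulation below is illustrative, not Spark's actual
code; the interval of 10 mirrors the default value of
spark.sql.streaming.fileSink.log.compactInterval.

```python
# Illustrative simulation of compact-log growth in a file stream metadata
# log. Not Spark's implementation; names and numbers are for illustration.

COMPACT_INTERVAL = 10  # mirrors Spark's default compact interval

def compact_log_sizes(num_batches, files_per_batch=100):
    """Return the entry count of each compact file as batches progress."""
    sizes = []
    total_entries = 0
    for batch_id in range(num_batches):
        total_entries += files_per_batch  # every batch appends new entries
        if (batch_id + 1) % COMPACT_INTERVAL == 0:
            # Compaction rewrites ALL entries seen so far into one file;
            # with no retention option, nothing is ever dropped.
            sizes.append(total_entries)
    return sizes

sizes = compact_log_sizes(100)
# Each compact file is strictly larger than the last: growth is unbounded,
# which is exactly what the retention options in [1] and [2] aim to cap.
```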

Improvements:

* [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
maxFilesPerTrigger as unread files [3]
* [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
twice if the query restarts from compact batch [4]
* [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
LZ4 compression on FileStream(Source/Sink)Log [5]

The patches above provide better performance under the conditions described
in each PR. Notably, SPARK-30946 speeds up compaction significantly (~10x)
for every compact batch, while also shrinking the compact batch log file
(to ~30% of its current size).
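To illustrate the idea behind SPARK-30866, here is a rough sketch using
hypothetical names (this is not Spark's internal code): instead of
discarding the files listed beyond maxFilesPerTrigger and re-listing the
input directory on the next trigger, the source keeps the remainder cached
as "unread" and serves the next batch from that cache first.

```python
# Rough sketch of the caching idea in SPARK-30866 (hypothetical names,
# not Spark's internals): files fetched beyond maxFilesPerTrigger are
# kept as "unread" instead of being thrown away and re-listed.

class CachingFileSource:
    def __init__(self, list_files, max_files_per_trigger):
        self.list_files = list_files  # callable: lists input files (expensive)
        self.max_files = max_files_per_trigger
        self.unread = []              # cache of fetched-but-unprocessed files
        self.seen = set()             # files already served in a batch

    def next_batch(self):
        if not self.unread:
            # Only hit the (expensive) directory listing once the cache
            # is fully drained, instead of on every trigger.
            self.unread = [f for f in self.list_files() if f not in self.seen]
        batch = self.unread[:self.max_files]
        self.unread = self.unread[self.max_files:]
        self.seen.update(batch)
        return batch
```

With 25 listed files and maxFilesPerTrigger = 10, three consecutive batches
(10, 10, 5 files) are served from a single listing call rather than three.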

1. https://github.com/apache/spark/pull/28422
2. https://github.com/apache/spark/pull/28363
3. https://github.com/apache/spark/pull/27620
4. https://github.com/apache/spark/pull/27649
5. https://github.com/apache/spark/pull/27694


On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Worth noting that I got a similar question from the local community as
> well. These reporters didn't hit an edge case; they encountered this
> critical issue during the normal operation of a streaming query.
>
> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> (bump to expose the discussion to more readers)
>>
>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> I'm seeing more and more Structured Streaming end users encountering the
>>> metadata issues on the file stream source and sink. These have been known
>>> issues, with long-standing JIRA tickets reported before, and end users
>>> reported them again on the user@ mailing list in April.
>>>
>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>> input files | Spark -2.4.0 [1]
>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>
>>> I've proposed various improvements in this area (see my PRs [3]) but they
>>> have suffered from a lack of interest/reviews. I feel the issue is
>>> critical (and underestimated) because...
>>>
>>> 1. It's one of the "built-in" data sources maintained by the Spark
>>> community. (End users may judge the state of the project/area by the
>>> quality of the built-in data sources, because those are what they would
>>> start with.)
>>> 2. It's the only built-in data source that provides "end-to-end
>>> exactly-once" semantics in Structured Streaming.
>>>
>>> I'd hope to see us address such issues so that end users can live with
>>> the built-in data source. (It may not need to be perfect, but it should
>>> at least be reasonable for long-running streaming workloads.) I know
>>> there are a couple of alternatives, but I don't think newcomers would
>>> start from there. End users may simply look for alternatives - not an
>>> alternative data source, but an alternative stream processing framework.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 2.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>
