It's worth noting that I've received similar questions from the local
community as well. These reporters didn't hit an edge case; they encountered
a critical issue during the normal running of a streaming query.

On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> (bump to expose the discussion to more readers)
>
> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> Hi devs,
>>
>> I'm seeing more and more Structured Streaming end users encountering
>> metadata issues on the file stream source and sink. These are known issues,
>> and there are even long-standing JIRA issues reported for them, yet end
>> users reported them again on the user@ mailing list in April.
>>
>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>> input files | Spark -2.4.0 [1]
>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>
>> I've proposed various improvements in this area (see my PRs [3]), but they
>> have suffered from a lack of interest/reviews. I feel the issue is critical
>> (and underestimated) because...
>>
>> 1. It's one of the "built-in" data sources maintained by the Spark
>> community. (End users may judge the state of the project/area by the
>> quality of the built-in data sources, because that's what they would start
>> with.)
>> 2. It's the only built-in data source that provides "end-to-end
>> exactly-once" in Structured Streaming.
>>
>> I hope to see us address these issues so that end users can live with the
>> built-in data source. (It doesn't need to be perfect, but it should at
>> least be reasonable for long-running streaming workloads.) I know there are
>> a couple of alternatives, but I don't think newcomers would start from
>> there. End users may just look for alternatives: not an alternative data
>> source, but an alternative stream processing framework.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1.
>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>> 2.
>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>
>
