Small correction: I meant "I intentionally didn't enumerate." The meaning
could be quite different otherwise, hence the correction.

On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>
> We have already made a non-trivial number of band-aid fixes for the file
> stream sink. For example,
>
> https://github.com/apache/spark/pull/28363
> https://github.com/apache/spark/pull/28904
> https://github.com/apache/spark/pull/29505
> https://github.com/apache/spark/pull/31638
>
> There was a lot of pushback, because these fixes do not solve the real
> problem. The consensus was that we don't want to come up with another Data
> Lake product, which would require us to put in months (or maybe years) of
> effort. Now, these Data Lake products are backed by companies and they are
> successful projects in their own right. I'm not sure I can be supportive of
> the effort on another band-aid fix.
>
> Maintaining the metadata directory is the root of the headache. Unless we
> see the benefit of removing the metadata directory (hence at-least-once)
> and plan to deal with that, I'd like to leave the file stream sink as it is.
>
> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk <wojciechin...@gmail.com>
> wrote:
>
>> Hi Jungtaek,
>> Integration with Delta Lake is not an option for me. I raised a PR that
>> improves FileStreamSink with a new parameter:
>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>>
>> On Sun, Apr 16, 2023 at 04:45 Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Many issues have been reported to us with the current FileStream
>>> sink. The effort to fix these issues is quite significant, and it ended up
>>> giving rise to "Data Lake" products.
>>>
>>> I'd recommend not fixing the issue but leaving it as a limitation, and
>>> integrating your workload with Data Lake products. For full disclosure, I
>>> work at Databricks so I might be biased, but even when I was working at my
>>> previous employer, which didn't have a Data Lake product at that time, I
>>> also had to agree that there are too many things to fix, and the effort
>>> would be fully redundant with existing products.
>>>
>>> It might be helpful to have an "at-least-once" version of the
>>> FileStream sink, where a metadata directory is no longer needed. It may
>>> require the implementation to go back to the old way of atomic renaming,
>>> but it would also get rid of the need for a metadata directory, so
>>> someone might find it useful. For end-to-end exactly-once, people can
>>> either use the current, limited FileStream sink or use Data Lake products.
>>> I don't see the value in making improvements to the current FileStream sink.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <wojciechin...@gmail.com>
>>> wrote:
>>>
>>>> Hi!
>>>> I raised a ticket on a parametrisable output metadata path:
>>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>>> I am going to raise a PR for it, and I realised that this relatively
>>>> simple change affects the method hasMetadata(path), which would take on
>>>> a new meaning if we can define a custom path for the metadata of output
>>>> files. Can you please share your opinion on how a custom output metadata
>>>> path could affect the design of Structured Streaming?
>>>> E.g. I can see one case where I set the output metadata path parameter,
>>>> run a job on output path A, stop the job, change the output path to B, and
>>>> hasMetadata works well. If you have any corner case in mind where the
>>>> parametrised output metadata path can break something, please describe it.
>>>>
>>>> --
>>>> Kind regards/ Pozdrawiam,
>>>> Wojciech Indyk
>>>>
>>>
