Small correction: I meant "I intentionally didn't enumerate." The original wording could be read quite differently, so I'm correcting it.

On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>
> We made non-trivial numbers of band-aid fixes already for file stream
> sink. For example,
>
> https://github.com/apache/spark/pull/28363
> https://github.com/apache/spark/pull/28904
> https://github.com/apache/spark/pull/29505
> https://github.com/apache/spark/pull/31638
>
> There were many push backs, because these fixes do not solve the real
> problem. The consensus was that we don't want to come up with another Data
> Lake product which requires us to put months (or maybe years) of effort.
> Now, these Data Lake products are backed by companies and they are
> successful projects as individuals. I'm not sure I can be supportive with
> the effort on another band-aid fix.
>
> Maintaining metadata directory is a root of the headache. Unless we see
> the benefit of removing the metadata directory (hence at-least-once) and
> plan to deal with that, I'd like to leave file stream sink as it is.
>
> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk <wojciechin...@gmail.com>
> wrote:
>
>> Hi Jungtaek,
>> integration with Delta Lake is not an option to me, I raised a PR for
>> improvement of FileStreamSink with the new parameter:
>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>>
>> On Sun, Apr 16, 2023 at 04:45 Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We have been indicated with lots of issues with the current FileStream
>>> sink. The effort to fix these issues are quite significant, and it ended up
>>> with derivation of "Data Lake" products.
>>>
>>> I'd recommend not to fix the issue but leave it as its limitation, and
>>> integrate your workload with Data Lake products. For a full disclaimer, I
>>> work in Databricks so I might be biased, but even when I was working at the
>>> previous employer which didn't have the Data Lake product at that time, I
>>> also had to agree that there are too many things to fix, and the effort
>>> would be fully redundant with existing products.
>>>
>>> Maybe, it might be helpful to have an "at-least-once" version of
>>> FileStream sink, where a metadata directory is no longer needed. It may
>>> require the implementation to go back to the old way of atomic renaming,
>>> but it will also get rid of the necessity of a metadata directory, so
>>> someone might find it useful. For end-to-end exactly once, people can
>>> either use a limited current FileStream sink or use Data Lake products. I
>>> don't see the value in making improvements to the current FileStream sink.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <wojciechin...@gmail.com>
>>> wrote:
>>>
>>>> Hi!
>>>> I raised a ticket on parametrisable output metadata path
>>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>>> I am going to raise a PR against it and I realised, that this
>>>> relatively simple change impacts on method hasMetadata(path), that would
>>>> have a new meaning if we can define custom path for metadata of output
>>>> files. Can you please share your opinion on how the custom output metadata
>>>> path can impact on design of structured streaming?
>>>> E.g. I can see one case when I set a parameter of output metadata path,
>>>> run a job on output path A, stop the job, change the output path to B and
>>>> hasMetadata works well. If you have any corner case in mind where the
>>>> parametrised output metadata path can break something please describe it.
>>>>
>>>> --
>>>> Kind regards/ Pozdrawiam,
>>>> Wojciech Indyk
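
To make the hasMetadata(path) question from the original message concrete, here is a toy model in plain Python. It is only a sketch, not Spark's code: it assumes the check boils down to "the output path is a directory and its metadata directory exists", with the metadata directory defaulting to `_spark_metadata` under the output path. The `custom_metadata_path` argument and the helper names are hypothetical stand-ins for the parameter proposed in SPARK-43152.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Directory name the file sink uses for its metadata log by default.
DEFAULT_METADATA_DIR = "_spark_metadata"


def resolve_metadata_dir(output_path: Path,
                         custom_metadata_path: Optional[Path] = None) -> Path:
    """Return where the metadata log is expected to live.

    custom_metadata_path is a hypothetical parameter modelling the proposal;
    when it is not given, the metadata directory sits under the output path.
    """
    if custom_metadata_path is not None:
        return custom_metadata_path
    return output_path / DEFAULT_METADATA_DIR


def has_metadata(output_path: Path,
                 custom_metadata_path: Optional[Path] = None) -> bool:
    """Toy check: the output is a directory and the metadata directory exists."""
    meta = resolve_metadata_dir(output_path, custom_metadata_path)
    return output_path.is_dir() and meta.is_dir()


def run_scenario() -> Dict[str, bool]:
    """Run a job on output A, then point the job at output B."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        out_a, out_b, shared_meta = root / "A", root / "B", root / "meta"
        out_a.mkdir()
        out_b.mkdir()

        # Default layout: the job that wrote to A left A/_spark_metadata behind.
        (out_a / DEFAULT_METADATA_DIR).mkdir()
        default_a = has_metadata(out_a)
        default_b = has_metadata(out_b)

        # Custom layout: the metadata lives outside both output paths.
        shared_meta.mkdir()
        custom_a = has_metadata(out_a, shared_meta)
        custom_b = has_metadata(out_b, shared_meta)

        return {
            "default_a": default_a,
            "default_b": default_b,
            "custom_a": custom_a,
            "custom_b": custom_b,
        }


if __name__ == "__main__":
    for name, value in run_scenario().items():
        print(f"{name}: {value}")
```

In this model the default layout ties the answer to the output path itself, while a custom metadata path makes the answer the same for A and B because both resolve to the same directory. Whether that is the desired meaning of hasMetadata for a reader of path B is exactly the design question raised in the thread; the sketch takes no position on it.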