Thanks Cham and Robert! I think having a .withPipelineTempDirectory builder
option in the file-based sink Builders would be helpful for our use case.
I'm happy to put up a PR if there's consensus on this.

Thanks,
Claire

On Fri, Sep 10, 2021 at 1:20 PM Chamikara Jayalath <[email protected]>
wrote:

>
>
> On Fri, Sep 10, 2021 at 9:24 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> Being able to (cheaply) rename files is precisely why the temporary
>> files are colocated with the final output destination. There are other
>> benefits as well like automatically inheriting the right permissions,
>> not having to worry about having sufficient disk/quota etc. in both
>> the temporary and final place, etc. I don't think it makes sense to
>> change this default. (I don't think the analogy with BigqueryIO holds,
>> as there we have no "local" place to collate with.
>>
>> We could introduce a new withPipelineTempDirectory() method that would
>> invoke withTempDirectory with the PipelineOptions.getTempLocation to
>> make this easier.
>>
>
> As others pointed out, getTempLocation can be a completely different
> bucket or maybe even a different region for some users. So changing the
> default could result in performance regressions for some users.
>
> BigQueryIO sink (in export mode)  is a bit different since we don't
> perform a renaming of files to a final location. We just write temp files
> and load them to BigQuery using load jobs.
>
> Also note that we create a uniquely named subdirectory within the temp
> directory for each run of a pipeline. So there should not be conflicts
> between different runs when using the same output directory:
> https://github.com/apache/beam/blob/95c3811312ad41f98dd5ef0674ef30ce30448746/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L525
>
> Temp output directory can be overridden today using the
> "withTempDirectory" option. I'm also OK with introducing a pipeline option
> to make this easier if needed.
>
> Thanks,
> Cham
>
>
>>
>> On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty
>> <[email protected]> wrote:
>> >
>> > Niel, that's a good point -- I don't think there's any restriction on
>> the filesystem of PipelineOptions#tempLocation, I was able to run a job on
>> DirectRunner with PipelineOptions#tempLocation set to a local path & my
>> TextIO.Write#outputDirectory set to a remote filesystem.
>> >
>> > But currently there's also no check that would catch
>> AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
>> at graph construction time, either (it will fail at runtime instead). I
>> think the FileSystems api has some logic for performing this check that we
>> could extract into a utility method and use in PTransform#expand?
>> >
>> > - Claire
>> >
>> > On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <[email protected]> wrote:
>> >>
>> >> I have heard of an intermittent fault in avroIO where two independent
>> pipelines using the same output directory deleted each others temp files
>> while they were being written.
>> >>
>> >> I cannot reproduce the problem, nor have I found a code path that
>> could cause it in fileio (with a superficial look), but it has been
>> reported more than once...
>> >>
>> >>
>> >>
>> >> Back to the original question, is it possible that the tempDirectory
>> from PipelineOptions points to a different filesystem than used for the
>> fileio output? Because this could break transforms that write to a tempfile
>> in the destination filesystem then rename the tempfile to the final output
>> filename when writing is complete.
>> >>
>> >> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <[email protected]> wrote:
>> >>>
>> >>> While this makes sense, I can also see a risk of breaking existing
>> users by changing the default like that. In addition it might be a
>> less-performant default. A file rename that takes place within the same
>> location might be performed via a fast rename operation, where otherwise it
>> would be forced to generate an expensive copy + delete. This could cause
>> strange and hard-to-debug performance problems when people upgrade to the
>> newer Beam version.
>> >>>
>> >>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <[email protected]> wrote:
>> >>>>
>> >>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
>> >>>>
>> >>>> This proposal makes sense to me. It makes it easier for users to
>> reason about why a temp directory is chosen, and would lead to a unified
>> code across all IOs that does this.
>> >>>>
>> >>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>> [email protected]> wrote:
>> >>>>>
>> >>>>> Hi Beam devs,
>> >>>>>
>> >>>>> I have a question/proposal about the default tempDirectory setting
>> for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>> optional tempDirectory setter, and when the transforms are expanded,
>> tempDirectory will default to the value of the final output directory if
>> null [AvroIO/FileIO/TextIO].
>> >>>>>
>> >>>>> I think it would make sense to default to the value of
>> PipelineOptions#getTempLocation instead, which is accessible inside the
>> expand(PCollection<T> input) method; it seems reasonable for the user to
>> expect that their PipelineOptions#getTempLocation will be honored, and
>> additionally, their final output locations may have locks/retention
>> policies set that make the temp file renaming step fail. Plus, this pattern
>> looks like it's already being used in BigQueryIO.
>> >>>>>
>> >>>>> What do you think?
>> >>>>>
>> >>>>> Thanks!
>> >>>>> Claire
>> >>>>>
>> >>>>>
>>
>

Reply via email to