Hey Preston,

I just wrote a reply on the user mailing list. Copying the reply here just in case:

----

Your observation seems to be correct. There is an issue with the file system registration.

The two kinds of errors you are seeing, as well as the successful run, come down to the different structure of the generated transforms. The Flink scheduler distributes them differently, so some stages end up on task managers which happen to have executed the FileSystems initialization code, while others do not.

There is a quick fix to at least initialize the file systems when they have not been registered yet, by adding the loading code here: https://github.com/apache/beam/blob/948c6fae909685e09d36b23be643182b34c8df25/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L463
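
For illustration, the quick fix could look roughly like this (paraphrased from the linked method; the containsKey fallback is the assumed addition, not code that exists today):

private static FileSystem getFileSystemInternal(String scheme) {
    String lowerCaseScheme = scheme.toLowerCase(Locale.ENGLISH);
    Map<String, FileSystem> fileSystems = SCHEME_TO_FILESYSTEM.get();
    if (!fileSystems.containsKey(lowerCaseScheme)) {
        // Fallback: load the registrars with default options. Note that
        // schemes requiring configuration (e.g. S3 credentials) still
        // cannot be set up properly here.
        setDefaultPipelineOptions(PipelineOptionsFactory.create());
        fileSystems = SCHEME_TO_FILESYSTEM.get();
    }
    FileSystem fileSystem = fileSystems.get(lowerCaseScheme);
    if (fileSystem == null) {
        throw new IllegalArgumentException("No filesystem found for scheme " + scheme);
    }
    return fileSystem;
}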

However, at that point we do not have the pipeline options available, which prevents any configuration. The underlying problem is that the error occurs in a coder used by a native Flink operation which does not even run user code.

I believe the only way to fix this is to ship the FileSystems initialization code in CoderTypeSerializer, where we are sure to execute it in time for any coders which depend on it.
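
To make that concrete, a rough sketch of the idea (my assumption of the wiring, not existing code; the class name is made up): the serializer carries a SerializablePipelineOptions (from runners-core-construction), whose Java deserialization, as far as I can tell, already re-runs FileSystems.setDefaultPipelineOptions(...), so restoring the serializer on a task manager seeds the registry before any coder is invoked.

class OptionsAwareCoderSerializer<T> implements java.io.Serializable {

    private final Coder<T> coder;
    // Deserializing this field on the task manager re-registers the file
    // systems via FileSystems.setDefaultPipelineOptions(...).
    private final SerializablePipelineOptions pipelineOptions;

    OptionsAwareCoderSerializer(Coder<T> coder, SerializablePipelineOptions pipelineOptions) {
        this.coder = coder;
        this.pipelineOptions = pipelineOptions;
    }

    T deserialize(InputStream in) throws IOException {
        // By the time we decode, the s3 (or other) scheme is registered.
        return coder.decode(in);
    }
}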

Could you file an issue? I'd be happy to fix this then.

Thanks,
Max

----

On 23.09.19 08:04, Koprivica,Preston Blake wrote:
Hello everyone. This is a cross-post from the users list. It didn’t get much traction, so I thought I’d move over to the dev group, since this seems like it might be an issue with initialization in the FlinkRunner (apologies in advance if I missed something silly).

I’m getting the following error when attempting to use the FileIO APIs (beam-2.15.0) and integrating with AWS S3. I have set up the PipelineOptions with all the relevant AWS options (the wiring is sketched after the trace below), so the filesystem registry *should* be properly seeded by the time the graph is compiled and executed:

java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105)
    at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
    at org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:83)
    at org.apache.beam.sdk.transforms.join.UnionCoder.decode(UnionCoder.java:32)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480)
    at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93)
    at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
    at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.FlatMapDriver.run(FlatMapDriver.java:107)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
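
For context, the AWS options are wired up roughly like this (a sketch; the region value is illustrative, AwsOptions comes from the Beam AWS module):

AwsOptions awsOptions = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(AwsOptions.class);
awsOptions.setAwsRegion("us-east-1"); // illustrative region

// Pipeline.create(...) registers the file systems on the driver, which is
// why the registry *should* already be seeded when the graph runs.
Pipeline pipeline = Pipeline.create(awsOptions);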

For reference, the write code resembles this:

FileIO.Write<?, GenericRecord> write = FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema))
        .to(options.getOutputDir()) // will be something like: s3://<bucket>/<path>
        .withSuffix(".parquet");

records.apply(String.format("Write(%s)", options.getOutputDir()), write);

The issue does not appear to be related to ParquetIO.sink(); I can reliably reproduce it with JSON-formatted records and TextIO.sink() as well.
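
For completeness, the TextIO variant looks roughly like this (toJson is a placeholder for our record-to-JSON conversion):

FileIO.Write<Void, String> jsonWrite = FileIO.<String>write()
        .via(TextIO.sink())
        .to(options.getOutputDir())
        .withSuffix(".json");

records
        .apply("ToJson", MapElements.into(TypeDescriptors.strings())
                .via((GenericRecord r) -> toJson(r)))
        .apply(String.format("Write(%s)", options.getOutputDir()), jsonWrite);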

Just trying some different knobs, I went ahead and set the following option:

write = write.withNoSpilling();

This actually seemed to fix the issue, only to have it reemerge as I scaled up the data set size.  The stack trace, while very similar, reads:

java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1149)
    at org.apache.beam.sdk.io.FileBasedSink$FileResultCoder.decode(FileBasedSink.java:1105)
    at org.apache.beam.sdk.coders.Coder.decode(Coder.java:159)
    at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:82)
    at org.apache.beam.sdk.coders.KvCoder.decode(KvCoder.java:36)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:543)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:534)
    at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.decode(WindowedValue.java:480)
    at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.deserialize(CoderTypeSerializer.java:93)
    at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
    at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
    at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:72)
    at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:47)
    at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
    at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:94)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:503)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:368)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)

I’d be interested to hear some theories on the differences/similarities between the two stacks. Lastly, I tried adding the following deprecated option (with and without the withNoSpilling() option):

write = write.withIgnoreWindowing();

This seemed to fix the issue altogether, but aside from having to rely on a deprecated feature, there is the bigger question of why.

In reading through some of the source, it seems a common pattern to have to manually register the pipeline options to seed the filesystem registry during the setup part of the operator lifecycle, e.g.: https://github.com/apache/beam/blob/release-2.15.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L304-L313
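
For reference, that pattern looks roughly like this (a minimal standalone illustration, not the actual DoFnOperator code; the class name here is made up):

class FileSystemsInitializingOperator implements java.io.Serializable {

    // Shipped with the operator from the driver to the task managers.
    private final SerializablePipelineOptions serializedOptions;

    FileSystemsInitializingOperator(SerializablePipelineOptions serializedOptions) {
        this.serializedOptions = serializedOptions;
    }

    /** Called once per task, before any elements are processed. */
    void open() {
        // Deserialize the shipped options and (re-)register the file
        // systems (e.g. s3) on this task manager.
        FileSystems.setDefaultPipelineOptions(serializedOptions.get());
    }
}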

Is it possible that I have hit upon a couple of scenarios where that has not taken place? Unfortunately, I’m not yet in a position to suggest a fix, but I’m guessing there’s some missing initialization code in one or more of the batch operators. If this is indeed a legitimate issue, I’ll be happy to log one, but I’ll hold off until the community gets a chance to look at it.

Thanks,

Preston

