Beam is able to infer compression from file extensions for a variety of
formats, but Snappy is not currently among them:

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java

ParquetIO and AvroIO, however, each appear to support Snappy.

So as best I can tell, there is no current built-in support for reading
text files compressed with Snappy. I think you would need to use FileIO to
match the files, then implement a custom DoFn that takes each file object,
streams its contents through a Snappy decompressor, and outputs one record
per line.
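A rough sketch of such a DoFn, assuming the snappy-java library
(org.xerial.snappy) is on the classpath; the class name ReadSnappyLinesFn
is hypothetical, and depending on how the files were written you may need
SnappyFramedInputStream instead of SnappyInputStream:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.xerial.snappy.SnappyInputStream;

/** Reads a Snappy-compressed text file and emits one element per line. */
class ReadSnappyLinesFn extends DoFn<FileIO.ReadableFile, String> {
  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file,
      OutputReceiver<String> out) throws Exception {
    // file.open() returns a ReadableByteChannel; wrap it in a Snappy
    // decompressor and read line by line.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(
            new SnappyInputStream(Channels.newInputStream(file.open())),
            StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        out.output(line);
      }
    }
  }
}
```

In the pipeline this would replace TextIO.readFiles(), roughly:
FileIO.match() -> FileIO.readMatches() -> ParDo.of(new ReadSnappyLinesFn()).
No SnappyCoder is involved, since decompression happens inside the DoFn
rather than at the coder level.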

I imagine a PR to add snappy as a supported format in Compression.java
would be welcome.

On Wed, Apr 22, 2020 at 1:16 PM Christopher Larsen <[email protected]>
wrote:

> Hi devs,
>
> We are trying to build a pipeline to read snappy compressed text files
> that contain one record per line using the Java SDK.
>
> We have tried the following to read the files:
>
> p.apply("ReadLines",
> FileIO.match().filepattern((options.getInputFilePattern())))
>         .apply(FileIO.readMatches())
>         .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
>         .apply(TextIO.readFiles())
>         .apply(ParDo.of(new TransformRecord()));
>
> Is there a recommended way to decompress and read Snappy files with Beam?
>
> Thanks,
> Chris
>
