Beam can infer compression from file extensions for a variety of formats, but snappy is not currently among them:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java

ParquetIO and AvroIO each appear to support snappy, but as best I can tell there is no built-in support for reading text files compressed with snappy. I think you would need to use FileIO to match the files, and then implement a custom DoFn that takes each file object, streams its contents through a snappy decompressor, and outputs one record per line.

I imagine a PR to add snappy as a supported format in Compression.java would be welcome.

On Wed, Apr 22, 2020 at 1:16 PM Christopher Larsen <[email protected]> wrote:

> Hi devs,
>
> We are trying to build a pipeline to read snappy compressed text files
> that contain one record per line using the Java SDK.
>
> We have tried the following to read the files:
>
> p.apply("ReadLines",
>     FileIO.match().filepattern((options.getInputFilePattern())))
>     .apply(FileIO.readMatches())
>     .setCoder(SnappyCoder.of(ReadableFileCoder.of()))
>     .apply(TextIO.readFiles())
>     .apply(ParDo.of(new TransformRecord()));
>
> Is there a recommended way to decompress and read Snappy files with Beam?
>
> Thanks,
> Chris
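For what it's worth, here is a rough sketch of the custom-DoFn approach described above. It assumes the snappy-java library (org.xerial.snappy) is on the classpath and that the files use the snappy-java stream framing; the class names (ReadSnappyLines, DecompressSnappyFn) are illustrative, not anything in Beam itself:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.xerial.snappy.SnappyInputStream;

public class ReadSnappyLines {

  // DoFn that opens each matched file, streams its bytes through a
  // snappy decompressor, and emits one element per decompressed line.
  static class DecompressSnappyFn extends DoFn<FileIO.ReadableFile, String> {
    @ProcessElement
    public void processElement(@Element FileIO.ReadableFile file,
                               OutputReceiver<String> out) throws Exception {
      // file.open() returns a ReadableByteChannel; wrap it as an
      // InputStream, then decompress with SnappyInputStream.
      // Note: SnappyInputStream expects the snappy-java framing; raw or
      // hadoop-snappy encoded files would need a different decompressor.
      try (BufferedReader reader =
              new BufferedReader(
                  new InputStreamReader(
                      new SnappyInputStream(
                          Channels.newInputStream(file.open())),
                      StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(line);
        }
      }
    }
  }
}
```

Wired into the pipeline in place of the SnappyCoder/TextIO.readFiles() combination, it would look something like:

```java
p.apply(FileIO.match().filepattern(options.getInputFilePattern()))
 .apply(FileIO.readMatches())
 .apply(ParDo.of(new ReadSnappyLines.DecompressSnappyFn()))
 .apply(ParDo.of(new TransformRecord()));
```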
