damccorm opened a new issue, #20560: URL: https://github.com/apache/beam/issues/20560
`CompressedFile._initialize_compressor` hardcodes the compression level used when writing:

```python
self._compressor = zlib.compressobj(
    zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, self._gzip_mask)
```

It would be good to be able to control this, as I have a large set of GZIP compressed files that are creating output 10x larger than the input size when writing the same data back. I've tried various monkeypatching approaches: these seem to work with the local runner, but fail when using DataflowRunner. For example:

```python
import zlib

import apache_beam as beam
from apache_beam.io import WriteToText
from apache_beam.io.filesystem import CompressedFile


class WriteData(beam.PTransform):
    def __init__(self, dst):
        self._dst = dst

    def _initialize_compressor(self):
        self._compressor = zlib.compressobj(
            zlib.Z_BEST_COMPRESSION, zlib.DEFLATED, self._gzip_mask)

    CompressedFile._initialize_compressor = _initialize_compressor

    def expand(self, p):
        return p | WriteToText(
            file_path_prefix=self._dst,
            file_name_suffix=".tsv.gz",
            compression_type="gzip",
        )
```

Imported from Jira [BEAM-11282](https://issues.apache.org/jira/browse/BEAM-11282). Original Jira may contain additional context.

Reported by: JackWhelpton.
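To illustrate the effect being requested, here is a minimal stdlib-only sketch (not Beam code) showing how `zlib.compressobj`'s first argument controls output size, where `16 + zlib.MAX_WBITS` selects the gzip container format in the way Beam's `_gzip_mask` does; the helper name `gzip_bytes` is made up for the example:

```python
import zlib

# wbits of 16 + zlib.MAX_WBITS tells zlib to emit a gzip-framed stream,
# matching what CompressedFile's _gzip_mask selects.
GZIP_WBITS = 16 + zlib.MAX_WBITS


def gzip_bytes(data: bytes, level: int) -> bytes:
    """Compress data into a gzip stream at the given level (0-9)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, GZIP_WBITS)
    return comp.compress(data) + comp.flush()


data = (b"col1\tcol2\tcol3\nalpha\tbeta\tgamma\n") * 5000
fast = gzip_bytes(data, zlib.Z_BEST_SPEED)        # level 1
best = gzip_bytes(data, zlib.Z_BEST_COMPRESSION)  # level 9

# Both streams round-trip to the original bytes; level 9 trades CPU
# for a smaller (or at worst equal) output than level 1.
assert zlib.decompress(best, GZIP_WBITS) == data
print(len(data), len(fast), len(best))
```

Since `CompressedFile` hardcodes `zlib.Z_DEFAULT_COMPRESSION`, there is currently no supported way to pass a different level through `WriteToText`; exposing it as a parameter would make the choice above available to pipeline authors.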
