damccorm opened a new issue, #20560:
URL: https://github.com/apache/beam/issues/20560

   CompressedFile._initialize_compressor hardcodes the compression level used 
when writing:
   
    
   self._compressor = zlib.compressobj(
       zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, self._gzip_mask)
    
   It would be good to be able to control this: I have a large set of 
GZIP-compressed files, and writing the same data back produces output 10x 
larger than the input.
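    
   To illustrate why the level matters, here is a standard-library-only 
sketch comparing output sizes at different zlib compression levels. The 
`zlib.MAX_WBITS | 16` argument asks zlib for a gzip-compatible container, 
which is assumed to mirror the `_gzip_mask` that CompressedFile passes in:
    
    import zlib
    
    data = b"the same data written back " * 1000
    
    def gzip_size(level):
        # Build a gzip-framed compressor at the given level, mimicking
        # what CompressedFile._initialize_compressor does with a
        # hardcoded Z_DEFAULT_COMPRESSION.
        compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
        return len(compressor.compress(data) + compressor.flush())
    
    # Z_DEFAULT_COMPRESSION is -1, which zlib maps to level 6;
    # Z_BEST_COMPRESSION is level 9, Z_NO_COMPRESSION is level 0.
    print(gzip_size(zlib.Z_DEFAULT_COMPRESSION))
    print(gzip_size(zlib.Z_BEST_COMPRESSION))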
    
   I've tried various monkeypatching approaches: these seem to work with the 
local runner, but fail when using the DataflowRunner. For example:
    
   import zlib
    
   import apache_beam as beam
   from apache_beam.io import WriteToText
   from apache_beam.io.filesystem import CompressedFile
    
   class WriteData(beam.PTransform):
       def __init__(self, dst):
           self._dst = dst
    
           # Attempted monkeypatch: swap CompressedFile's compressor
           # factory for one that uses the best compression level.
           def _initialize_compressor(self):
               self._compressor = zlib.compressobj(
                   zlib.Z_BEST_COMPRESSION, zlib.DEFLATED, self._gzip_mask
               )
    
           CompressedFile._initialize_compressor = _initialize_compressor
    
       def expand(self, p):
           return p | WriteToText(
               file_path_prefix=self._dst,
               file_name_suffix=".tsv.gz",
               compression_type="gzip",
           )
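    
   A likely explanation for the local-vs-Dataflow difference, sketched with 
only the standard library (the class names below are illustrative, not 
Beam's): Beam pickles the constructed transform and unpickles it on remote 
workers, and unpickling an instance does not re-run `__init__`, so a patch 
applied there never executes in the worker process:
    
    import pickle
    
    class Target:
        def method(self):
            return "original"
    
    class Patcher:
        def __init__(self):
            # Side effect at construction time: patch Target.
            Target.method = lambda self: "patched"
    
    p = Patcher()          # patch applies in this process
    blob = pickle.dumps(p)
    
    # Simulate a fresh worker process: Target starts out unpatched there.
    Target.method = lambda self: "original"
    
    restored = pickle.loads(blob)  # __init__ is NOT re-run on unpickling
    print(Target().method())       # still "original" on the "worker"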
   
   Imported from Jira 
[BEAM-11282](https://issues.apache.org/jira/browse/BEAM-11282). Original Jira 
may contain additional context.
   Reported by: JackWhelpton.

