C.J. Collier created BEAM-8180:
----------------------------------

             Summary: Files managed by Beam should have associated AVPs such as 
content-type and content-encoding instead of merely mimeType
                 Key: BEAM-8180
                 URL: https://issues.apache.org/jira/browse/BEAM-8180
             Project: Beam
          Issue Type: Improvement
          Components: io-go-gcp
         Environment: Google Compute Platform DataFlow
            Reporter: C.J. Collier


From customer:

 
{quote}We've updated our DataFlow templates to read and write with gzip 
compression. I noticed when .gz file is written the object's metadata defaults 
to "application/octet-stream" for Content-Type because it doesn't know what it 
is. I would like to have each file be plain/text for content-type and gzip for 
content-encoding. We may also add other metadata key/value pairs. I can't find 
a way to programmatically set these and other metadata values per object within 
DataFlow. I'm using TextIO right now and just doing .withCompression. I didn't 
see any other functions to achieve this or any DataFlow doc on it. Am I missing 
something?
{quote}
 

The MIME type of the output file can be set by supplying TextIO with your own 
WritableByteChannelFactory, one that reports your desired MIME type[0].
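
As a rough illustration of the factory approach described above, here is a 
minimal sketch that uses only the JDK. The ChannelFactory interface and the 
GzipTextFactory class are hypothetical stand-ins that mirror the general idea 
(wrap the raw output channel, report a MIME type); they are not Beam's 
WritableByteChannelFactory, whose exact methods should be checked against the 
source linked at [0].

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTextChannelSketch {

    // Hypothetical stand-in for a channel factory; NOT Beam's actual interface.
    interface ChannelFactory {
        WritableByteChannel create(WritableByteChannel underlying) throws IOException;
        String getMimeType();
    }

    // Compresses with gzip but still reports the type of the uncompressed payload.
    static final class GzipTextFactory implements ChannelFactory {
        @Override
        public WritableByteChannel create(WritableByteChannel underlying) throws IOException {
            return Channels.newChannel(
                new GZIPOutputStream(Channels.newOutputStream(underlying)));
        }

        @Override
        public String getMimeType() {
            return "text/plain";
        }
    }

    // Writes text through the factory's channel and returns the bytes that were stored.
    static byte[] writeCompressed(ChannelFactory factory, String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel out = factory.create(Channels.newChannel(sink))) {
            ByteBuffer buf = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                out.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompresses gzip bytes back to text, to check the round trip.
    static String readBack(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ChannelFactory factory = new GzipTextFactory();
        byte[] stored = writeCompressed(factory, "hello, beam\n");
        System.out.println("MIME type reported: " + factory.getMimeType());
        System.out.print("Round trip: " + readBack(stored));
    }
}
```

Note that a factory of this shape can only report a MIME type. It has no place 
to carry a content-encoding such as gzip, or any other key/value metadata, 
which is the gap the customer is describing.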

With the default WritableByteChannelFactory, TextIO writes files with a MIME 
type of "text/plain", but when "withCompression" is used, this becomes 
"application/octet-stream"[1][2].

Unfortunately, FileSystems.create[3] does not support setting a 
content-encoding on the output channel. I will ensure that this specific point 
is captured in the feature request, though at this point it becomes an upstream 
change to Beam rather than a change to Dataflow.

[0] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1175

[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L874

[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MimeTypes.java

[3] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L224



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
