[ 
https://issues.apache.org/jira/browse/BEAM-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-8180:
----------------------------------
    Component/s:     (was: io-go-gcp)
                 io-java-text

> Files managed by beam should have associated AVPs such as content-type and 
> content-encoding instead of merely mimeType
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-8180
>                 URL: https://issues.apache.org/jira/browse/BEAM-8180
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-text
>         Environment: Google Compute Platform DataFlow
>            Reporter: C.J. Collier
>            Priority: Minor
>
> From customer:
>  
> {quote}We've updated our DataFlow templates to read and write with gzip 
> compression. I noticed when .gz file is written the object's metadata 
> defaults to "application/octet-stream" for Content-Type because it doesn't 
> know what it is. I would like to have each file be plain/text for 
> content-type and gzip for content-encoding. We may also add other metadata 
> key/value pairs. I can't find a way to programmatically set these and other 
> metadata values per object within DataFlow. I'm using TextIO right now and 
> just doing .withCompression. I didn't see any other functions to achieve this 
> or any DataFlow doc on it. Am I missing something?
> {quote}
>  
> The MIME type of the output file can be set by supplying your own 
> WritableByteChannelFactory to TextIO which sets the MIME type to your desired 
> value[0].
> The default WritableByteChannelFactory for TextIO reports the MIME type 
> "text/plain", but when "withCompression" is used, the reported type becomes 
> "application/octet-stream"[1][2].
> Unfortunately, FileSystems.create[3] does not support setting a content-encoding 
> on the output channel. I will ensure that this specific point is captured in 
> the feature request, though at this point it becomes an upstream change to 
> Beam rather than a change to Dataflow.
> [0] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1175
> [1] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L874
> [2] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/MimeTypes.java
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L224
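
The WritableByteChannelFactory workaround described in the issue can be sketched roughly as follows. This is a minimal illustration, not Beam's own code: ChannelFactory is a local stand-in that mirrors the three methods of Beam's FileBasedSink.WritableByteChannelFactory so the example compiles without Beam on the classpath, and GzipTextPlainFactory and GzipFactoryDemo are hypothetical names. The factory gzips the bytes itself while still reporting "text/plain" as the MIME type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Local stand-in (assumption) mirroring the methods of Beam's
// FileBasedSink.WritableByteChannelFactory, so this sketch needs no Beam jars.
interface ChannelFactory extends Serializable {
  WritableByteChannel create(WritableByteChannel channel) throws IOException;
  String getMimeType();
  String getSuggestedFilenameSuffix();
}

// Hypothetical factory: compresses output with gzip but reports "text/plain".
class GzipTextPlainFactory implements ChannelFactory {
  @Override
  public WritableByteChannel create(WritableByteChannel channel) throws IOException {
    // Wrap the destination channel so every byte written is gzip-compressed.
    return Channels.newChannel(
        new GZIPOutputStream(Channels.newOutputStream(channel), true));
  }

  @Override
  public String getMimeType() {
    return "text/plain"; // the Content-Type the customer asked for
  }

  @Override
  public String getSuggestedFilenameSuffix() {
    return ".gz";
  }
}

class GzipFactoryDemo {
  // Writes text through the factory and returns the compressed bytes.
  static byte[] compress(String text) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (WritableByteChannel out =
        new GzipTextPlainFactory().create(Channels.newChannel(sink))) {
      ByteBuffer buf = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
      while (buf.hasRemaining()) {
        out.write(buf);
      }
    }
    return sink.toByteArray();
  }

  // Decompresses gzip bytes back to text to confirm the round trip.
  static String decompress(byte[] gz) throws IOException {
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] gz = compress("hello\n");
    System.out.println(new GzipTextPlainFactory().getMimeType());
    System.out.println(decompress(gz).equals("hello\n"));
  }
}
```

In a real pipeline the class would implement Beam's own interface and be passed to TextIO.write().withWritableByteChannelFactory(...) in place of withCompression, assuming that method is available in the Beam version in use. As the issue notes, this only affects the Content-Type; it does not set a Content-Encoding on the written object.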



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
