[ https://issues.apache.org/jira/browse/BEAM-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003388#comment-16003388 ]
ASF GitHub Bot commented on BEAM-1494:
--------------------------------------

GitHub user dhalperi opened a pull request:

    https://github.com/apache/beam/pull/2998

    [BEAM-1494] Correctly handle content-encoding in GcsFileSystem, fixing reading of such files in CompressedSource

    R: @jkff thoughts?
    CC: @chamikaramj

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/beam b1494-gcs-content-encoding

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2998

----
commit 7ef0f8afc88b292724228fb3507e6d0c77c0b1aa
Author: Dan Halperin <dhalp...@google.com>
Date:   2017-05-09T19:34:04Z

    FileBasedSource: isSplittable should not throw

    This is a legacy design from Dataflow 1.x that was a poor choice. All
    the information needed to know whether a source is splittable should
    be known at source construction time, and if runtime behavior is
    needed it should result in conservative choices, aka false.

commit 59e8e0ec27dfc498dacaaf425548681ed07a2d31
Author: Dan Halperin <dhalp...@google.com>
Date:   2017-05-09T19:36:10Z

    CompressedSource: only use delegate reader if the file is splittable

    Otherwise, it's likely compressed

commit b71f5dfed5b8e56dd01cca5a71e2fa72233ab363
Author: Dan Halperin <dhalp...@google.com>
Date:   2017-05-09T19:36:53Z

    GcsFileSystem: mark content-encoded files as not seekable

    That is the truth (since they are actually compressed) and will
    result in correct data when reading from them in, e.g., TextIO

----

> GcsFileSystem should check content encoding when setting IsReadSeekEfficient
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-1494
>                 URL: https://issues.apache.org/jira/browse/BEAM-1494
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Pei He
>            Assignee: Daniel Halperin
>
> It is incorrect to set IsReadSeekEfficient true for files with content
> encoding set to gzip. This is an inherited issue from GcsIOChannelFactory.
> https://cloud.google.com/storage/docs/transcoding#content-type_vs_content-encoding

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
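
For readers following the issue, the rule it describes can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code from the pull request: the class `ContentEncodingSketch` and both of its methods are hypothetical and do not use any Beam or GCS client API. It assumes the object's Content-Encoding value is already available as a string (or `null` when unset), and it treats any non-empty encoding as "not seekable", which is broader than the gzip case named in the issue.

```java
/**
 * Hypothetical sketch of the rule discussed in BEAM-1494.
 * None of these names come from Beam; they exist only for illustration.
 */
public class ContentEncodingSketch {

    /**
     * Returns true only when the object has no Content-Encoding.
     * Assumption: a content-encoded object (e.g. "gzip") may be transcoded
     * on read, so byte offsets into it cannot be trusted for seeking.
     */
    static boolean isReadSeekEfficient(String contentEncoding) {
        return contentEncoding == null || contentEncoding.trim().isEmpty();
    }

    /**
     * Conservative splittability check: never throws, and answers false
     * whenever the metadata says the file is not seek-efficient.
     */
    static boolean isSplittable(String contentEncoding) {
        return isReadSeekEfficient(contentEncoding);
    }

    public static void main(String[] args) {
        // Plain object, no Content-Encoding set.
        System.out.println("plain: " + isSplittable(null));
        // Object stored with Content-Encoding: gzip.
        System.out.println("gzip:  " + isSplittable("gzip"));
    }
}
```

Under these assumptions, a reader that only takes the seek-based path when `isSplittable` returns true would fall back to sequential, decompressing reads for content-encoded objects, which matches the behavior the commit messages above describe.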