[
https://issues.apache.org/jira/browse/BEAM-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109703#comment-16109703
]
Ben Chambers edited comment on BEAM-2708 at 8/1/17 8:35 PM:
------------------------------------------------------------
This looks to be a bug in the CompressedSource support for BZIP2. Specifically,
we create the stream with:
{code:java}
return Channels.newChannel(
    new BZip2CompressorInputStream(Channels.newInputStream(channel)));
{code}
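The single-argument constructor defaults to {{decompressConcatenated = false}},
so only the first "stream" within the {{bz2}} file is actually read. pbzip2
compresses blocks in parallel and writes each one as a separate bzip2 stream,
so its output is a concatenation of many streams.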
The fix is easy -- change that code to:
{code:java}
return Channels.newChannel(
    new BZip2CompressorInputStream(Channels.newInputStream(channel), true));
{code}
But coming up with a test is a bit harder.
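One possible shape for such a test, as a rough sketch rather than the actual
Beam test: build a multi-stream input by writing two complete bzip2 streams
back to back (similar to what pbzip2 produces), then decompress it with both
settings of the flag. The class and method names here are hypothetical, and the
sketch calls Commons Compress directly instead of going through
CompressedSource.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

// Hypothetical helper class, for illustration only; not part of Beam.
public class ConcatenatedBzip2Sketch {

  // Compresses the text as one complete, self-contained bzip2 stream.
  static byte[] compress(String text) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (BZip2CompressorOutputStream out = new BZip2CompressorOutputStream(bytes)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }

  // Concatenates two independent bzip2 streams into a single byte array.
  static byte[] twoStreams(String first, String second) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    bytes.write(compress(first));
    bytes.write(compress(second));
    return bytes.toByteArray();
  }

  // Reads the input to the end with the given decompressConcatenated setting.
  static String decompress(byte[] data, boolean decompressConcatenated) throws IOException {
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    try (InputStream in =
        new BZip2CompressorInputStream(new ByteArrayInputStream(data), decompressConcatenated)) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        result.write(buffer, 0, n);
      }
    }
    return new String(result.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

With {{decompressConcatenated = false}} the reader should stop at the end of the
first stream, so a test can assert that only the {{true}} variant returns the
contents of both streams.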
> Support for pbzip2 in IO
> ------------------------
>
> Key: BEAM-2708
> URL: https://issues.apache.org/jira/browse/BEAM-2708
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions, sdk-py
> Reporter: Pablo Estrada
> Assignee: Ben Chambers
>
> I'm not sure which components to file this against. A user has observed that
> pbzip2 files are not being properly decompressed:
> https://stackoverflow.com/questions/45439117/google-dataflow-only-partly-uncompressing-files-compressed-with-pbzip2