[
https://issues.apache.org/jira/browse/BEAM-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959330#comment-16959330
]
Sascha Brawer commented on BEAM-683:
------------------------------------
Note that bzip2 compression blocks are not necessarily aligned with line
breaks, so the split point might be in the middle of an input line. However,
Apache Hadoop can split bzip2 input files; perhaps you can look at their code
for inspiration.
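
To make the line-break point concrete, here is a minimal sketch of the usual convention for splittable text input: a reader whose split does not start at offset 0 discards everything up to and including the first newline, and every reader keeps going past its split end until it has finished the line it is in. The function name `lines_for_split` is hypothetical, and the offsets here index plain decompressed bytes, which is a simplification; in a real bzip2 reader the split boundaries would be compressed block positions.

```python
def lines_for_split(data: bytes, start: int, end: int):
    """Yield the lines owned by the split [start, end) of `data`.

    A line belongs to the split in which it starts, except that a line
    starting exactly at `start` (for start > 0) is left to the previous
    split, which reads one line past its own end to pick it up.
    """
    pos = start
    if start > 0:
        # Skip the (possibly partial) first line; the previous split owns it.
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    # Read whole lines, running past `end` to finish the last one.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        yield data[pos:stop]
        pos = stop
```

With this rule, two adjacent splits return every line exactly once, no matter where the cut between them falls.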
Making Beam split bzip2 input would be super useful for processing Wikidata.org
dumps. See e.g.
[here|https://stackoverflow.com/questions/52755347/google-dataflow-creates-only-one-worker-for-large-bz2-file]
where someone found that Dataflow created only one worker for a large bz2 file.
> Make BZIP compressed files splittable
> --------------------------------------
>
> Key: BEAM-683
> URL: https://issues.apache.org/jira/browse/BEAM-683
> Project: Beam
> Issue Type: Wish
> Components: sdk-java-core, sdk-py-core
> Reporter: Tim Sears
> Priority: Minor
> Original Estimate: 10h
> Remaining Estimate: 10h
>
> Bzip2 compresses data in blocks, so it should be possible to do dynamic
> splitting. To do this: Seek to a location in the bzip2 file, then scan
> forward until you find the 6-byte block-start sequence 0x314159265359 (which
> is the 12-digit approximation of pi). You can use a bzip2 decompressor from
> that point onwards.
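
The scan described above can be sketched as follows. One detail worth keeping in mind: bzip2 blocks are packed at bit granularity, so the 48-bit marker generally does not start on a byte boundary and the search has to slide one bit at a time. The function name `find_block_starts` is hypothetical, and this is only a sketch of the search step. A real reader would still have to realign the bits and supply a stream header before handing the block to a decompressor, and would have to tolerate the marker pattern appearing by chance inside compressed data.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit block-start marker
MASK_48 = (1 << 48) - 1


def find_block_starts(data: bytes, start_bit: int = 0):
    """Yield the bit offset of each block-start marker at or after start_bit."""
    window = 0      # the most recent 48 bits seen
    bits_seen = 0
    first_byte = start_bit // 8
    for byte_index in range(first_byte, len(data)):
        byte = data[byte_index]
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                marker_start = byte_index * 8 + (8 - shift) - 48
                if marker_start >= start_bit:
                    yield marker_start


# The first block comes right after the 4-byte stream header ("BZh" plus the
# level digit), so its marker is found at bit offset 32.
sample = bz2.compress(b"hello world\n" * 1000)
first_block = next(find_block_starts(sample))
```

Going bit by bit in pure Python is slow; it is meant to show the idea rather than to be used on multi-gigabyte dumps.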