[
https://issues.apache.org/jira/browse/PIG-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tomas Hudik updated PIG-4533:
-----------------------------
Description:
The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as
they are NOT CONCATENATED FILES generated in this manner: ..."_
This is not true for tar.gz, because:
# I ran a test: I concatenated and compressed some files and processed them,
then did the same with the raw (uncompressed) files. The results were identical.
# The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
https://issues.apache.org/jira/i#browse/HADOOP-6835 say the concatenation
problems were fixed in Hadoop 0.22 and Hadoop 0.20 respectively. So Hadoop
(1 and 2) already supports this.
Pig itself handles only tar.bz2 (tar.gz is handled by hadoop-common).
Therefore:
# tar.bz2 should be handled by hadoop-common as well (Pig no longer needs to
handle it). (I believe
https://github.com/apache/pig/tree/trunk/lib-src/bzip2/org/apache should be
removed.)
# The documentation should be corrected accordingly (concatenated tar.gz and
tar.bz2 files are processed correctly).
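A minimal sketch of the kind of check described in the first item of the test
list, written with Python's standard library rather than Pig or Hadoop, so it
only illustrates the idea and says nothing about Pig's own code path. The sample
records and variable names are made up for illustration.

```python
import bz2
import gzip

# Hypothetical sample records standing in for the raw input files.
part_a = b"1\talice\n2\tbob\n"
part_b = b"3\tcarol\n"

# Compress each part separately, then join the compressed bytes end to end.
# This is the same as concatenating the compressed files byte for byte.
concatenated_gz = gzip.compress(part_a) + gzip.compress(part_b)
concatenated_bz2 = bz2.compress(part_a) + bz2.compress(part_b)

# A reader that supports multi-member input returns the data of every member,
# not just the first one.
from_gz = gzip.decompress(concatenated_gz)
from_bz2 = bz2.decompress(concatenated_bz2)

# The uncompressed baseline: the raw parts concatenated without compression.
raw = part_a + part_b
print(from_gz == raw, from_bz2 == raw)
```

If a reader stopped after the first compressed member, the comparison against
the raw baseline would fail, which is the behaviour the quoted note warns about.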
was:
The documentation (since at least 0.11.1) says:
http://pig.apache.org/docs/r0.11.1/func.html#handling-compression
_"Note: PigStorage and TextLoader correctly read compressed files as long as
they are NOT CONCATENATED FILES generated in this manner: ..."_
I doubt this is still true, because:
1. I ran a test: I concatenated some files and processed them. However, all the
results were identical to the ones produced on non-concatenated
files. Why? They should be different...
2. The Jira issues https://issues.apache.org/jira/i#browse/HADOOP-4012 and
https://issues.apache.org/jira/i#browse/HADOOP-6835 say this was fixed in
Hadoop 0.22 and Hadoop 0.20 respectively. So Hadoop (1 and 2) supports
this. I suppose Pig does not do compression on its own but rather
depends on the hadoop-core (or hadoop-common) libraries.
If I'm right, the documentation should be fixed (delete the part about
problems with concatenated compressed files).
> support of concatenated bz2/gz files
> ------------------------------------
>
> Key: PIG-4533
> URL: https://issues.apache.org/jira/browse/PIG-4533
> Project: Pig
> Issue Type: Bug
> Components: documentation, parser
> Reporter: Tomas Hudik
> Fix For: 0.16.0