[
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791101#action_12791101
]
Richard Ding commented on PIG-1110:
-----------------------------------
bq. 1. If you are worried about the API compatibility of PigStorage(), since
PigStorage() is the default LoadFunc of Pig, another option is to provide a
separate LoadFunc with compression support, for example a new LoadFunc such
as Bz2PigStorage().
I like this idea better. Bz2PigStorage would extend PigStorage and just set the
Hadoop compression codec in its constructor. If plain PigStorage is used, the
file extension determines the codec.
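
The shape of that proposal can be sketched in plain Java. This is only a rough
illustration under stated assumptions: StorageSketch and Bz2StorageSketch are
hypothetical stand-ins, not the real PigStorage or LoadFunc classes, and the
codec is modeled as a plain string rather than a Hadoop codec object.

```java
// Hypothetical stand-in for PigStorage; not the real Pig class.
class StorageSketch {
    // null means "no codec forced; decide from the file extension".
    private final String forcedCodec;

    StorageSketch() {
        this(null);
    }

    protected StorageSketch(String forcedCodec) {
        this.forcedCodec = forcedCodec;
    }

    // Returns the codec name this storage would use for the given file name.
    String codecFor(String fileName) {
        if (forcedCodec != null) {
            return forcedCodec;            // subclass pinned the codec
        }
        if (fileName.endsWith(".bz2")) {
            return "bzip2";                // chosen by extension
        }
        if (fileName.endsWith(".gz")) {
            return "gzip";                 // chosen by extension
        }
        return "none";                     // unknown extension: uncompressed
    }

    public static void main(String[] args) {
        System.out.println(new StorageSketch().codecFor("part-00000.gz"));
        System.out.println(new Bz2StorageSketch().codecFor("part-00000"));
    }
}

// Hypothetical stand-in for the proposed Bz2PigStorage: it only fixes the
// codec in its constructor and inherits everything else.
class Bz2StorageSketch extends StorageSketch {
    Bz2StorageSketch() {
        super("bzip2");
    }
}
```

The default class keeps its existing behavior, so callers of the plain storage
class are unaffected by the new subclass.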
bq. 2. Actually, the file name in the Store statement is the folder name, not
the file name; we will get part-00000.bz2 under this folder. part-00000.bz2 is
the real file that is consumed by Hadoop. Hadoop checks the file name rather
than the folder name to determine the compression codec.
You're right. But if you copy a .bz file from the local file system to HDFS, it
won't be recognized as a bzip file by Hadoop's TextInputFormat. The problem is
that Hadoop doesn't read the file header to determine the file type, but relies
on the file extension.
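
To make the difference concrete, here is a minimal sketch contrasting the two
detection strategies. It is a simplified model in plain Java, not Hadoop code:
the class and method names are hypothetical, and the suffix table assumes the
usual .gz and .bz2 extensions. The magic numbers are the standard ones for the
two formats ("BZh" for bzip2, 0x1f 0x8b for gzip).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of suffix-based vs. header-based detection.
class CodecLookupSketch {
    // Assumed suffix table: file name suffix -> codec name.
    private static final Map<String, String> BY_SUFFIX = new LinkedHashMap<>();
    static {
        BY_SUFFIX.put(".gz", "gzip");
        BY_SUFFIX.put(".bz2", "bzip2");
    }

    // Extension-based lookup: only the file name is consulted.
    static String codecFromName(String fileName) {
        for (Map.Entry<String, String> e : BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // unknown suffix: treated as uncompressed
    }

    // Header-based lookup: inspects the leading magic bytes instead.
    static String codecFromHeader(byte[] head) {
        if (head.length >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return "bzip2";
        }
        if (head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
            return "gzip";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] bzipHeader = "BZh9".getBytes(StandardCharsets.US_ASCII);
        System.out.println(codecFromName("part-00000.bz2")); // bzip2
        System.out.println(codecFromName("data.bz"));        // null
        System.out.println(codecFromHeader(bzipHeader));     // bzip2
    }
}
```

In this model a bzip2 file named data.bz is missed by the name-based lookup
even though its header identifies it, which is the situation described above.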
> Handle compressed file formats -- Gz, BZip with the new proposal
> ----------------------------------------------------------------
>
> Key: PIG-1110
> URL: https://issues.apache.org/jira/browse/PIG-1110
> Project: Pig
> Issue Type: Sub-task
> Reporter: Richard Ding
> Assignee: Richard Ding
> Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
>
>