[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791101#action_12791101
 ] 

Richard Ding commented on PIG-1110:
-----------------------------------

bq. 1. If you worry about the API compatibility of PigStorage(), since 
PigStorage() is the default LoadFunc of Pig, another option is to provide a 
separate LoadFunc with compression support; that is, we can create a new 
LoadFunc such as Bz2PigStorage().

I like this idea better. Bz2PigStorage would extend PigStorage and simply set 
the Hadoop compression codec in its constructor. If plain PigStorage is used, 
the file extension determines the codec.
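
A minimal sketch of that split, to make the proposal concrete. The classes 
SketchStorage and Bz2SketchStorage and the method codecFor are hypothetical 
stand-ins invented for this illustration; they are not Pig's PigStorage API, 
and the codec is modelled as a plain string rather than a Hadoop codec object.

```java
// Hypothetical stand-ins for illustration only; NOT Pig's real classes.
class SketchStorage {
    // null means "no codec forced; decide from the file extension".
    private final String forcedCodec;

    SketchStorage() {
        this(null);
    }

    protected SketchStorage(String forcedCodec) {
        this.forcedCodec = forcedCodec;
    }

    // Returns the codec name that would be applied to the given file.
    String codecFor(String fileName) {
        if (forcedCodec != null) {
            return forcedCodec;          // subclass fixed the codec up front
        }
        if (fileName.endsWith(".bz2")) {
            return "bzip2";
        }
        if (fileName.endsWith(".gz")) {
            return "gzip";
        }
        return "none";                   // unknown extension: treat as plain text
    }
}

// The subclass only pins the codec in its constructor, as proposed above.
class Bz2SketchStorage extends SketchStorage {
    Bz2SketchStorage() {
        super("bzip2");
    }
}
```

With this shape, new SketchStorage().codecFor("part-00000") yields "none", 
while new Bz2SketchStorage().codecFor("part-00000") yields "bzip2" regardless 
of the file name.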

bq. 2. Actually, the file name in the Store statement is the folder name, not 
the file name; we will get part-00000.bz2 under this folder. part-00000.bz2 is 
the real file consumed by Hadoop. Hadoop checks the file name rather than the 
folder name to determine the compression codec.

You're right. But if you copy a .bz file from the local file system to HDFS, 
it won't be recognized as a bzip file by Hadoop's TextInputFormat. The problem 
is that Hadoop doesn't read the file header to determine the file type, but 
relies on the file extension.
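
To illustrate the difference, here is a rough sketch contrasting the two 
detection strategies. The class CodecDetect and its methods are hypothetical 
and written only for this example; this is not Hadoop code. It assumes the 
usual suffixes (.gz, .bz2) and the standard magic bytes (0x1f 0x8b for gzip, 
"BZh" for bzip2).

```java
// Hypothetical helper for illustration only; NOT part of Hadoop or Pig.
final class CodecDetect {
    private CodecDetect() {
    }

    // Extension-based lookup: the file name alone decides the codec.
    static String byExtension(String fileName) {
        if (fileName.endsWith(".bz2")) {
            return "bzip2";
        }
        if (fileName.endsWith(".gz")) {
            return "gzip";
        }
        return "none";
    }

    // Header-based lookup: inspect the first bytes of the file content.
    static String byHeader(byte[] head) {
        if (head.length >= 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b) {
            return "gzip";               // gzip magic number
        }
        if (head.length >= 3
                && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return "bzip2";              // bzip2 stream signature
        }
        return "none";
    }
}
```

Under these assumptions, a bzip2 file named data.bz is missed by 
byExtension("data.bz") but would still be identified by byHeader, since its 
content starts with "BZh".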


> Handle compressed file formats -- Gz, BZip with the new proposal
> ----------------------------------------------------------------
>
>                 Key: PIG-1110
>                 URL: https://issues.apache.org/jira/browse/PIG-1110
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
>
>


