[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791101#action_12791101 ]
Richard Ding commented on PIG-1110:
-----------------------------------

bq. 1. If you worry about the API compatibility of PigStorage() since PigStorage() is the default LoadFunc of Pig, there's another option: we can provide another LoadFunc that has the ability to handle compression. I mean we can create a new LoadFunc such as Bz2PigStorage().

I like this idea better. Bz2PigStorage extends PigStorage and just sets the Hadoop compressor in its constructor. If PigStorage is used, then the file extension determines the codec.

bq. 2. Actually the file name in the Store statement is the folder name, not the file name; we will get part-00000.bz2 under this folder. The part-00000.bz2 is the real file which is consumed by Hadoop. Hadoop will check the file name rather than the folder name to determine the compression codec.

You're right. But if you copy a .bz file from the local file system to HDFS, it won't be recognized as a bzip file by Hadoop's TextInputFormat. The problem is that Hadoop doesn't read the file header to determine the file type, but relies on the file extension.

> Handle compressed file formats -- Gz, BZip with the new proposal
> ----------------------------------------------------------------
>
>                 Key: PIG-1110
>                 URL: https://issues.apache.org/jira/browse/PIG-1110
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Richard Ding
>            Assignee: Richard Ding
>         Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
>
>
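
To illustrate the extension-versus-header point in the comment above, here is a minimal Python sketch. It is not Hadoop or Pig code: the table EXTENSION_CODECS and the helpers codec_from_name and codec_from_header are hypothetical names made up for this example. The sketch assumes the suffixes ".gz" and ".bz2" are the ones that map to codecs, and uses the well-known gzip and bzip2 magic bytes for the header check.

```python
import bz2
import gzip

# Assumed suffix table for illustration only (hypothetical, not read from Hadoop).
EXTENSION_CODECS = {".gz": "gzip", ".bz2": "bzip2"}

# Leading "magic" bytes of each compressed format.
MAGIC_CODECS = {b"\x1f\x8b": "gzip", b"BZh": "bzip2"}


def codec_from_name(file_name):
    """Pick a codec purely from the file name suffix; the content is never read."""
    for suffix, codec in EXTENSION_CODECS.items():
        if file_name.endswith(suffix):
            return codec
    return None


def codec_from_header(data):
    """Pick a codec from the first bytes of the content; the name is ignored."""
    for magic, codec in MAGIC_CODECS.items():
        if data.startswith(magic):
            return codec
    return None


# A real bzip2 payload that has been given a ".bz" name.
payload = bz2.compress(b"a\tb\n")

by_name = codec_from_name("data.bz")         # suffix is not in the table
by_header = codec_from_header(payload)       # content still starts with "BZh"
part_file = codec_from_name("part-00000.bz2")
```

With these assumptions, a file named data.bz is not matched by name even though its bytes are valid bzip2, while part-00000.bz2 is matched by its suffix. A header check would identify both, which is the difference described in the comment.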