[
https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12549464
]
Benjamin Reed commented on PIG-42:
----------------------------------
There are two problems with just using an empty file.
1) The signature is just too small to reliably detect the split. Recovering
from a misdetected split isn't as simple as retrying, because it usually means
you get an OutOfMemoryError or you may have already returned bad data.
2) You have to revert to relying on an extension to detect splittability. This
ends up being pretty hokey because most gzip utilities are looking for a .gz
extension. The splittable gzip format is completely compatible with existing
gzip utilities. Also, if a user puts the wrong extension on a file, splits may
not happen when they could, or we may try to split files that we cannot.
Plus it's really nice to be able to do a head file.gz and see right away whether
the file is splittable or not.
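
For illustration only, here is a minimal sketch of the sync-signature idea from
the issue description quoted below. It is written in Python rather than Pig's
Java and is not the format used by gzip.patch. The salt value, the choice of
the gzip FNAME header field to carry it, and all function names are
assumptions made for this example.

```python
import gzip
import io

# Hypothetical salt; the real patch may use a different value or header field.
SALT = "pig-sync-5f3a9c1e7b2d4a60"


def make_sync_member(salt: str = SALT) -> bytes:
    """Return an empty gzip member whose FNAME header field carries the salt."""
    buf = io.BytesIO()
    # mtime=0 keeps the header bytes identical on every call.
    with gzip.GzipFile(filename=salt, mode="wb", fileobj=buf, mtime=0):
        pass  # write no payload, so the member decompresses to b""
    return buf.getvalue()


def write_splittable(chunks, sync: bytes) -> bytes:
    """Compress each chunk as its own gzip member, preceded by the sync member."""
    return b"".join(sync + gzip.compress(chunk) for chunk in chunks)


def split_at_sync(data: bytes, sync: bytes):
    """Cut the stream at every occurrence of the sync member."""
    offsets = []
    pos = data.find(sync)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(sync, pos + len(sync))
    # Each segment runs from one sync member to the next (or to the end).
    return [data[a:b] for a, b in zip(offsets, offsets[1:] + [len(data)])]


if __name__ == "__main__":
    chunks = [b"alpha\n" * 100, b"beta\n" * 100, b"gamma\n" * 100]
    sync = make_sync_member()
    data = write_splittable(chunks, sync)
    # An ordinary multi-member gzip reader sees one continuous file.
    print(gzip.decompress(data) == b"".join(chunks))
    # Each segment can also be decompressed independently.
    print([gzip.decompress(s) == c for s, c in zip(split_at_sync(data, sync), chunks)])
```

Because every segment is a sequence of valid gzip members, the whole file stays
readable by ordinary gzip tools. The byte search in split_at_sync is where
point 1 bites: the shorter the signature, the more likely the same bytes turn
up by chance inside compressed data.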
> Pig should be able to split Gzip files like it can split Bzip files
> -------------------------------------------------------------------
>
> Key: PIG-42
> URL: https://issues.apache.org/jira/browse/PIG-42
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Reporter: Benjamin Reed
> Attachments: gzip.patch
>
>
> It would be nice to be able to split gzip files like we can split bzip files.
> Unfortunately, we don't have a sync point for the split in the gzip format.
> The gzip file format supports the notion of concatenated gzipped files. When
> gzipped files are concatenated together they are treated as a single file. So
> to make a gzipped file splittable we can use an empty compressed file with
> some salt in the headers as a sync signature. Then we can make the gzip file
> splittable by using this sync signature between compressed segments of the
> file.