[
https://issues.apache.org/jira/browse/JENA-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058668#comment-14058668
]
Rob Vesse commented on JENA-744:
--------------------------------
Actually, having gone and read the spec, I take back that point about the size
limit. The spec (http://tools.ietf.org/html/rfc1952) says the following:
{quote}
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
{quote}
So the uncompressed input may be larger than 4GB, and the size stored is simply
the size modulo 2^32. Notably, this value is in the trailer of the file, not the
header, so a decompressor should not even see it until the end of
decompression.
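To illustrate the point about the trailer, here is a minimal Python sketch that reads ISIZE from the last four bytes of a single-member gzip stream. The helper names read_isize and isize_for are made up for this example and are not part of Jena or of any gzip tool.

```python
import gzip
import struct


def read_isize(gz_bytes: bytes) -> int:
    """Return ISIZE: the last 4 bytes of a gzip member, little-endian unsigned."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


def isize_for(uncompressed_length: int) -> int:
    """Value RFC 1952 says is stored for an input of the given length."""
    return uncompressed_length % 2**32


# Small round trip: for a single-member stream the trailer holds the
# uncompressed length modulo 2^32.
payload = b"x" * 1000
compressed = gzip.compress(payload)
assert read_isize(compressed) == isize_for(len(payload))

# Arithmetic only (nothing is compressed here): an input of 5 GiB
# would be recorded as 1 GiB.
assert isize_for(5 * 2**30) == 2**30
```

Since the field wraps, a tool that reports only ISIZE (such as gzip -l) cannot distinguish inputs whose sizes differ by a multiple of 4GiB.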
Therefore any truncation that happens must be the result of an inaccurate
length in the actual compressed data blocks, not of the Gzip container itself.
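One way to check whether a given file really decompresses in full is to stream it to EOF and count the bytes, without consulting ISIZE up front. The following is a rough Python sketch under that assumption; count_uncompressed_bytes is a hypothetical helper, and this is not how tdbloader reads its input.

```python
import gzip
import io


def count_uncompressed_bytes(stream, chunk_size: int = 1 << 20) -> int:
    """Decompress a gzip stream chunk by chunk and return the total byte count."""
    total = 0
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:  # empty read means end of the compressed data
                break
            total += len(chunk)
    return total


# Small in-memory demonstration; a real check would pass open(path, "rb").
data = b"a" * 3_000_000
assert count_uncompressed_bytes(io.BytesIO(gzip.compress(data))) == len(data)
```

Comparing this count against the size of the extracted file would show whether the decompressor itself stops early or whether the loss happens later in the import.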
> Error importing from large gzip
> -------------------------------
>
> Key: JENA-744
> URL: https://issues.apache.org/jira/browse/JENA-744
> Project: Apache Jena
> Issue Type: Bug
> Components: TDB
> Reporter: Michael Kozakov
> Attachments: gzip.png
>
>
> gzip has a documented bug:
> http://www.freebsd.org/cgi/man.cgi?query=gzip#end
> "According to RFC 1952, the recorded file size is stored in a 32-bit
> integer, therefore, it can not represent files larger than 4GB. This
> limitation also applies to -l option of gzip utility."
> As a result, a 28GB compressed gz shows that the uncompressed size is 1.6GB.
> (screenshot attached)
> It seems like tdbloader relies on this information to know when to stop
> importing, and as a result, the imported database is incomplete. As a
> workaround, I have to extract the archive before using tdbloader to import
> the database, otherwise it will be missing the majority of items.