[
https://issues.apache.org/jira/browse/SOLR-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070259#comment-16070259
]
Andrew Lundgren commented on SOLR-10981:
----------------------------------------
Because this uses the Content-Encoding header rather than the Content-Type
header, it does not interfere with the existing Content-Type handling.
In the patched ContentStreamBase, line 84 takes the content type from the
connection. The gzip contentEncoding check on line 89 does not change it, so
code using the content stream is unaffected. If contentEncoding is not set,
the code also checks whether the file name ends with ".gz". This fallback
could be dropped, though it seemed a reasonable convenience.
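The URL handling described above can be sketched roughly as follows. This is
an illustration only, not the code from the patch: the class and method names
(GzipUrlSketch, isGzipped, wrap, openUrl, gzip) are hypothetical, and treating
an explicitly set non-gzip encoding as "not gzipped" is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; names are illustrative and not taken from the patch.
class GzipUrlSketch {

    // Decide from Content-Encoding first; fall back to the ".gz" suffix
    // only when no encoding was reported (assumed behaviour).
    static boolean isGzipped(String contentEncoding, String name) {
        if (contentEncoding != null) {
            return "gzip".equalsIgnoreCase(contentEncoding.trim());
        }
        return name != null && name.toLowerCase(Locale.ROOT).endsWith(".gz");
    }

    // Layer a gunzip stream over the raw stream when needed.
    static InputStream wrap(InputStream raw, String contentEncoding, String name)
            throws IOException {
        return isGzipped(contentEncoding, name) ? new GZIPInputStream(raw) : raw;
    }

    // The content type would still come from conn.getContentType();
    // only the stream is wrapped, so Content-Type handling is untouched.
    static InputStream openUrl(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        return wrap(conn.getInputStream(), conn.getContentEncoding(), url.getPath());
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("id,name\n1,solr\n");
        // No Content-Encoding reported, so the ".gz" suffix decides.
        try (InputStream in =
                 wrap(new ByteArrayInputStream(compressed), null, "docs.csv.gz")) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Because the decision is made before any consumer sees the stream, callers read
decompressed bytes and never need to know about gzip.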
In the patched ContentStreamBase, lines 117 and 121, the content type of a
FileStream is determined by the first character found in the stream. Because
the stream is already open and the gunzip stream is applied over the input
stream, the code that determines the content type is unaffected. The
FileStream works with any existing format that is gzipped, as it determines
the content type from the first character of the decompressed stream.
(Attached is a new patch that makes this method use the getStream method on
line 117, which applies the gzip layer, rather than opening the file itself.)
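To illustrate why the first-character check keeps working, here is a minimal
sketch that sniffs the content type from a stream that has already been
gunzipped. The class and method names are hypothetical, and the mapping of
'<' to XML and '{' to JSON is an assumption about the kind of guess being
made, not a copy of the FileStream code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; names and the character mapping are assumptions.
class FirstCharSniffSketch {

    // Guess a content type from the first byte of an already
    // decompressed stream; returns null when there is no guess.
    static String guessContentType(InputStream decompressed) throws IOException {
        int first = decompressed.read();
        if (first == '<') return "application/xml";
        if (first == '{') return "application/json";
        return null;
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<add><doc/></add>");
        // The sniffing code sees decompressed bytes, never the gzip header.
        try (InputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            System.out.println(guessContentType(in));
        }
    }
}
```

Sniffing the raw gzipped bytes instead would always see the gzip magic number
first, which is why the gunzip layer has to sit below the detection code.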
I agree that using a generic {{Content-Type: application/gzip}} would lead to
confusion. To me, the gzip layer is the encoding of the content, not the type
itself. Using the content encoding lets you handle gzip at a lower layer and
keep all of the content-type support untouched.
The current handling of a FileStream and a file:// URL is inconsistent: the
FileStream tries to guess the content type from the first character, while
the file:// URL uses MIME types to determine the content type. They seemingly
should be consistent, though I did not try to make them so because, as the
FileStream implementation itself points out, its implementation is buggy.
> Allow update to load gzip files
> --------------------------------
>
> Key: SOLR-10981
> URL: https://issues.apache.org/jira/browse/SOLR-10981
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: SolrJ
> Affects Versions: 6.6
> Reporter: Andrew Lundgren
> Labels: patch
> Fix For: 4.10.4, 6.6, master (7.0)
>
> Attachments: SOLR-10981.patch
>
>
> We currently import large CSV files. We store them in gzip files as they
> compress at around 80%.
> To import them we must gunzip them and then import them. After that we no
> longer need the decompressed files.
> This patch allows directly opening either URLs or local files that are
> gzipped.
> For URLs, to determine whether the file is gzipped, it checks whether the
> content encoding is "gzip" or the file name ends in ".gz".
> For files, if the file name ends in ".gz", it assumes the file is gzipped.
> I have tested the patch with 4.10.4, 6.6.0 and master from git.