[ 
https://issues.apache.org/jira/browse/TIKA-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178930#comment-17178930
 ] 

Carina Antunes commented on TIKA-3169:
--------------------------------------

A small note that the docs have a bug 
[https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-Transfer-LayerCompression].

They should probably read something like the following (it is still unclear 
whether we should send gzip or application/gzip):

---

If you want to {{gzip}} your files before sending them to {{tika-server}}, add a {{Content-Encoding}} header:
{noformat}
gzip test_my_doc.pdf{noformat}
{noformat}
curl -T test_my_doc.pdf.gz -H "Content-Encoding: application/gzip" http://localhost:9998/rmeta{noformat}
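
Since it is still open which header value is right, a small helper makes it easy to try both. The following is only a sketch: the function name {{put_gzipped}} and its argument order are made up for illustration, and it assumes {{mktemp}} is available and that {{tika-server}} listens on localhost:9998. It compresses a copy of the file (leaving the original untouched) and PUTs it with whichever {{Content-Encoding}} value is passed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika or curl):
#   put_gzipped FILE [ENCODING] [URL]
# Gzips a temporary copy of FILE and PUTs it with "Content-Encoding: ENCODING".
put_gzipped() {
    file=$1
    encoding=${2:-gzip}                        # assumed default header value
    url=${3:-http://localhost:9998/rmeta}      # assumed local tika-server
    tmp=$(mktemp) || return 1
    # gzip -c writes to stdout, so the original file is kept as-is
    gzip -c "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    curl -s -T "$tmp" -H "Content-Encoding: $encoding" "$url"
    status=$?
    rm -f "$tmp"
    return $status
}
```

Calling {{put_gzipped test_my_doc.pdf gzip}} and then {{put_gzipped test_my_doc.pdf application/gzip}} should show whether the server treats the two values differently.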
 

If you want {{tika-server}}  to compress the output of the parse:
{noformat}
curl -T test_my_doc.pdf.gz -H "Accept-Encoding: gzip, deflate" http://localhost:9998/rmeta --compressed{noformat}
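
For context, curl's {{--compressed}} option asks for a compressed response and decompresses it transparently. Without it, the body arrives still compressed whenever the server honours {{Accept-Encoding}}. The sketch below shows the manual equivalent; the helper name {{decode_body}} and the file name {{response.body}} are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print a saved response body, gunzipping it first if it
# starts with the gzip magic bytes (1f 8b); otherwise print it unchanged.
decode_body() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d '[:space:]')
    if [ "$magic" = "1f8b" ]; then
        gzip -dc "$1"
    else
        cat "$1"
    fi
}

# Assumed usage against a local tika-server (not executed here):
#   curl -s -T test_my_doc.pdf -H "Accept-Encoding: gzip" \
#        http://localhost:9998/rmeta -o response.body
#   decode_body response.body
```

If the saved body is not gzip data, it is printed unchanged, which suggests the server did not compress the output.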
 

> rmeta and Content-Encoding application/gzip vs gzip
> ---------------------------------------------------
>
>                 Key: TIKA-3169
>                 URL: https://issues.apache.org/jira/browse/TIKA-3169
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: Carina Antunes
>            Priority: Minor
>
> If I send a pdf with `-H "Content-Encoding: application/gzip"` to rmeta, I 
> get a different result than if I send it with `-H "Content-Encoding: gzip"`. 
> The first adds an object to the response array with "Content-Type": 
> "application/gzip":
> {code:java}
> [{
>  "Content-Type": "application/gzip",
>  "X-Parsed-By": [
>  "org.apache.tika.parser.DefaultParser",
>  "org.apache.tika.parser.pkg.CompressorParser"
>  ],
>  "X-TIKA:embedded_depth": "0",
>  "X-TIKA:parse_time_millis": "31"
>  },
> {
> ...
> "Content-Type": "application/pdf",
> ...
> }]{code}
> while the latter only returns the pdf object:
>  
>  
> {code:java}
> [{
> ...
> "Content-Type": "application/pdf",
> ...
> }]
> {code}
>  
> Example:
>  
> {code:java}
> $ gzip test.pdf
> $ curl -T test.pdf.gz http://localhost:9998/rmeta/text -H "Content-Encoding: application/gzip"
> {code}
>  
> vs 
>  
> {code:java}
> $ curl -T test.pdf.gz http://localhost:9998/rmeta/text -H "Content-Encoding: gzip"
> {code}
> I am not sure whether this behaviour is intended. 
> If no header is sent, the default behaviour is the same as with "application/gzip":
> {code:java}
> $ curl -T test.pdf.gz http://localhost:9998/rmeta/text  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
