[
https://issues.apache.org/jira/browse/TIKA-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohsen updated TIKA-2724:
-------------------------
Description:
When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status
code, Tika happily returns a 200 response.
*How to reproduce the issue*: Run Tika server with the
{{-enableUnsecureFeatures}} and {{-enableFileUrl}} options. Then send a
{{fileUrl}} to the server that returns a 3xx status code. Here is a sample
curl session:
{code:java}
$ curl -v google.com
* Rebuilt URL to: google.com/
* Trying 216.58.216.142...
* TCP_NODELAY set
* Connected to google.com (216.58.216.142) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Wed, 05 Sep 2018 15:31:51 GMT
< Expires: Fri, 05 Oct 2018 15:31:51 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact
$ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9998 (#0)
> PUT /rmeta/text HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.54.0
> Accept: */*
> fileUrl:http://google.com
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 05 Sep 2018 15:25:12 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
<
* Connection #0 to host localhost left intact
[{"Content-Encoding":"UTF-8","Content-Type":"text/html;
charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History |
Settings | Sign in\n\n\n \n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage
tools\n\n\n\n\nGoogle offered in: Français \n\n\nAdvertising ProgramsBusiness
Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy -
Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
I am using Tika server to pull files from S3 and parse them, but when S3
responds with a redirect, Tika neither follows it nor returns an error code.
See https://docs.aws.amazon.com/AmazonS3/latest/dev/Redirects.html
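Until the server handles this itself, one possible client-side workaround is to check the URL for a redirect before passing it as {{fileUrl}}. The following is only a sketch under stated assumptions: the class {{RedirectCheck}} and its methods are hypothetical names that are not part of Tika, it resolves a single hop only, and it probes with a HEAD request, which may not be accepted for every URL (a presigned S3 URL, for example, is signed for one specific HTTP method).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of Tika: resolves one redirect hop on the
// client side so that the URL sent in the fileUrl header is the final one.
final class RedirectCheck {

    /** Returns true for any 3xx status code. */
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    /**
     * Decides which URL to hand to Tika, given the status code and the
     * Location header returned for the original URL.
     */
    static String targetFor(String url, int status, String location) {
        if (isRedirect(status)) {
            if (location == null) {
                throw new IllegalStateException("HTTP " + status + " without Location for " + url);
            }
            // Location may be relative, so resolve it against the original URL.
            return URI.create(url).resolve(location).toString();
        }
        if (status >= 400) {
            throw new IllegalStateException("HTTP " + status + " for " + url);
        }
        return url;
    }

    /** Probes the URL with redirects disabled and returns the URL to use. */
    static String resolveOnce(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // we want to see the 3xx ourselves
        conn.setRequestMethod("HEAD");          // assumption: the server accepts HEAD
        try {
            return targetFor(url, conn.getResponseCode(), conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would pass the result of {{RedirectCheck.resolveOnce(originalUrl)}} as the {{fileUrl}} header value instead of the original URL, and treat the exception as a failed fetch rather than parsing whatever body came back.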
was:
When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status
code, Tika happily returns a 200 response.
*How to reproduce the issue*: Run Tika server with the
{{-enableUnsecureFeatures}} and {{-enableFileUrl}} options. Then send a
{{fileUrl}} to the server that returns a 3xx status code. Here is a sample
curl session:
{code:java}
$ curl -v google.com
* Rebuilt URL to: google.com/
* Trying 216.58.216.142...
* TCP_NODELAY set
* Connected to google.com (216.58.216.142) port 80 (#0)
> GET / HTTP/1.1
> Host: google.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://www.google.com/
< Content-Type: text/html; charset=UTF-8
< Date: Wed, 05 Sep 2018 15:31:51 GMT
< Expires: Fri, 05 Oct 2018 15:31:51 GMT
< Cache-Control: public, max-age=2592000
< Server: gws
< Content-Length: 219
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
<
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
* Connection #0 to host google.com left intact
$ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9998 (#0)
> PUT /rmeta/text HTTP/1.1
> Host: localhost:9998
> User-Agent: curl/7.54.0
> Accept: */*
> fileUrl:http://google.com
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Wed, 05 Sep 2018 15:25:12 GMT
< Transfer-Encoding: chunked
< Server: Jetty(8.y.z-SNAPSHOT)
<
* Connection #0 to host localhost left intact
[{"Content-Encoding":"UTF-8","Content-Type":"text/html;
charset\u003dUTF-8","Content-Type-Hint":"text/html;
charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
Search Images Maps Play YouTube News Gmail Drive More »\nWeb History |
Settings | Sign in\n\n\n \n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage
tools\n\n\n\n\nGoogle offered in: Français \n\n\nAdvertising ProgramsBusiness
Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy -
Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
> Tika does not recognize http 3xx error codes when passed fileUrl
> ----------------------------------------------------------------
>
> Key: TIKA-2724
> URL: https://issues.apache.org/jira/browse/TIKA-2724
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.18
> Reporter: Mohsen
> Priority: Major
>
> When the {{fileUrl}} passed to the Tika server results in a 3xx HTTP status
> code, Tika happily returns a 200 response.
> *How to reproduce the issue*: Run Tika server with the
> {{-enableUnsecureFeatures}} and {{-enableFileUrl}} options. Then send a
> {{fileUrl}} to the server that returns a 3xx status code. Here is a sample
> curl session:
> {code:java}
> $ curl -v google.com
> * Rebuilt URL to: google.com/
> * Trying 216.58.216.142...
> * TCP_NODELAY set
> * Connected to google.com (216.58.216.142) port 80 (#0)
> > GET / HTTP/1.1
> > Host: google.com
> > User-Agent: curl/7.54.0
> > Accept: */*
> >
> < HTTP/1.1 301 Moved Permanently
> < Location: http://www.google.com/
> < Content-Type: text/html; charset=UTF-8
> < Date: Wed, 05 Sep 2018 15:31:51 GMT
> < Expires: Fri, 05 Oct 2018 15:31:51 GMT
> < Cache-Control: public, max-age=2592000
> < Server: gws
> < Content-Length: 219
> < X-XSS-Protection: 1; mode=block
> < X-Frame-Options: SAMEORIGIN
> <
> <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
> <TITLE>301 Moved</TITLE></HEAD><BODY>
> <H1>301 Moved</H1>
> The document has moved
> <A HREF="http://www.google.com/">here</A>.
> </BODY></HTML>
> * Connection #0 to host google.com left intact
> $ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
> * Trying ::1...
> * TCP_NODELAY set
> * Connected to localhost (::1) port 9998 (#0)
> > PUT /rmeta/text HTTP/1.1
> > Host: localhost:9998
> > User-Agent: curl/7.54.0
> > Accept: */*
> > fileUrl:http://google.com
> >
> < HTTP/1.1 200 OK
> < Content-Type: application/json
> < Date: Wed, 05 Sep 2018 15:25:12 GMT
> < Transfer-Encoding: chunked
> < Server: Jetty(8.y.z-SNAPSHOT)
> <
> * Connection #0 to host localhost left intact
> [{"Content-Encoding":"UTF-8","Content-Type":"text/html;
> charset\u003dUTF-8","Content-Type-Hint":"text/html;
> charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n
> Search Images Maps Play YouTube News Gmail Drive More »\nWeb History |
> Settings | Sign in\n\n\n \n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage
> tools\n\n\n\n\nGoogle offered in: Français \n\n\nAdvertising ProgramsBusiness
> Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy -
> Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]{code}
>
> I am using Tika server to pull files from S3 and parse them, but when S3
> responds with a redirect, Tika neither follows it nor returns an error code.
> See https://docs.aws.amazon.com/AmazonS3/latest/dev/Redirects.html
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)