sebastian-nagel opened a new pull request, #850:
URL: https://github.com/apache/nutch/pull/850
- Upgrade to the shaded Tika packages 3.1.0.0 provided by Tim Allison.
The shaded packages are required to avoid version conflicts when running
in distributed mode: the commons-io jar shipped with Hadoop is incompatible
with the version required by Tika, cf. NUTCH-2959.
- Add "text/javascript" as a MIME type supported by "parse-js". Note: This
fixes the parse-js unit tests, because Tika 3.1.0 identifies the JavaScript
test document as "text/javascript" instead of "application/javascript".
Todo:
- [ ] fix unit test o.a.n.parse.tika.TestDOMContentUtils: outlinks are
duplicated
- [ ] fix unit test o.a.n.parse.tika.TestHtmlParser: parsing a UTF-16
encoded HTML document fails partially (no title, no keywords). Note: the
test document might contain two BOMs.
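
One way to check the two-BOM suspicion is to count the byte order marks at the very start of the test file. The following is only a minimal sketch, not part of this pull request: the class name `BomCheck` and the method `countLeadingUtf16Boms` are hypothetical, and the path to the test document has to be passed on the command line. It treats both `FE FF` (big-endian) and `FF FE` (little-endian) as a BOM.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch or Tika.
public class BomCheck {

    /** Counts consecutive UTF-16 BOMs (FE FF or FF FE) at the start of the data. */
    static int countLeadingUtf16Boms(byte[] data) {
        int count = 0;
        // Walk the data in 2-byte steps until a pair is not a BOM.
        for (int i = 0; i + 1 < data.length; i += 2) {
            int b0 = data[i] & 0xFF;
            int b1 = data[i + 1] & 0xFF;
            boolean bigEndian = b0 == 0xFE && b1 == 0xFF;
            boolean littleEndian = b0 == 0xFF && b1 == 0xFE;
            if (!bigEndian && !littleEndian) {
                break;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java BomCheck <path-to-test-document>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Leading UTF-16 BOMs: " + countLeadingUtf16Boms(data));
    }
}
```

A count of 2 or more would support the suspicion that the second BOM ends up in the decoded text; a count of 1 would point to a different cause.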