I just committed some changes to Tika that (in theory) should ensure all URLs get extracted from HTML documents.

See https://issues.apache.org/jira/browse/TIKA-463 for details.

It would be great if somebody active in Nutch could try this out with the current suite of Nutch tests for HTML processing.

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



