I did some work on the parser and utils, and URLs are now extracted nicely. I would prefer this code even though it does not extract URLs from certain elements. The nofollow patch in Tika works well too.
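
To illustrate what "doing something with nofollow" could look like once rel is available on each extracted link, here is a rough, standalone sketch. It does not use the Tika or Nutch classes: ExtractedLink, isNofollow and followable are hypothetical names made up for this example, and the URLs are placeholders. The only fact it relies on is that a rel attribute is a whitespace-separated list of tokens compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class NofollowFilter {

    // Hypothetical stand-in for an extracted link; NOT Tika's or Nutch's class.
    public static final class ExtractedLink {
        final String uri;
        final String rel; // raw rel attribute value, may be null

        public ExtractedLink(String uri, String rel) {
            this.uri = uri;
            this.rel = rel;
        }
    }

    // True if the rel value contains the token "nofollow".
    // rel is a whitespace-separated token list, compared case-insensitively.
    public static boolean isNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Returns the URIs of all links that are not marked nofollow, in input order.
    public static List<String> followable(List<ExtractedLink> links) {
        List<String> out = new ArrayList<>();
        for (ExtractedLink link : links) {
            if (!isNofollow(link.rel)) {
                out.add(link.uri);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<ExtractedLink> links = Arrays.asList(
            new ExtractedLink("http://example.com/a", null),
            new ExtractedLink("http://example.com/b", "nofollow"),
            new ExtractedLink("http://example.com/c", "external NoFollow"),
            new ExtractedLink("http://example.com/d", "noopener"));
        // prints [http://example.com/a, http://example.com/d]
        System.out.println(followable(links));
    }
}
```

Whether nofollow links should be dropped outright or only flagged is a policy decision; the sketch drops them just to keep the example short.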
https://issues.apache.org/jira/browse/TIKA-824

On Wednesday 21 December 2011 14:51:01 Markus Jelsma wrote:
> Hi,
>
> For using Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH
> returns all URL's with some meta data such as title etc. Fixes for old
> parsers such as Neko are then obsolete.
>
> I propose to rely on Tika for all outlinks. Right now this means not all
> types are returned such as area, form and whatelse. Is this a big problem?
> Rel is also not returned but i patched Tika to do that so we can still do
> something with nofollow which is important.
>
> Thanks

--
Markus Jelsma - CTO - Openindex

