Well, I did some work on the parser and utils, and URLs are now nicely
extracted. I would prefer this code even though it does not extract URLs from
certain elements, such as area and form. The nofollow patch in Tika works well
too.

https://issues.apache.org/jira/browse/TIKA-824
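
For illustration only, a rough sketch of how the rel value could be used to
drop nofollow outlinks once the parser returns it. The Outlink class and the
method names are made up for this example and are not Tika or Nutch API. The
only assumption is that each extracted link carries its URL and raw rel
attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NofollowFilter {

    // Hypothetical stand-in for an extracted link: target URL plus the raw
    // rel attribute value (null when the element had no rel attribute).
    public static final class Outlink {
        final String url;
        final String rel;

        public Outlink(String url, String rel) {
            this.url = url;
            this.rel = rel;
        }
    }

    // rel is a whitespace-separated list of tokens, compared case-insensitively,
    // so "external NoFollow" counts as nofollow but "nofollower" does not.
    public static boolean isNofollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    // Returns the URLs of all links that are not marked nofollow, in order.
    public static List<String> followable(List<Outlink> links) {
        List<String> urls = new ArrayList<>();
        for (Outlink link : links) {
            if (!isNofollow(link.rel)) {
                urls.add(link.url);
            }
        }
        return urls;
    }
}
```

Matching on whole tokens rather than a substring avoids false positives on
rel values that merely contain the text "nofollow".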

On Wednesday 21 December 2011 14:51:01 Markus Jelsma wrote:
> Hi,
> 
> To use Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH
> returns all URLs with some metadata such as the title. Fixes for old
> parsers such as Neko are then obsolete.
> 
> I propose to rely on Tika for all outlinks. Right now this means not all
> types are returned, such as area, form and others. Is this a big problem?
> Rel is also not returned, but I patched Tika to do that so we can still do
> something with nofollow, which is important.
> 
> Thanks

-- 
Markus Jelsma - CTO - Openindex
