+1 from me -- those 3 Tika content handlers should take care of it...

Cheers,
Chris

On Dec 21, 2011, at 6:51 AM, Markus Jelsma wrote:

> Hi,
>
> For using Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH
> returns all URLs with some metadata such as title etc. Fixes for old parsers
> such as Neko are then obsolete.
>
> I propose to rely on Tika for all outlinks. Right now this means some types,
> such as area, form and whatever else, are not returned. Is this a big problem?
> Rel is also not returned, but I patched Tika to do that so we can still do
> something with nofollow, which is important.
>
> Thanks
>
> --
> Markus Jelsma - CTO - Openindex

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
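
For anyone following the thread, here is a rough sketch of the pattern being discussed: one SAX event stream fanned out ("tee") to a link collector and a text collector, with rel kept so nofollow can still be honoured. It uses only the JDK's SAX classes on well-formed XHTML. OutlinkSketch, Link, LinkCollector, TextCollector and Tee are hypothetical stand-ins made up for illustration, not Tika's actual handler classes, and only <a href> links are collected (no area or form).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: these are NOT Tika classes.
class OutlinkSketch {

    /** One outlink plus the metadata mentioned in the thread (title, rel). */
    static final class Link {
        final String uri, title, rel;
        Link(String uri, String title, String rel) {
            this.uri = uri; this.title = title; this.rel = rel;
        }
        /** True if the space-separated rel attribute contains "nofollow". */
        boolean isNofollow() {
            return rel != null
                && Arrays.asList(rel.toLowerCase().split("\\s+")).contains("nofollow");
        }
    }

    /** Collects <a href="..."> elements from the SAX stream. */
    static final class LinkCollector extends DefaultHandler {
        final List<Link> links = new ArrayList<>();
        @Override
        public void startElement(String ns, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equalsIgnoreCase(qName) && href != null) {
                links.add(new Link(href, atts.getValue("title"), atts.getValue("rel")));
            }
        }
    }

    /** Collects character data; stands in for a text-extracting handler. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards the two event types used here to every delegate. */
    static final class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;
        Tee(DefaultHandler... delegates) { this.delegates = delegates; }
        @Override
        public void startElement(String ns, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(ns, local, qName, atts);
        }
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Parses well-formed XHTML once and feeds the events to the given handler. */
    static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
            + "<a href=\"http://example.org/a\" title=\"A\" rel=\"nofollow\">a</a> "
            + "<a href=\"/b\">b</a></body></html>";
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();
        parse(page, new Tee(links, text));
        for (Link l : links.links) {
            System.out.println(l.uri + " nofollow=" + l.isNofollow());
        }
    }
}
```

The point of the tee is that the page is parsed once while each consumer stays independent; a caller could then skip or flag links where isNofollow() is true.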

