Tika has boilerpipe, which is not bad for web pages in general. I have a port of readability, which is better than boilerpipe for news articles in particular. It seems to me that I should investigate whether Tika has room for both.

On Thu, Nov 11, 2010 at 4:04 PM, Ted Dunning <[email protected]> wrote:
> I believe that this is included in Tika now (according to Ken Krugler)
>
> On Thu, Nov 11, 2010 at 12:37 PM, Isabel Drost <[email protected]> wrote:
>
>> ...
>>
>> As a side note - a project with similar goals was mentioned on the Lucene
>> mailing lists a while ago: http://code.google.com/p/boilerpipe/
>>
>> Cheers,
>> Isabel
>>
>
