Hi Otis - thanks for the nudge.

Hi Benson - yes, something like this would be useful.

My personal preference for integrating things like this into Tika is to create a ContentHandler. Then it's trivial to use for extracting body content, and you can use the TeeContentHandler to run it in parallel with other handlers.

See BoilerpipeContentHandler in Tika for one example of this approach, though that code got a bit messy when I changed it to support including markup.
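
To make the ContentHandler idea concrete, here is a minimal sketch that uses only the JDK's SAX classes so it runs without Tika on the classpath. The class name ParagraphTextHandler, its extract helper, and the "keep only <p> text" rule are all hypothetical stand-ins for a real extraction algorithm, not anything that exists in Tika or in the readability port.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects the text found inside <p> elements of a
 * SAX event stream. A real extractor would apply its own scoring rules here.
 */
class ParagraphTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside at least one <p>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName) && depth > 0) {
            depth--;
            if (depth == 0) {
                text.append('\n'); // end each outermost paragraph with a newline
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    public String getText() {
        return text.toString();
    }

    /** Illustrative helper: drives the handler with the JDK's own SAX parser. */
    public static String extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        ParagraphTextHandler handler = new ParagraphTextHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><div>skip</div><p>World</p></body></html>";
        System.out.print(extract(xhtml));
    }
}
```

Because this is a plain org.xml.sax.ContentHandler, it should be possible to hand it to a Tika parser next to an existing handler by wrapping both in a TeeContentHandler, so each handler sees the same SAX events from a single parse.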

-- Ken

On Jan 2, 2011, at 10:55am, Otis Gospodnetic wrote:

Somehow this nice offer didn't seem to attract any responses -
http://search-lucene.com/m/ZTMKyJXNR92

+1 for this patch.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
From: Benson Margulies <[email protected]>
To: [email protected]
Sent: Thu, November 4, 2010 9:02:10 AM
Subject: Boilerpipe is nice, but what about readability?

I just coded a Java port of the arclabs 'readability' javascript code,
which has a very strong reputation as a device for grabbing the useful
content from newsy web pages.

I could contribute it to Tika, if (a) you wanted it, and (b) there was
some reasonable way to decide or configure which one to use.


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




