Hi Otis - thanks for the nudge.
Hi Benson - yes, something like this would be useful.
My personal preference for integrating things like this into Tika
is to create a ContentHandler. Then it's trivial to use for extracting
body content, and you can use the TeeContentHandler to run it in
parallel with other handlers.
See BoilerpipeContentHandler in Tika for one example of this approach.
Though that code got a bit messy when I changed it to support
including markup.
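
A minimal sketch of the handler approach, using only the JDK's SAX classes
so it runs without Tika on the classpath. MainTextHandler, AllTextHandler,
Tee, and the 20-character paragraph threshold are hypothetical names and
values for illustration. Tee is a local stand-in for Tika's
TeeContentHandler, and the length filter is only a placeholder, not the
readability algorithm.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ReadabilitySketch {

    /** Hypothetical extractor: keeps <p> blocks with at least minChars characters. */
    static class MainTextHandler extends DefaultHandler {
        private final int minChars;
        private final List<String> kept = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a <p>

        MainTextHandler(int minChars) { this.minChars = minChars; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("p".equals(qName)) current = new StringBuilder();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) current.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("p".equals(qName) && current != null) {
                String text = current.toString().trim();
                if (text.length() >= minChars) kept.add(text);
                current = null;
            }
        }

        String text() { return String.join("\n", kept); }
    }

    /** Collects every character event, i.e. the unfiltered body text. */
    static class AllTextHandler extends DefaultHandler {
        private final StringBuilder sb = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            sb.append(ch, start, length);
        }

        String text() { return sb.toString(); }
    }

    /** Stand-in for a tee handler: forwards three SAX events to every target. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] targets;

        Tee(DefaultHandler... targets) { this.targets = targets; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler t : targets) t.startElement(uri, local, qName, atts);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler t : targets) t.characters(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler t : targets) t.endElement(uri, local, qName);
        }
    }

    /** Parses well-formed XHTML once; returns { filtered text, all text }. */
    static String[] extract(String xhtml, int minChars) {
        MainTextHandler main = new MainTextHandler(minChars);
        AllTextHandler all = new AllTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new Tee(main, all));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return new String[] { main.text(), all.text() };
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Home | About</p>"
                + "<p>This is the long article paragraph that a reader wants.</p>"
                + "</body></html>";
        String[] out = extract(page, 20);
        System.out.println("filtered: " + out[0]);
        System.out.println("all:      " + out[1]);
    }
}
```

Both handlers see the same event stream from a single parse, which is the
point of the tee: a second extractor can be added without a second pass
over the document. In Tika itself the handlers would presumably be passed
to the parser through TeeContentHandler rather than the local Tee class.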
-- Ken
On Jan 2, 2011, at 10:55am, Otis Gospodnetic wrote:
Somehow this nice offer didn't seem to attract any responses -
http://search-lucene.com/m/ZTMKyJXNR92
+1 for this patch.
Otis
----- Original Message ----
From: Benson Margulies <[email protected]>
To: [email protected]
Sent: Thu, November 4, 2010 9:02:10 AM
Subject: Boilerpipe is nice, but what about readability?
I just coded a Java port of the arclabs 'readability' javascript code,
which has a very strong reputation as a device for grabbing the useful
content from newsy web pages.
I could contribute it to Tika, if (a) you wanted it, and (b) there was
some reasonable way to decide or configure which one to use.
--------------------------
Ken Krugler