Olena,

To serialize HTML you can use NekoHTML: create a plugin (similar to parse-html) and run "nutch parse".

http://people.apache.org/~andyc/neko/doc/html/filters.html#filters.serialize

"bin/nutch parse" will execute the parse-html plugin.

To access the non-parsed content (after fetching), you can start from the ParseSegment utility as a sample.

-----Original Message-----
From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED]
Sent: Monday, August 22, 2005 12:50 PM
To: [email protected]
Subject: Re: How to view the content of fetched pages?

Hi,
You have to write some Java code to access the Content objects stored in segments (the easiest way to start is to use SegmentReader).
Regards
Piotr

Olena Medelyan wrote:
> Hi,
>
> I would like to use Nutch only as a (whole web) crawler, without the
> indexing stage... After I've completed the fetching stage, how can I
> access the database with the crawled data, in particular the texts of
> the fetched pages? I tried to use segread and readdb from the command
> line, unfortunately with no success.
>
> Cheers,
> Olena

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
