Olena,

To serialize HTML you can use NekoHTML: create a plugin (similar to
parse-html) and execute "nutch parse". See the serializing filter:
http://people.apache.org/~andyc/neko/doc/html/filters.html#filters.serialize
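
A rough sketch of the serialization step, in case it helps. This does not use
NekoHTML or any Nutch API: it assumes you already hold a parsed W3C DOM
Document and writes it back out as HTML with the JDK's own Transformer. The
class name HtmlSerializeSketch and the hand-built sample document are made up
for illustration; in a real plugin the DOM would come from the HTML parser.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class HtmlSerializeSketch {

    // Hypothetical stand-in for a parsed page: builds <html><body><p>..</p></body></html>.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("Hello, Nutch");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serializes a DOM tree to a string using the JDK identity transformer
    // with the "html" output method.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Inside a parse plugin you would call something like serialize(...) on the
parsed document and store or print the result, instead of building the sample
document by hand.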

"bin/nutch parse" will execute the parse-html plugin.
To access the non-parsed content (after fetching), you can use the
ParseSegment utility as a starting sample.



-----Original Message-----
From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 22, 2005 12:50 PM
To: [email protected]
Subject: Re: How to view the content of fetched pages?


Hi,
You have to write some Java code to access the Content objects stored in
segments (the easiest way to start is to use SegmentReader).

Regards,
Piotr

Olena Medelyan wrote:
> Hi,
> 
> I would like to use Nutch only as a (whole web) crawler, without the 
> indexing stage... After I've completed the fetching stage, how can I 
> access the database with the crawled data, in particular the texts of 
> the fetched pages? I tried to use segread and readdb from the command 
> line, unfortunately with no success.
> 
> Cheers, Olena
> 
> 





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
