ripper07 wrote:
hey everyone, I've got 3 questions. 1. Can Nutch somehow return whole HTML pages instead of raw data segments?
The Content objects stored in segments hold the full HTML for fetched pages. They are used, among other things, to display the Nutch cache.
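If you just want to look at that stored content, the SegmentReader tool can dump it to plain text. A sketch (the segment path here is hypothetical; check `bin/nutch readseg` usage for your version):

```shell
# Dump a whole segment, raw fetched content included, as plain text
# into dump_dir/ (segment name below is made up for illustration).
bin/nutch readseg -dump crawl/segments/20090101000000 dump_dir

# Or look up a single URL's record, content included.
bin/nutch readseg -get crawl/segments/20090101000000 http://example.com/
```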
2. Is there a property that can set the size of cached Nutch result pages? I've got a problem with this, because I also need the pictures from the pages.
You can set the maximum content size downloadable, but the Nutch crawler doesn't fetch images. If you need a crawler that will make exact archived copies of web pages, look at the Heritrix crawler. Heritrix outputs ARC files (gzipped records appended into a single file) and can be integrated into Nutch using the ArcToSegments tool.
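For reference, the download cap mentioned above is controlled by the `http.content.limit` property (in bytes; -1 disables the limit), which you can override in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: raise the per-page download cap from the
     default 65536 bytes to 1 MB; use -1 for no limit at all. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```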
3. Is there a way I can access the results other than via Apache's localhost service? An explanation of how Nutch parses the raw data segments into viewable results would also come in handy.
The NutchBean can access search servers from Java code over Hadoop RPC (sockets). The Nutch WAR gives an example of this. In the 1.0 release there will be the ability to access search results via XML and JSON, but a servlet will still need to be deployed.
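A rough sketch of the NutchBean route from Java code (written against the 0.9-era API; class and method names may differ slightly in your release, so treat this as an outline rather than copy-paste code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // The bean finds the local index via searcher.dir, or the
        // remote search servers via search-servers.txt (distributed).
        Configuration conf = NutchConfiguration.create();
        NutchBean bean = new NutchBean(conf);

        Query query = Query.parse("nutch", conf);
        Hits hits = bean.search(query, 10);   // top 10 hits

        for (int i = 0; i < hits.getLength(); i++) {
            Hit hit = hits.getHit(i);
            HitDetails details = bean.getDetails(hit);
            // Summaries are built at query time from segment content.
            System.out.println(details.getValue("url") + " : "
                + bean.getSummary(details, query));
        }
    }
}
```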
How Nutch goes from raw content to search results is a big process. Briefly: we fetch the page, store it in segments, then do analysis, processing, scoring, and indexing. Indexes are deployed to search servers. A query comes in and is broadcast out to the search servers, which return hits. Hits are matched to hit details (fields) and to summaries (content from the segments). Summaries are parsed out at query time using a Nutch summarizer plugin. The full results are then returned to the user and displayed (via either a web page, XML, or JSON).
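The broadcast-and-merge step of that query flow can be sketched in plain Java. This is a toy illustration of the idea, not Nutch's actual classes: each "server" holds a small local index, the front end fans the query out, and the scored hits are merged into one ranked list.

```java
import java.util.*;
import java.util.stream.*;

// Toy sketch (not Nutch code) of distributed search: broadcast a
// query to several "search servers", collect scored hits, merge.
public class DistributedSearchSketch {

    // A hit as returned by one server: a document id and a score.
    record Hit(String url, float score) {}

    // A stand-in for one search server holding a tiny local index.
    record Server(Map<String, Float> index) {
        List<Hit> search(String term) {
            // Toy scoring: return every document whose id contains
            // the term, with its precomputed score.
            return index.entrySet().stream()
                .filter(e -> e.getKey().contains(term))
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        }
    }

    // Broadcast the query, merge all hits, keep the top n by score.
    static List<Hit> search(List<Server> servers, String term, int n) {
        return servers.stream()
            .flatMap(s -> s.search(term).stream())
            .sorted(Comparator.comparingDouble((Hit h) -> -h.score))
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Server> servers = List.of(
            new Server(Map.of("http://a.example/nutch", 0.9f,
                              "http://a.example/other", 0.2f)),
            new Server(Map.of("http://b.example/nutch-faq", 0.7f)));
        for (Hit h : search(servers, "nutch", 10))
            System.out.println(h.url() + " " + h.score());
        // → http://a.example/nutch 0.9
        //   http://b.example/nutch-faq 0.7
    }
}
```

In real Nutch the fan-out happens over Hadoop RPC and hit details and summaries are fetched in a second round trip, but the merge-by-score shape is the same.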
Dennis
thx
