ripper07 wrote:
hey everyone, I've got 3 questions. 1. Can Nutch somehow return whole HTML pages instead of raw data segments?
The Content objects stored in segments hold the full HTML for fetched pages. They are used, among other things, to display the Nutch cache.
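If you just want to look at that stored content, the SegmentReader tool can dump it to plain text. A sketch (the segment path here is hypothetical; check `bin/nutch readseg` usage for your version):

```shell
# Dump a whole segment, raw fetched content included, as plain text
# into dump_dir/ (segment name below is made up for illustration).
bin/nutch readseg -dump crawl/segments/20090101000000 dump_dir

# Or look up a single URL's record, content included.
bin/nutch readseg -get crawl/segments/20090101000000 http://example.com/
```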
2. Is there a property that can set the size of cached Nutch result pages? I've got a problem with this, because I also need the pictures from the pages.
You can set the maximum content size downloadable, but the Nutch crawler doesn't fetch images. If you need a crawler that will make exact archived copies of web pages, look at the Heritrix crawler. Heritrix outputs ARC files (gzipped records appended into a single file) and can be integrated into Nutch using the ArcToSegments tool.
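For reference, the download cap mentioned above is controlled by the `http.content.limit` property (in bytes; -1 disables the limit), which you can override in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: raise the per-page download cap from the
     default 65536 bytes to 1 MB; use -1 for no limit at all. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```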
3. Is there a way I can access the results other than via Apache's localhost service? An explanation of how Nutch parses the raw data segments into viewable results would also come in handy.
The NutchBean can access search servers from Java code over Hadoop RPC (sockets). The Nutch WAR gives an example of this. In the 1.0 release there will be the ability to access search results via XML and JSON, but a servlet will still need to be deployed.
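A rough sketch of the NutchBean route from Java code (written against the 0.9-era API; class and method names may differ slightly in your release, so treat this as an outline rather than copy-paste code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // The bean finds the local index via searcher.dir, or the
        // remote search servers via search-servers.txt (distributed).
        Configuration conf = NutchConfiguration.create();
        NutchBean bean = new NutchBean(conf);

        Query query = Query.parse("nutch", conf);
        Hits hits = bean.search(query, 10);   // top 10 hits

        for (int i = 0; i < hits.getLength(); i++) {
            Hit hit = hits.getHit(i);
            HitDetails details = bean.getDetails(hit);
            // Summaries are built at query time from segment content.
            System.out.println(details.getValue("url") + " : "
                + bean.getSummary(details, query));
        }
    }
}
```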
How Nutch goes from raw content to search results is a big process. Briefly: we fetch the page, store it in segments, then do analysis, processing, scoring, and indexing. Indexes are deployed to search servers. A query comes in and is broadcast out to the search servers, which return hits. Hits are matched to hit details (fields) and to summaries (content from the segments). Summaries are parsed out at query time using a Nutch summarizer plugin. The full results are then returned to the user and displayed (via either a web page, XML, or JSON).
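The broadcast-and-merge step of that query flow can be sketched in plain Java. This is a toy illustration of the idea, not Nutch's actual classes: each "server" holds a small local index, the front end fans the query out, and the scored hits are merged into one ranked list.

```java
import java.util.*;
import java.util.stream.*;

// Toy sketch (not Nutch code) of distributed search: broadcast a
// query to several "search servers", collect scored hits, merge.
public class DistributedSearchSketch {

    // A hit as returned by one server: a document id and a score.
    record Hit(String url, float score) {}

    // A stand-in for one search server holding a tiny local index.
    record Server(Map<String, Float> index) {
        List<Hit> search(String term) {
            // Toy scoring: return every document whose id contains
            // the term, with its precomputed score.
            return index.entrySet().stream()
                .filter(e -> e.getKey().contains(term))
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        }
    }

    // Broadcast the query, merge all hits, keep the top n by score.
    static List<Hit> search(List<Server> servers, String term, int n) {
        return servers.stream()
            .flatMap(s -> s.search(term).stream())
            .sorted(Comparator.comparingDouble((Hit h) -> -h.score))
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Server> servers = List.of(
            new Server(Map.of("http://a.example/nutch", 0.9f,
                              "http://a.example/other", 0.2f)),
            new Server(Map.of("http://b.example/nutch-faq", 0.7f)));
        for (Hit h : search(servers, "nutch", 10))
            System.out.println(h.url() + " " + h.score());
        // → http://a.example/nutch 0.9
        //   http://b.example/nutch-faq 0.7
    }
}
```

In real Nutch the fan-out happens over Hadoop RPC and hit details and summaries are fetched in a second round trip, but the merge-by-score shape is the same.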
Dennis
thx
