Hi Michael,

On Friday, 12.08.2005, at 15:36 +0200, Michael Weber wrote:

>  > in the database all (let's say 24) pages are stored.
> 
> The database stores 24 "URLs": the one URL that was indexed and
> the 23 URLs that are linked on that page and that Nutch must index in
> the next crawl.

But this is exactly what I want in the first crawl.

The user of my application can choose a depth, and
a graph is then constructed with the help of the WebDB.
So in the end it should be possible to search in all
of the presented nodes/pages.

Is there a way to have the same number of indexed pages and pages in
the WebDB,

or (which would also be OK)

can I fetch from the WebDB only the pages that have been indexed?
At the moment, to build the graph file, I use this part:

for (Enumeration e = reader.pages(); e.hasMoreElements(); j++) {

This inserts the pages as nodes and then searches for links.

The for loop should now run only over indexed pages.
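
To make it more concrete, here is a rough sketch of the filtering I have in mind. It is only an illustration under assumptions: the class IndexedPageFilter and the method onlyIndexed are made-up names and not Nutch API, the page URLs are plain strings, and the set of indexed URLs is assumed to have been collected from the index beforehand (how best to get that set is part of my question).

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
class IndexedPageFilter {

    // Walks over all page URLs known to the WebDB and keeps only those
    // that also appear in the set of indexed URLs.
    // Assumption: both sides use the same string form of the URL.
    static List<String> onlyIndexed(Enumeration<String> allPageUrls,
                                    Set<String> indexedUrls) {
        List<String> nodes = new ArrayList<String>();
        while (allPageUrls.hasMoreElements()) {
            String url = allPageUrls.nextElement();
            if (indexedUrls.contains(url)) {
                nodes.add(url); // this page would become a node in the graph
            }
        }
        return nodes;
    }
}
```

The graph would then be built only from the returned list, so every node shown to the user would also be searchable.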

I hope you understand what I want.

Greetings from Hamburg :-)

Nils


On Friday, 12.08.2005, at 15:36 +0200, Michael Weber wrote:
>  > in the database all (let's say 24) pages are stored.
> 
> The database stores 24 "URLs": the one URL that was indexed and
> the 23 URLs that are linked on that page and that Nutch must index in
> the next crawl.
> 
> Best regards from Germany
> 
> Michael
> 
> Nils Hoeller wrote:
> > Hi,
> > 
> > I've got the following problem.
> > 
> > When I crawl and index a site
> > with, for example, depth 1, it
> > works perfectly for the WebDB, which means
> > in the database all (let's say 24) pages are stored.
> > 
> > But when I look at the index dir with Luke,
> > I see only one page/doc (the root page of the crawl).
> > 
> > Now when I increase the depth of the crawl
> > to 2, I have about 400 pages in the
> > WebDB and the 24 in the index.
> > 
> > So the index seems to be built for the depth-1 number of pages?
> > 
> > Why is that so? Is that a configuration problem?
> > 
> > Thanks for your help
> > 
> > Nils
> > 



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers