Hi Michael,
On Friday, 12.08.2005 at 15:36 +0200, Michael Weber wrote:
> > in the database all (let's say 24) pages are stored.
>
> The database stores 24 "URLs": the one URL that has been indexed and
> the 23 URLs that are linked from that page, which Nutch must index in
> the next crawl.
But this is exactly what I want in the first crawl.
The user of my application can choose a depth, and
then a graph is constructed with the help of the WebDB.
So in the end it should be possible to search all
of the presented nodes/pages.
Is there a way to have the same number of indexed pages and pages in
the WebDB,
or (which would also be OK)
can I fetch from the WebDB only those pages that have been indexed?
At the moment, to build the graph file, I use this loop
for (Enumeration e = reader.pages(); e.hasMoreElements(); j++) {
to insert the pages as nodes and then search for links.
This for loop should now run over indexed pages only.
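
To make the second option concrete, here is a minimal sketch of the filtering idea, not working Nutch code. It assumes that the URLs stored in the index have already been collected into a set by some other means, and it uses plain strings in place of the WebDB page objects. The class name IndexedPageFilter, the method onlyIndexed and the example.org URLs are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Vector;

public class IndexedPageFilter {

    // Walks the enumeration (a stand-in for reader.pages()) and keeps, in
    // enumeration order, only the URLs that are also present in indexedUrls.
    static List<String> onlyIndexed(Enumeration<String> pages, Set<String> indexedUrls) {
        List<String> nodes = new ArrayList<>();
        while (pages.hasMoreElements()) {
            String url = pages.nextElement();
            if (indexedUrls.contains(url)) {
                nodes.add(url); // this page would become a graph node
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Assumption: every URL the WebDB knows about, fetched or not.
        Vector<String> webDbUrls = new Vector<>(Arrays.asList(
                "http://example.org/",
                "http://example.org/a.html",
                "http://example.org/b.html"));
        // Assumption: the URLs actually stored in the index (here only the root page).
        Set<String> indexedUrls = new HashSet<>(Arrays.asList("http://example.org/"));

        System.out.println(onlyIndexed(webDbUrls.elements(), indexedUrls));
    }
}
```

In the real loop the membership test would sit right after the page is taken from the enumeration, so that pages which are known but not yet fetched are skipped before they are inserted as nodes.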
I hope you understand what I want.
Greetings from Hamburg :-)
Nils
On Friday, 12.08.2005 at 15:36 +0200, Michael Weber wrote:
> > in the database all (let's say 24) pages are stored.
>
> The database stores 24 "URLs": the one URL that has been indexed and
> the 23 URLs that are linked from that page, which Nutch must index in
> the next crawl.
>
> Best regards from Germany
>
> Michael
>
> Nils Hoeller wrote:
> > Hi,
> >
> > I've got the following problem.
> >
> > When I crawl and index a site
> > with, for example, depth 1, it
> > works perfectly for the WebDB, which means
> > in the database all (let's say 24) pages are stored.
> >
> > But when I look at the index dir with Luke,
> > I see only one page/doc (the root page of the crawl).
> >
> > Now when I increase the depth of the crawl
> > to 2, I have about 400 pages in the
> > WebDB and the 24 in the index.
> >
> > So the index seems to be built for the number of pages at depth-1?
> >
> > Why is that so? Is that a configuration problem?
> >
> > Thanks for your help
> >
> > Nils
> >
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers