I have not added any such property to my nutch-site.xml, and I have omitted the -topN argument in the bin/generate command.
So my question is: what is the effect in this case? I expected it to be the same as -topN <infinity>, so that the generate phase would generate all possible URLs. But when I omitted the topN value in my crawl script, my crawl ran much faster. Earlier I had a -topN 2000 argument, and a crawl of depth 5 used to take 4-5 days to finish. Now, without the topN argument, a crawl of depth 5 finished in 6 hours. How?

On 9/7/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> Now I'm getting a bit uncertain, but I think you can add crawl.topN in your
> nutch-site.xml. I couldn't find it in nutch-default either, but I'm quite sure
> it is set somewhere!
>
> /Rikard
>
> 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> >
> > Thanks for the response. What is the property name for this default
> > value of topN in nutch-default.xml?
> >
> > On 9/6/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> > > There is a default value in nutch-default.xml
> > >
> > > /Rikard
> > >
> > > 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> > > >
> > > > In the bin/generate command, if I omit the 'topN' argument, what is
> > > > the behavior?
> > > >
> > > > Does it generate all possible URLs or does it assume a default topN
> > > > value?
> > > >
> > > > I tried omitting topN value in my crawl script and I find that my
> > > > crawl is running much faster. Earlier I had a -topN 2000 argument and
> > > > it used to take 4-5 days to finish a crawl of depth 5.
> > > >
> > > > Now, without the topN argument, it finished a crawl of depth 5 in 6
> > > > hours. Can anyone explain what's going on?
