Hi Susam,

No, in fact the 350 000 documents are spread across 356 HTML index pages. 
The URLs of these index pages are what I put in my text file.

When I ran the crawl and index, I did this:

        bin/nutch crawl urls -dir crawldir -depth 5

"urls" is the name of the directory containing the text file, and 
"crawldir" is the directory where the index goes. I think a depth of 5 is 
enough for the type of URL I need to index. 
Is that correct? I think it is, because I looked in hadoop.log and the 
deeper URLs appear in the log. 
Maybe I should try with depth = 10... I will try it and let you know if it 
works...
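
To check the log for failed fetches, as Susam suggested, something like the 
following might work. It is only a minimal sketch: the helper name 
count_lines_containing is made up for illustration, and the marker text 
"failed with" is an assumption about how failed fetches are worded in 
hadoop.log, so it may need adjusting after looking at the real log lines.

```python
from pathlib import Path


def count_lines_containing(lines, marker):
    """Count the lines that contain the given marker substring."""
    return sum(1 for line in lines if marker in line)


if __name__ == "__main__":
    # Log location as mentioned in the thread.
    log_path = Path("logs/hadoop.log")
    # ASSUMPTION: failed fetches are logged with this phrase; adjust if not.
    marker = "failed with"
    if log_path.exists():
        with log_path.open(errors="replace") as log:
            print(count_lines_containing(log, marker), "lines contain", repr(marker))
    else:
        print("log file not found:", log_path)
```

If the count is zero, the missing documents were probably never queued for 
fetching, rather than fetched and failed.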

Thanks for your help,

Jisay




>Do you put the URLs to all 350000 documents in the text file?
>
>If yes, you can check logs/hadoop.log to see if any fetch fails.
>
>If not, maybe some of the documents are too deep, and increasing the
>depth value while crawling might solve the problem.
>
>Regards,
>Susam Pal
>
>On 3/3/08, Jean-Christophe Alleman <[EMAIL PROTECTED]> wrote:
>
> Hi list !
>
> I have a problem when indexing: not all of the documents I want to index 
> are indexed. I have about 350 000 documents, but Nutch doesn't index all 
> of them!
>
> I created a text file in which I put the URLs I want to index, and in 
> crawl-urlfilter.txt I replaced MYDOMAINAME with what I need.
>
> What goes wrong when I index ?
>
> Please help !
>
> Jisay
>