Hi Susam,
No, in fact the 350 000 documents are spread across 356 HTML index pages, and
those index pages are what I put in my text file.
When I ran the crawl and index, I did this:
bin/nutch crawl urls -dir crawldir -depth 5
"urls" is the name of the directory containing the text file, and crawldir is
the directory where the index goes. I think a depth of 5 is enough for the
type of URL I need to index.
Is that correct? I think it is, because I took a look at hadoop.log and the
deeper URLs are in the log.
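
For checking the log more systematically, here is a rough sketch. It assumes
that failed fetches show up in logs/hadoop.log on lines containing
"failed with:"; that wording is an assumption and may differ between Nutch
versions, so adjust the pattern to match the actual log. The function names
are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the "failed with:" pattern is an assumption about how
# fetch failures are worded in hadoop.log. Adjust it to your log.

# Print the number of log lines that look like fetch failures.
count_fetch_failures() {
    # grep -c prints the count of matching lines; "|| true" keeps the
    # exit status at 0 when there are no matches (grep returns 1 then).
    grep -c 'failed with:' "$1" || true
}

# Print the distinct failure reasons, most frequent first.
list_failure_reasons() {
    # Keep only the text after "failed with:", then count duplicates.
    grep 'failed with:' "$1" \
        | sed 's/.*failed with: *//' \
        | sort | uniq -c | sort -rn
}
```

Usage would be something like `count_fetch_failures logs/hadoop.log`. A count
of 0 would suggest the missing documents were never fetched at all, rather
than fetched and failed.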
Maybe I should try with depth = 10... I will try and let you know if it
works...
Thanks for your help,
Jisay
>Do you put the URLs to all 350000 documents in the text file?
>
>If yes, you can check logs/hadoop.log to see if any fetch fails.
>
>If not, maybe some of the documents are too deep, and increasing the
>depth value while crawling might solve the problem.
>
>Regards,
>Susam Pal
>
>On 3/3/08, Jean-Christophe Alleman <[EMAIL PROTECTED]> wrote:
>
> Hi list !
>
> I have a problem while indexing: not all the documents I want to index are
> indexed... I have about 350 000 documents but Nutch doesn't index all
> of them!
>
> I created a text file in which I put the URLs I want to index, and in
> crawl-urlfilter.txt I replaced MY.DOMAIN.NAME with the domain I need.
>
> What is going wrong when I index?
>
> Please help !
>
> Jisay
>