Dennis,

Thank you. The parse finished the second I received your mail. It took more
than 24 hours... I wonder if this is because I added two more parser plugins,
which write a lot to the hadoop.log file. That file usually grows beyond
80 MB every day. Can that cause performance problems?

I also have performance problems when crawling a fairly big source (more than
30,000 URLs). The first 10,000 URLs are fetched fairly quickly, but the
remaining 20,000 take forever. Could it be my parser plugins? The log file?
Not enough fetching threads?

Any ideas would be welcome.

Thank you,

David



-----------------------------------------
David Poirier
E-business Consultant - Software Engineer
 
Direct: +41 (0)22 596 10 35
 
Cross Systems - Groupe Micropole Univers
Route des Acacias 45 B
1227 Carouge / Genève
Tél: +41 (0)22 308 48 60
Fax: +41 (0)22 308 48 68
 

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 4 June 2008 16:27
To: [email protected]
Subject: Re: Can I parse more than once fetched segments?

You can if you remove the crawl_parse, parse_text, and parse_data 
directories and then run the parse command.  Don't know why it would be 
taking so long.
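
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the segment is on the local filesystem, the function name `reset_parse_output` and the backup location are hypothetical, and the directories are moved aside instead of deleted so that the old parse output can be restored if needed.

```shell
# Hypothetical helper (not part of Nutch): move the parse output of a
# segment into a backup directory so the segment can be parsed again.
# Assumes a local-filesystem segment; both paths are chosen by the caller.
reset_parse_output() {
    segment="$1"
    backup="$2"

    # Parsing reads the fetched pages from "content"; stop if it is missing.
    if [ ! -d "$segment/content" ]; then
        echo "no content directory in $segment" >&2
        return 1
    fi

    mkdir -p "$backup" || return 1

    # The three directories written by the parse step.
    for dir in crawl_parse parse_data parse_text; do
        if [ -d "$segment/$dir" ]; then
            mv "$segment/$dir" "$backup/$dir" || return 1
        fi
    done
}

# Usage (placeholder paths, not run here):
#   reset_parse_output ./path/to/segment ./path/to/parse_backup
#   bin/nutch parse ./path/to/segment
```

Keeping the backup outside the segment directory avoids leaving unexpected subdirectories inside the segment itself.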

Dennis

POIRIER David wrote:
> Hello,
> 
> Can I parse fetched segments more than once without having to fetch
> everything again?
> 
> When I first tried to use the "./bin/nutch parse
> ./path/to/an/already/parsed/segment" command I got a Java exception
> explaining that the segment involved had already been parsed. Indeed the
> following subdirectories could be found under the segment directory:
> 
> segment/content
> segment/crawl_fetch
> segment/crawl_generate
> segment/crawl_parse
> segment/parse_data
> segment/parse_text
> 
> To try and force the parsing process I renamed the last 3 subdirectories
> to something else and relaunched the "./bin/nutch parse" command. It has
> been running for more than 24 hours... and it is still not finished.
> 
> My plan is to recreate the index afterward from the newly parsed segment.
> 
> Is this the way to do it? Isn't there a simpler, and maybe quicker, way
> to reparse segments?
> 
> Thank you,
> 
> David
