On 2010-04-05 16:54, ashokkumar.raveendi...@wipro.com wrote:
> Hi,
>       Thank you for your suggestion. I have around 500+ internet urls
> configured for crawling and crawl process is running in Amazon cloud.  I
> have already reduced my depth to 8, topN to 1000 and also increased
> fetcher threads to 150 and limited 50 urls per  host using
> generate.max.per.host property. With this configuration Generate, Fetch,
> Parse, Update completes in max 10 hrs. When comes to segment merge it
> takes lot of time. As a temporary solution I am not doing the segment
> merge and directly indexing the fetched segments. With this solution I
> am able to finish the crawl process with in 24hrs. Now I am looking for
> long term solution to optimize segment merge process.
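For reference, the per-host and thread settings described above would look roughly like this in conf/nutch-site.xml (Nutch 1.x property names; the values are the ones quoted in the message, and depth/topN are command-line arguments, not config properties):

```xml
<!-- Sketch of the configuration described above (Nutch 1.x). -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>150</value>
  <description>Number of fetcher threads to run in parallel.</description>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>50</value>
  <description>Maximum number of URLs selected per host in each
  generate round.</description>
</property>
```

Depth and topN are passed on the command line instead, e.g. `bin/nutch crawl urls -dir crawl -depth 8 -topN 1000`.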

Segment merging is not strictly necessary unless you have on the order of
a hundred segments. If this step takes too much time and the number of
segments is still well below a hundred, simply don't merge them.
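A minimal sketch of one crawl round that skips the merge entirely and indexes the unmerged segments directly, as the original poster describes (Nutch 1.x command names; the `crawl/` directory layout and the use of the built-in `index` command are assumptions):

```shell
#!/bin/sh
# One Generate/Fetch/Parse/Update round with NO "mergesegs" step.
# Paths are placeholders for whatever layout your crawl uses.
CRAWL=crawl

bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
SEGMENT=`ls -d $CRAWL/segments/* | tail -1`

bin/nutch fetch $SEGMENT -threads 150
bin/nutch parse $SEGMENT
bin/nutch updatedb $CRAWL/crawldb $SEGMENT

# Index all segments directly, without merging them first:
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
```

Each round adds one segment directory; as long as the total stays well below a hundred, the indexer can read them all without a prior merge.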


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
