I got some additional info from our developer:

"I never
had much luck with the merge tools but you might post this snippit from
your log to the board:

2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 21:28:09,031 WARN  mapred.LocalJobRunner - job_gai7an
java.lang.OutOfMemoryError: Java heap space

That might give them a little more info, since it shows when the failure happened."
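
Since there's no DFS in the picture, the merge runs in-process under
Hadoop's LocalJobRunner, so the heap it gets is whatever the client
JVM was started with; bin/nutch reads a NUTCH_HEAPSIZE environment
variable (in MB) for exactly that. On a real cluster the equivalent
knob would be the child-JVM heap in conf/hadoop-site.xml. A sketch,
with an illustrative rather than tuned value:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
    <description>Heap for map/reduce child JVMs (sketch only; this
    does not apply to LocalJobRunner, which uses the client JVM's
    heap, i.e. NUTCH_HEAPSIZE).</description>
  </property>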



JoostRuiter wrote:
> 
> Hey guys,
> 
> One more addition: we're not using DFS. We've got a single XP box
> with NTFS (so no distributed index).
> 
> Hope this helps, greetings.
> 
> And for some strange reason we got the following error after slicing
> the segments into 50K URL pieces:
> 
> $ nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/ -slice 50000
> Merging 1 segments to arscrminternal/outseg/20070423163605
> SegmentMerger:   adding arscrminternal/segments/20070421110321
> SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)
> 
> 
> We thought making smaller chunks would help performance, but we never
> got around to testing it because of the above error. Any ideas?
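> 
> In case it's simply the client JVM running out of heap, a retry we
> might attempt (a sketch, untested; the 2000 MB value is a guess, not
> a tuned number) is the same command with more memory via the
> NUTCH_HEAPSIZE variable that bin/nutch reads:
> 
>   $ NUTCH_HEAPSIZE=2000 nutch mergesegs arscrminternal/outseg \
>         -dir arscrminternal/segments/ -slice 50000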
> 
> 
> 
> JoostRuiter wrote:
>> 
>> OK, thanks for all your input, guys! I'll discuss this with my
>> co-worker. Dennis, what more information do you need?
>> 
>> Thanks everyone!
>> 
>> 
>> Briggs wrote:
>>> 
>>> One more thing...
>>> 
>>> Are you using a distributed index?  If so, you don't want to do
>>> that; indexes should be local to the machine doing the searching.
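>>> 
>>> For instance, a minimal sketch of the search webapp's
>>> nutch-site.xml pointed at a local index (the /local/crawl path is
>>> just a placeholder):
>>> 
>>>   <property>
>>>     <name>searcher.dir</name>
>>>     <value>/local/crawl</value>
>>>     <description>Root dir holding the index and segments the
>>>     search webapp reads, on the local filesystem.</description>
>>>   </property>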
>>> 
>>> On 4/23/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>>>> Without more information, this sounds like the nutch-site.xml for
>>>> your Tomcat search webapp is set up to use the DFS rather than the
>>>> local file system.  Remember that job processing occurs on the DFS,
>>>> but for searching, indexes are best moved to the local file system.
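>>>>
>>>> As a sketch (exact values depend on your setup), the
>>>> nutch-site.xml that the webapp sees would point Hadoop at the
>>>> local filesystem, e.g.:
>>>>
>>>>   <property>
>>>>     <name>fs.default.name</name>
>>>>     <value>local</value>
>>>>   </property>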
>>>>
>>>> Dennis Kubes
>>>>
>>>> JoostRuiter wrote:
>>>> > Hi All,
>>>> >
>>>> > First off, I'm quite the noob when it comes to Nutch, so don't
>>>> > bash me if the following is an enormously stupid question.
>>>> >
>>>> > We're running Nutch on a P4 dual-core system (800 MHz FSB) with
>>>> > 4 GB of RAM and a 500 GB SATA (3 Gb/s) HDD. We indexed 350,000
>>>> > pages into a single 15 GB segment.
>>>> >
>>>> >
>>>> > Performance is really poor; when we do get search results at
>>>> > all, they take multiple minutes. With longer queries we get the
>>>> > following:
>>>> >
>>>> > "java.lang.OutOfMemoryError: Java heap space"
>>>> >
>>>> > What we have tried in order to improve this (the two property
>>>> > tweaks are sketched in XML right after this list):
>>>> > - Slicing the segments into smaller chunks (max 50000 URLs per segment)
>>>> > - Setting io.map.index.skip to 8
>>>> > - Setting indexer.termIndexInterval to 1024
>>>> > - Clustering with Hadoop (4 nodes to search)
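>>>> >
>>>> > For reference, a sketch of those two property tweaks in XML form
>>>> > (values exactly as listed above):
>>>> >
>>>> >   <property>
>>>> >     <name>io.map.index.skip</name>
>>>> >     <value>8</value>
>>>> >   </property>
>>>> >   <property>
>>>> >     <name>indexer.termIndexInterval</name>
>>>> >     <value>1024</value>
>>>> >   </property>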
>>>> >
>>>> > Any ideas? Missing information? Please let me know; this is my
>>>> > graduation internship and I would really like to get a good grade ;)
>>>>
>>> 
>>> 
>>> -- 
>>> "Conscious decisions by conscious minds are what make reality real"
>>> 
>>> 
>> 
>> 
> 
> 
