I got some additional info from our developer: "I never had much luck with the merge tools, but you might post this snippet from your log to the board:
2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 21:28:09,031 WARN  mapred.LocalJobRunner - job_gai7an
java.lang.OutOfMemoryError: Java heap space

Which might give them a little more info since it tells them when."

JoostRuiter wrote:
>
> Hey guys,
>
> One more addition: we're not using DFS. We have a single XP box with NTFS
> (so no distributed index).
>
> Hope this helps, greetings.
>
> And for some strange reason we got the following error after slicing the
> segments into 50K-URL pieces:
>
> $ nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/
> -slice 50000
> Merging 1 segments to arscrminternal/outseg/20070423163605
> SegmentMerger: adding arscrminternal/segments/20070421110321
> SegmentMerger: using segment data from: content crawl_generate crawl_fetch
> crawl_parse parse_data parse_text
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Slice size: 50000 URLs.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)
>
> We thought making smaller chunks would help performance, but we didn't even
> get around to testing it because of the above error. Any ideas?
>
>
> JoostRuiter wrote:
>>
>> OK, thanks for all your input, guys! I'll discuss this with my co-worker.
>> Dennis, what more information do you need?
>>
>> Thanks everyone!
>>
>>
>> Briggs wrote:
>>>
>>> One more thing...
>>>
>>> Are you using a distributed index? If so, you do not want to do this;
>>> indexes should be local to the machine that is doing the searching.
>>>
>>> On 4/23/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>>>> Without more information, this sounds like your Tomcat search
>>>> nutch-site.xml file is set up to use the DFS rather than the local
>>>> file system. Remember that job processing occurs on the DFS, but for
>>>> searching, indexes are best moved to the local file system.
>>>>
>>>> Dennis Kubes
>>>>
>>>> JoostRuiter wrote:
>>>> > Hi All,
>>>> >
>>>> > First off, I'm quite the noob when it comes to Nutch, so don't bash
>>>> > me if the following is an enormously stupid question.
>>>> >
>>>> > We're using Nutch on a P4 dual-core system (800 MHz FSB) with 4 GB of
>>>> > RAM and a 500 GB SATA (3 Gbit/s) HD. We indexed 350,000 pages into
>>>> > 1 segment of 15 GB.
>>>> >
>>>> > Performance is really poor; when we do get search results, they take
>>>> > multiple minutes. When the query is longer, we get the following:
>>>> >
>>>> > "java.lang.OutOfMemoryError: Java heap space"
>>>> >
>>>> > What we have tried to improve this:
>>>> > - Slice the segments into smaller chunks (max 50000 URLs per segment)
>>>> > - Set io.map.index.skip to 8
>>>> > - Set indexer.termIndexInterval to 1024
>>>> > - Cluster with Hadoop (4 nodes to search)
>>>> >
>>>> > Any ideas? Missing information? Please let me know; this is my
>>>> > graduation internship and I would really like to get a good grade ;)
>>>
>>>
>>> --
>>> "Conscious decisions by conscious minds are what make reality real"
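
A couple of concrete follow-ups on the thread above. First, the "Job failed!" is the local job runner surfacing the same OutOfMemoryError from the log snippet: without DFS, mergesegs runs the whole map/reduce job inside a single JVM via LocalJobRunner, so the merge is bounded by that one process's heap. A minimal sketch of raising it, assuming a 0.8/0.9-era bin/nutch script (which reads NUTCH_HEAPSIZE, in megabytes, to set -Xmx); the paths simply mirror the command quoted above:

    # Give the Nutch JVM ~2 GB of heap instead of the script default.
    # bin/nutch turns NUTCH_HEAPSIZE (MB) into a -Xmx flag for the JVM it launches.
    export NUTCH_HEAPSIZE=2000
    bin/nutch mergesegs arscrminternal/outseg \
        -dir arscrminternal/segments/ -slice 50000

On a real cluster the analogous knob is mapred.child.java.opts (e.g. -Xmx1024m) in hadoop-site.xml, since task JVMs are forked there instead of running in-process.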
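
Second, on Dennis's and Briggs's advice about searching locally: the nutch-site.xml visible to the Tomcat webapp decides where the searcher reads from. A hedged sketch, assuming 0.8/0.9-era property names; the /local/crawl path is purely illustrative and should contain the index/ and segments/ directories:

    <!-- nutch-site.xml for the search webapp: read from local disk, not DFS -->
    <property>
      <name>fs.default.name</name>
      <value>local</value>
    </property>
    <property>
      <name>searcher.dir</name>
      <value>/local/crawl</value>
    </property>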
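
Finally, to make the "what we have tried" list concrete: the two tuning properties mentioned there are ordinary XML overrides in the conf directory (io.map.index.skip belongs to Hadoop, indexer.termIndexInterval to Nutch); the values below are just the poster's own settings, not recommendations:

    <property>
      <name>io.map.index.skip</name>
      <value>8</value>
      <!-- skip 8 MapFile index entries between loaded ones: less RAM, more seeks -->
    </property>
    <property>
      <name>indexer.termIndexInterval</name>
      <value>1024</value>
      <!-- Lucene term index interval: larger values cut index memory at query-time cost -->
    </property>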