Any way you could post your conf/nutch-site.xml and walk through your crawl process a bit?

Thanks,
Earl

--- Paul Harrison <[EMAIL PROTECTED]> wrote:

> Murray,
>
> We are running on the following:
>
> 5 Pentium 4 3.2 GHz machines, 4 GB of RAM each, one 40 GB OS drive
> and two SATA 250 GB data drives each. We are running the latest
> version of Fedora and have the data drives set up with ReiserFS. We
> are running JDK 1.5 and Tomcat 5.5.
>
> On a small set of 20 million I don't see much performance
> degradation, especially if it is all on one machine. Where things
> get bad is in the distributed search. We are actually contemplating
> rewriting the distributed search code.
>
> Thanks,
>
> Paul
>
> -----Original Message-----
> From: Murray Hunter [mailto:[EMAIL PROTECTED]
> Sent: Monday, October 17, 2005 9:11 AM
> To: [email protected]
> Subject: RE: Nutch Search Speed Concern
>
> Paul and TL,
> I was wondering if you could detail how you have your clusters
> configured, hardware-wise, i.e. how many servers are used for each
> purpose, especially with regard to how your storage is configured.
>
> We tested search for a 20 million page index on a dual-core 64-bit
> machine with 8 GB of RAM, with the Nutch data stored on another
> server and accessed through Linux NFS, and its performance was
> terrible. It looks like the bottleneck was NFS, so I was wondering
> how you had your storage set up. Are you using NDFS, or is it split
> up over multiple servers? We are trying to build a system that could
> handle at least 50 million pages, so we would appreciate any advice
> on the best way to configure the servers. Originally we were
> thinking 3 servers, 1 for crawling and indexing and 2 for search,
> would be enough for that size of index.
>
> Thanks,
> Murray
>
> -----Original Message-----
> From: Paul Harrison [mailto:[EMAIL PROTECTED]
> Sent: Friday, October 14, 2005 7:40 PM
> To: [email protected]
> Subject: RE: Nutch Search Speed Concern
>
> I too would love to hear some answers on this one. We have a 100
> million page implementation on 5 machines, each with 4 GB of RAM and
> 2 SATA drives of 250 GB each. Part of what I have noticed is that
> Lucene does some sort of caching, in that if you run the same search
> again the results come back quite quickly. I too have noticed that
> different terms have different search response times and that the
> problem gets worse with the number of terms in the query. I have
> also noticed that distributed search has problems. The main search
> machine waits on the other machines to serve up their results before
> it will respond. So it appears that your search is only as fast as
> your slowest responding machine or the timeout, whichever comes
> first. If anyone has any suggestions on tuning the distributed
> search, or general suggestions on speeding up retrieval times with a
> large set, I am all ears.
>
> Thanks,
>
> Paul
>
> -----Original Message-----
> From: TL [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 13, 2005 12:15 PM
> To: [email protected]
> Subject: Nutch Search Speed Concern
>
> Search Speed
>
> What are the most important factors in nutch/lucene's search speed?
>
> I've been testing nutch's search speed on a search pool with about
> 100M records (separated evenly into 30 segments), and discovered
> that certain search terms have a significantly higher search time
> than others. Some searches take 30 ms while others take upwards of
> 3000 ms.
>
> At first, there seemed to be a direct relationship between the total
> number of results from a given query and the time it took to
> complete. But after further testing, that relationship did not hold
> true for all cases. There seem to be other factors that directly
> affect the speed of a search.
>
> Has anyone else encountered this issue? Or have some insight into
> the impact of certain factors on search speed?
>
> Thanks.
>
> - T

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
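
For anyone reading this thread in the archive: the distributed-search behavior Paul describes (the front-end waits on every search machine, so a query takes as long as the slowest machine or the timeout, whichever comes first) can be illustrated with a small simulation. This is only a sketch and is not Nutch's actual distributed search code. The names `query_shard` and `scatter_gather`, the simulated delays, and the numeric "hits" are all made up for illustration.

```python
import concurrent.futures as cf
import time


def query_shard(delay_s, hits):
    """Hypothetical stand-in for one search machine: wait, then return hits."""
    time.sleep(delay_s)
    return hits


def scatter_gather(shards, timeout_s):
    """Query every simulated shard in parallel and merge what arrives in time.

    shards: list of (delay_s, hits) pairs, one per simulated machine.
    Returns (merged_hits_sorted_descending, elapsed_seconds).
    """
    start = time.monotonic()
    pool = cf.ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(query_shard, delay, hits) for delay, hits in shards]

    # Block until every shard has answered or the deadline passes.
    done, _late = cf.wait(futures, timeout=timeout_s)

    merged = []
    for future in done:
        merged.extend(future.result())

    # Do not wait for late shards; their results are simply dropped.
    pool.shutdown(wait=False)
    elapsed = time.monotonic() - start
    return sorted(merged, reverse=True), elapsed


if __name__ == "__main__":
    # All shards answer in time: latency tracks the slowest shard.
    print(scatter_gather([(0.01, [3, 1]), (0.05, [2])], timeout_s=1.0))
    # One slow shard: latency is capped by the timeout and its hits are lost.
    print(scatter_gather([(0.01, [3, 1]), (0.6, [2])], timeout_s=0.1))
```

In this toy model, lowering the timeout trades completeness for latency, which matches the "whichever comes first" observation in the quoted message. Whether that trade-off is tunable in a given Nutch setup is something to check against its own configuration.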
