Hello, I have a crawldb consisting of about 57 million URLs, and I'm generating segments of 1 million URLs each.
Generate takes 7 hours 39 minutes to complete on my cluster. I have 4 machines in the cluster, each a Pentium 4 HT 3.0 GHz with 1 GB of RAM and a 150 GB IDE drive. Merging URLs into the crawldb took 4 hours 34 minutes last time.

What I wanted to ask is whether those times are normal for that kind of configuration. Is the generate phase really so processor-intensive? When I check, each node has two threads, each taking up 100% of an HT pseudo-core.

I also have a problem with partitioning: 3 times out of 4 it fails because it cannot open the temporary files created by the generate job. I'm using the trunk version of Nutch with Hadoop 0.15.

I need to find a way to speed up crawldb processing. I want to create an updatable index of about 30 million pages that could be refreshed every month. I do not need the scoring-opic plugin, but I couldn't disable it; scoring isn't necessary because I'm using the index to search for plagiarism in our university students' papers.

I was thinking about moving the whole crawldb into a database (MySQL/PostgreSQL), generating the URLs to crawl from there, and then importing them into Nutch using a clean crawldb and text files.

Please let me know if you have any suggestions on how to speed up the crawling process.

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
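
P.S. To make the last idea a bit more concrete, here is a rough sketch of the workflow I had in mind, assuming the standard inject and generate commands; the directory names and the -topN/-numFetchers values below are only placeholders, not my actual setup:

    # 1. Export the URLs selected for this crawl cycle from the external
    #    database into plain text files (one URL per line) under urls/.

    # 2. Inject them into a fresh, empty crawldb instead of the full 57M one:
    bin/nutch inject crawl_fresh/crawldb urls/

    # 3. Generate a segment from that small crawldb; -topN caps the segment
    #    size and -numFetchers matches the number of fetch tasks on the
    #    4-node cluster:
    bin/nutch generate crawl_fresh/crawldb crawl_fresh/segments -topN 1000000 -numFetchers 4

The point would be to avoid scanning the full 57-million-entry crawldb on every generate. Does that look reasonable?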
