Hi, I've recently switched from Nutch 0.7 to 0.8, and after some initial fits and starts, I'm past the "get it working at all" stage and into the "get reasonable performance" stage.

I've got a single machine with 4 CPUs and a lot of memory. URL fetching works great because it's (mostly) multithreaded. But as soon as I hit the reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and the phase can take days, leaving me vulnerable to losing everything should a process fail.

Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some help getting my configuration right. I've seen examples/tutorials of configurations for multiple machines; am I just "faking" multiple machines on my single node (will that work?) or is there a cleaner, simpler approach?

Alternatively, I was all excited to get an easy improvement with -numFetchers, and run 4 fetchers simultaneously to use all my CPUs, but it looks like -numFetchers has gone away. Though there was a patch for 0.8, at a quick glance it doesn't seem to have made it into the mainline source, and I don't see the value of trying to merge it in if there's a cleaner Hadoop-based approach.

Many thanks for any help.

Doug
