- "job.setPartitionerClass(PartitionUrlByHost.class);" in the generate
method

Yes, this is the line you need to change. The rest can stay as it is for now.
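To illustrate what the change means, here is a minimal plain-Java sketch of the two partitioning ideas. It does not use the Hadoop or Nutch classes; `PartitionSketch`, `byHash` and `byHost` are hypothetical names made up for this example, and the real by-host partitioner may differ in detail.

```java
import java.net.URI;

public class PartitionSketch {
    // Idea behind a generic hash partitioner: spread keys by the hash of the whole key.
    static int byHash(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Idea behind a by-host partitioner: hash only the host part,
    // so all URLs of one host end up in the same partition.
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return byHash(host == null ? url : host, numPartitions);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.org/a", "http://example.org/b", "http://example.com/"
        };
        for (String u : urls) {
            System.out.println(u + " byHash=" + byHash(u, 4) + " byHost=" + byHost(u, 4));
        }
    }
}
```

With `byHost`, the two example.org URLs always share a partition; with `byHash` they are spread independently of their host.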

Do I only need to change the last line to use HashPartitioner.class,
or do I need to modify the other 2 references as well?

Then also apply the case-insensitive content properties patch to
0.8. You may need to change 3 other classes (e.g. the fetcher), since the
patch is for 0.7.

Just apply my patch and try to compile; you will see what you need to change. It is just a few changes of new Properties() to ContentProperties(), and maybe the import of this class.
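For readers who do not have the patch at hand, the following is a rough sketch of what "case-insensitive properties" means. It is not the ContentProperties class from the patch; `CaseInsensitiveProps` is a hypothetical stand-in built on a standard `TreeMap`, and the real class may work differently.

```java
import java.util.Map;
import java.util.TreeMap;

public class CaseInsensitiveProps {
    // A TreeMap with a case-insensitive comparator treats
    // "Content-Type" and "content-type" as the same key.
    private final Map<String, String> values =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void setProperty(String key, String value) {
        values.put(key, value);
    }

    public String getProperty(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        CaseInsensitiveProps props = new CaseInsensitiveProps();
        props.setProperty("Content-Type", "text/html");
        // Lookup succeeds even though the case differs.
        System.out.println(props.getProperty("content-type"));
    }
}
```

A plain java.util.Properties lookup is case-sensitive, which is presumably why header names that differ only in case cause trouble without the patch.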

It's much better than what I have right now. However, it's still not
100%, and fetching all the URLs would mean implementing some sort of
iterative process until all the URLs are finally fetched.
Do you have an idea why we are still missing 10 to 20%?

Well, since I started with dmoz, those are the URLs that no longer exist but are still listed in dmoz. You also get some general errors, like unable to parse, host down, etc. So a 10% error rate is not too bad; later on, when you have some hundred million URLs, you will see that the error rate is less than about 5%.
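The iterative process mentioned above could look roughly like the sketch below. It is plain Java, not Nutch code: `RetryRounds`, `fetchInRounds` and the `fetch` callback are hypothetical names for this example. Since URLs that no longer exist will never succeed, the loop needs a round limit instead of running until everything is fetched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class RetryRounds {
    // Runs up to maxRounds passes; each pass retries only the URLs
    // that failed in the previous pass. Returns the URLs still
    // unfetched after the last pass.
    static List<String> fetchInRounds(List<String> urls,
                                      Predicate<String> fetch,
                                      int maxRounds) {
        List<String> pending = new ArrayList<>(urls);
        for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
            List<String> failed = new ArrayList<>();
            for (String url : pending) {
                if (!fetch.test(url)) {
                    failed.add(url);
                }
            }
            pending = failed;
        }
        return pending;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.org/", "http://example.org/gone");
        // Stand-in fetcher: pretend every URL containing "gone" fails permanently.
        List<String> left = fetchInRounds(urls, u -> !u.contains("gone"), 3);
        System.out.println("still unfetched: " + left);
    }
}
```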

Stefan
