> Just submit my patch and try to compile; you will see what you need to
> change.
> Just some changes of new Properties() to ContentProperties() and maybe
> the import of this class.

Cool, I'll have a look at your patch :)

>> It's much better than what I have right now. However, it's still not
>> 100% and fetching all the urls would mean implementing some sort of
>> iterative process until all the urls are finally fetched.
>> Do you have an idea why we are still missing 10 to 20%?
>
> Well, since I started with dmoz, those are the urls that do not exist
> anymore but are still listed in dmoz. You also have some general errors
> like unable to parse, host down, etc.
> So a 10% error rate is not too bad; if you later have some hundred
> million urls you will see that this error rate is around 5% or less.

In my results I didn't include the urls that failed to fetch, regardless
of the error. The percentages were the fetch attempts (so they include
the errors), which should be 100%.

So, with your patch, did you see 100% of urls *attempting* a fetch?

Thanks,
--Flo