I have been working on improving the Generator for the last couple of
days and here are the discussion areas I have come up with so far:
1) Would resolving IP addresses inside of the Generator be useful? If we
are limiting the number of URLs to fetch, this would let us remove
UnknownHosts beforehand, essentially giving us a better fetch list. The
con is that it could as much as double the DNS load, since resolution
would happen once during generate and once during fetching. For now I
am working on a patch that gives the option to either resolve or not.
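
To make the idea in 1) concrete, here is a rough sketch in plain Java,
not the actual patch. The class name ResolvingFilter, the resolve flag
and the pluggable resolver are all made up for illustration; only
java.net.InetAddress.getByName and java.net.URI are real JDK calls. It
resolves each host at most once per run, which keeps the extra DNS load
proportional to the number of hosts rather than the number of URLs.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch.
class ResolvingFilter {

    // Real DNS check: true if the JVM's resolver can resolve the host.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Keeps only URLs whose host passes the resolver. When resolve is
    // false the list is returned unchanged (the "option" in the patch).
    static List<String> filter(List<String> urls, boolean resolve,
                               Predicate<String> resolver) {
        if (!resolve) {
            return new ArrayList<>(urls);
        }
        Map<String, Boolean> cache = new HashMap<>(); // one lookup per host
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                host = null; // unparsable URL
            }
            if (host == null) {
                continue; // no host to resolve, drop it
            }
            if (cache.computeIfAbsent(host, resolver::test)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://good.example/a", "http://bad.invalid/b");
        // Stub resolver so the demo does not need the network; a real
        // run would pass ResolvingFilter::resolves instead.
        System.out.println(filter(urls, true, h -> h.equals("good.example")));
    }
}
```

The resolver is passed in as a parameter only so the behaviour can be
checked without a network; whether the real patch should cache lookups
this way is an open question.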
2) Normalization of URLs inside of generate. Currently there is a
normalization call in the reduce method of Selector inside of generate.
Personally I think this is in the wrong place; I think it should be in
the Selector map method. As it stands, the normalizer doesn't have any
effect anyway, because the changed URL is never collected (a bug). If
we were to put the normalizer in the map method, normalization could
turn distinct URLs into duplicates. If I am not mistaken, removing
those duplicates would need another MR job at the end of the
Generator.
Another option would be to NOT have normalization inside of the
Generator at all. Is there a good reason to have normalization in
Generator?
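
To illustrate the duplicate problem in 2), here is a small standalone
sketch in plain Java. It does not use the Nutch normalizer or Hadoop
APIs: normalize() is a made-up stand-in built on java.net.URI, and
mapAndReduce() only imitates a map step that emits the normalized URL
as key followed by a reduce step that keeps one entry per key (here,
the highest score, which is my assumption and not necessarily what the
Generator should do).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Nutch code.
class NormalizeThenDedup {

    // Stand-in normalizer: lowercases scheme and host, drops the
    // default http port and the fragment, resolves "." and "..", and
    // uses "/" for an empty path. Returns null for unusable URLs.
    static String normalize(String url) {
        try {
            URI u = new URI(url).normalize();
            if (u.getScheme() == null || u.getHost() == null) {
                return null;
            }
            String scheme = u.getScheme().toLowerCase();
            String host = u.getHost().toLowerCase();
            int port = u.getPort();
            if ("http".equals(scheme) && port == 80) {
                port = -1; // -1 means "no explicit port"
            }
            String path = u.getPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }
            return new URI(scheme, null, host, port, path,
                           u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // "Map": re-key each entry by its normalized URL.
    // "Reduce": entries sharing a key collapse into one, keeping the
    // highest score.
    static Map<String, Float> mapAndReduce(Map<String, Float> input) {
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : input.entrySet()) {
            String key = normalize(e.getKey());
            if (key == null) {
                continue; // skip URLs the normalizer rejects
            }
            out.merge(key, e.getValue(), Float::max);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> in = new LinkedHashMap<>();
        in.put("HTTP://Example.COM:80/", 1.0f);
        in.put("http://example.com", 2.5f);
        in.put("http://example.com/b#x", 0.5f);
        System.out.println(mapAndReduce(in));
    }
}
```

The first two inputs only become duplicates after normalization, so
the collapse has to happen in a step that groups by the normalized
URL. That grouping step is what the extra MR job would provide.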
Looking for thoughts from the community on these issues.
Dennis