That's really interesting. From the literature I have been reading,
I always thought that politeness policies required a 1s - 5s wait
between queries to the same host, and that this was the case
regardless of HTTP/1.0 or HTTP/1.1. If it isn't, do we really want
to add the logic of determining whether a host is a "num-sites-per-
minute" host, a "num-sites-per-hour" host, or a "1-request-per-
second" host? If we assume the former options then we will
definitely increase our throughput but most likely get a lot of hate
mail from webmasters of smaller sites, and if we assume the latter
then we are back where we started.
--Ned
(formerly nedrocks at gmail dot com)
On Oct 24, 2007, at 8:29 AM, Ken Krugler wrote:
Ned Rockson wrote:
So recently I switched to Fetcher2 over Fetcher for larger whole-
web fetches (50-100M at a time). I found that the URLs generated
are not optimal because they are simply randomized by a hash
comparator. In one crawl on 24 machines it took about 3 days to
crawl 30M URLs - at least a three-fold increase over the old
benchmarks I had set with the regular Fetcher.java.
Anyway, I realized that randomization gets close to a good
ordering, but to get an optimal ordering, URLs from the same host
should be as far apart in the list as possible. So I wrote a series
of two map/reduces to optimize the ordering, and for a list of 25M
documents it takes about 10 minutes on our cluster. Right now I
have it in its own class, but I figured it could go in
Generator.java, with a flag in nutch-default.xml determining
whether the user wants to use it.
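Roughly, the idea (in plain Java, ignoring the map/reduce plumbing;
this is a sketch of the approach, not the actual patch) is:

import java.net.URL;
import java.util.*;

public class HostSpreader {
  // Reorder a list of URLs so that two URLs from the same host end up
  // as far apart in the fetch list as possible.
  public static List<String> spread(List<String> urls) throws Exception {
    // Step 1 (roughly what a first map/reduce pass might do): bucket URLs by host.
    Map<String, Deque<String>> byHost = new LinkedHashMap<String, Deque<String>>();
    for (String url : urls) {
      String host = new URL(url).getHost();
      Deque<String> queue = byHost.get(host);
      if (queue == null) {
        queue = new ArrayDeque<String>();
        byHost.put(host, queue);
      }
      queue.add(url);
    }
    // Step 2: emit one URL per host per round (round-robin), so consecutive
    // URLs almost always come from different hosts.
    List<String> ordered = new ArrayList<String>(urls.size());
    while (!byHost.isEmpty()) {
      Iterator<Deque<String>> it = byHost.values().iterator();
      while (it.hasNext()) {
        Deque<String> queue = it.next();
        ordered.add(queue.poll());
        if (queue.isEmpty()) {
          it.remove();
        }
      }
    }
    return ordered;
  }
}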
So, should I submit the code by email? Is there some way to
change Generator.java or should I just submit the function in its
own class?
This sounds intriguing. Nutch mailing lists strip attachments -
you should create a JIRA issue, copy this description, and attach
a patch in unified diff format (svn diff will do).
For what it's worth, a year ago I had a series of discussions with
people who manage large web sites about acceptable crawl rates.
This was in the context of whether using keep-alive made sense as a
way to optimize crawl performance.
For sites that mostly serve up static content, keep-alive was seen
as a good thing, in that it reduced the load on the server from
constantly creating and destroying connections.
For sites that build pages from multiple data sources, this was a
minor issue - they were more concerned about total # of pages
fetched per time interval, given the high cost of assembling the page.
Which is the interesting point - nobody talked about the amount of
time between requests. They were all interested in monitoring pages
fetched per time interval - typically pages per hour for smaller
sites, and pages per minute for ones with heavier traffic.
Based on the above, what seems like the most efficient approach to
fetching would be to use a pages/minute/domain restriction and a
kept-alive connection (with HTTP/1.1), fetching with no delay until
that limit is reached, then switching to another set of URLs from a
different domain.
This assumes a maximum of one thread per domain, and that URLs are
segmented so that each domain is handled inside a single task,
allowing the one-thread-per-domain and pages/minute/domain
restrictions to be enforced properly.
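To make that concrete, here is a minimal sketch of the per-domain
quota check (a hypothetical class, not existing Nutch code): the
single fetcher thread for a domain fetches back-to-back over its
kept-alive connection while tryAcquire() returns true, and moves on
to another domain once the minute's budget is spent.

class DomainQuota {
  private final int pagesPerMinute;
  private long windowStart = System.currentTimeMillis();
  private int fetchedInWindow = 0;

  DomainQuota(int pagesPerMinute) {
    this.pagesPerMinute = pagesPerMinute;
  }

  // Returns true if another page may be fetched from this domain right now.
  synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    if (now - windowStart >= 60000L) {  // start a new one-minute window
      windowStart = now;
      fetchedInWindow = 0;
    }
    if (fetchedInWindow < pagesPerMinute) {
      fetchedInWindow++;
      return true;   // keep fetching with no delay over the kept-alive connection
    }
    return false;    // budget spent - switch to URLs from a different domain
  }
}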
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"