[
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810
]
Andrzej Bialecki commented on NUTCH-882:
-----------------------------------------
This functionality is very useful for larger crawls. Some comments about the
design:
* the table can be populated by injection, as in the patch, or from webtable.
Since the keys come from different spaces (URLs vs. hosts) I think it would be
very tricky to do this on the fly in one of the existing jobs... so this means
an additional step in the workflow.
* I'm worried about the scalability of the approach taken by HostMDApplierJob -
per-host data will be multiplied by the number of urls from a host and put into
webtable, which will in turn balloon the size of webtable...
A little background: what we see here is a design issue typical for mapreduce,
where you have to merge data keyed by keys from different spaces (with
different granularity). Possible solutions involve:
* first converting the data to a common key space and then submit both data as
mapreduce inputs, or
* submitting only the finer-grained input to mapreduce and dynamically
converting the keys on the fly (and reading data directly from the
coarser-grained source, accessing it randomly).
A similar situation is described in HADOOP-3063 together with a solution,
namely, to use random access and use Bloom filters to quickly discover missing
keys.
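To make option (1) concrete, a minimal sketch of rekeying the finer-grained input (the class and method names here are hypothetical, not existing Nutch code): once per-URL records are rekeyed by host, both inputs share the host key space and an ordinary reduce-side join applies.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostKey {
    /** Map a URL key into the coarser host key space. */
    public static String toHostKey(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null; // unparsable URLs carry no host key
        }
    }
}
```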
So I propose that instead of statically merging the data (HostMDApplierJob) we
could merge it dynamically on the fly, by implementing a high-performance
reader of host table, and then use this reader directly in the context of
map()/reduce() tasks as needed. This reader should use a Bloom filter to
quickly determine nonexistent keys, and it may use a limited amount of
in-memory cache for existing records. The Bloom filter data should be
re-computed on updates and stored/retrieved, to avoid lengthy initialization.
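A rough sketch of such a reader (class and method names are hypothetical, not existing Nutch APIs): the Bloom filter answers "definitely absent" without any IO, and a small access-ordered LRU map caches records that were actually read; the backing random-access lookup is left abstract.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class HostDbReader {
    private final BitSet bloom;
    private final int bits;
    private final int k; // number of hash functions
    private final LinkedHashMap<String, String> cache;

    public HostDbReader(int bits, int k, final int cacheSize) {
        this.bits = bits;
        this.k = k;
        this.bloom = new BitSet(bits);
        // access-ordered map that evicts the least recently used record
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > cacheSize;
            }
        };
    }

    // Double hashing: k bit indexes derived from two base hashes.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return ((h1 + i * h2) & 0x7fffffff) % bits;
    }

    /** Called while (re)building the filter after host table updates. */
    public void add(String host) {
        for (int i = 0; i < k; i++) bloom.set(index(host, i));
    }

    /** False means the host is definitely not in the table: skip the IO. */
    public boolean mightContain(String host) {
        for (int i = 0; i < k; i++) {
            if (!bloom.get(index(host, i))) return false;
        }
        return true;
    }

    /** Bloom-gated, cached read; readRecord stands in for the real store. */
    public String get(String host) {
        if (!mightContain(host)) return null; // no IO for absent hosts
        String rec = cache.get(host);
        if (rec == null) {
            rec = readRecord(host);           // the only actual IO
            if (rec != null) cache.put(host, rec);
        }
        return rec;
    }

    protected String readRecord(String host) {
        return null; // placeholder: random access into the host table
    }
}
```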
The cost of using this approach is IMHO much smaller than the cost of
statically joining this data. The static join costs both space and time to
execute an additional job. Let's consider the dynamic join cost, e.g. in
Fetcher - HostDBReader would be used only when initializing host queues, so the
number of IO-s would be at most the number of unique hosts on the fetchlist (at
most, because some of the host data may be missing - here the Bloom filter
comes to the rescue, discovering this quickly without any IO). During updatedb we would
likely want to access this data in DbUpdateReducer. Keys are URLs here, and
they are ordered in ascending order - but they are in host-reversed format,
which means that URLs from similar hosts and domains are close together. This
is beneficial, because when we read data from HostDBReader we will read records
that are close together, thus avoiding seeks. We can also cache the retrieved
per-host data in DbUpdateReducer.
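To illustrate the caching idea in DbUpdateReducer (hypothetical wrapper, not existing Nutch code), and assuming Nutch 2.0-style reversed-URL keys such as "com.example.www:http/index.html": ascending key order keeps a host's URLs adjacent, so one cached record serves the whole run of keys.

```java
public class HostRunCache {
    private String lastHost;
    private String lastRecord;

    /** The reversed host is everything before the first ':' in the key. */
    static String reversedHost(String reversedUrlKey) {
        int i = reversedUrlKey.indexOf(':');
        return i < 0 ? reversedUrlKey : reversedUrlKey.substring(0, i);
    }

    /** Hits the host table only when the run of keys changes host. */
    public String lookup(String reversedUrlKey) {
        String host = reversedHost(reversedUrlKey);
        if (!host.equals(lastHost)) {
            lastHost = host;
            lastRecord = readRecord(host);
        }
        return lastRecord;
    }

    protected String readRecord(String host) {
        return "record:" + host; // stand-in for a HostDBReader lookup
    }
}
```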
> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: 2.0
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for :
> * customising the behaviour of the fetching on a host basis e.g. number of
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments
> are of course already welcome