[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810 ]

Andrzej Bialecki  commented on NUTCH-882:
-----------------------------------------

This functionality is very useful for larger crawls. Some comments about the 
design:

* the table can be populated by injection, as in the patch, or from webtable. 
Since the keys come from different spaces (URLs vs. hosts) I think it would be 
very tricky to try to do this on the fly in one of the existing jobs... so this 
means an additional step in the workflow.

* I'm worried about the scalability of the approach taken by HostMDApplierJob - 
per-host data would be multiplied by the number of URLs from a host and put 
into webtable, which would in turn balloon the size of webtable...

A little background: what we see here is a design issue typical for mapreduce, 
where you have to merge data keyed by keys from different spaces (with 
different granularity). Possible solutions involve:
* first converting the data to a common key space and then submitting both 
datasets as mapreduce inputs, or
* submitting only the finer-grained input to mapreduce and converting the keys 
on the fly (reading the data directly from the coarser-grained source, 
accessing it randomly).
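To illustrate the first option, here's a minimal sketch of projecting the finer-grained key space (URLs) onto the coarser one (hosts) so both inputs can be joined on a common key. The class and method names are illustrative, not actual Nutch code:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: map a URL key down to the coarser host key space,
// so URL-keyed and host-keyed data can be joined on the same key.
public class HostKey {

    // Extract the host portion of a URL; this is the common join key.
    static String hostKeyOf(String url) throws URISyntaxException {
        return new URI(url).getHost();
    }
}
```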

A similar situation is described in HADOOP-3063 together with a solution: use 
random access, with Bloom filters to quickly discover missing keys.
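The Bloom filter idea can be sketched as follows - a compact bit array that answers "definitely absent" or "possibly present", so the reader can skip random IO entirely for keys that cannot exist. This is a toy illustration, not the HADOOP-3063 implementation:

```java
import java.util.BitSet;

// Toy Bloom filter sketch (illustrative only): membership test with no
// false negatives, so "false" means the key is definitely not in the table.
public class HostBloom {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    HostBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position from two base hashes of the key.
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash
        return Math.abs((h1 + i * h2) % size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    // false => definitely absent (safe to skip IO); true => probably present.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(key, i))) return false;
        return true;
    }
}
```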

So I propose that instead of statically merging the data (HostMDApplierJob) we 
could merge it dynamically on the fly, by implementing a high-performance 
reader of the host table, and then using this reader directly in the context 
of map()/reduce() tasks as needed. This reader should use a Bloom filter to 
quickly determine nonexistent keys, and it may use a limited amount of 
in-memory cache for existing records. The Bloom filter data should be 
re-computed on updates and stored/retrieved, to avoid lengthy initialization.
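A rough sketch of such a reader, assuming a random-access store behind it - the membership set stands in for the Bloom filter, and an access-ordered LinkedHashMap provides the bounded in-memory cache. All names (HostDbReader, the store function) are hypothetical, not actual Nutch APIs:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical reader sketch: screen keys through a membership test
// (a Bloom filter in the real design) before any IO, and keep a small
// LRU cache of recently read host records.
public class HostDbReader {
    private final Function<String, String> store; // random-access source
    private final Set<String> known;              // stands in for the Bloom filter
    private final Map<String, String> cache;

    HostDbReader(Function<String, String> store, Set<String> known,
                 int cacheSize) {
        this.store = store;
        this.known = known;
        // Access-ordered LinkedHashMap that evicts its eldest entry: a tiny LRU.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > cacheSize;
            }
        };
    }

    String get(String host) {
        if (!known.contains(host)) return null;    // definitely absent: no IO
        return cache.computeIfAbsent(host, store); // IO only on a cache miss
    }
}
```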

The cost of using this approach is IMHO much smaller than the cost of 
statically joining this data. The static join costs both the space and the 
time to execute an additional job. Let's consider the dynamic join cost, e.g. 
in Fetcher - HostDBReader would be used only when initializing host queues, so 
the number of IOs would be at most the number of unique hosts on the fetchlist 
(at most, because some of the host data may be missing - here the Bloom filter 
comes to the rescue, discovering this quickly without doing any IO). During 
updatedb we would likely want to access this data in DbUpdateReducer. Keys are 
URLs here, and they are sorted in ascending order - but they are in 
host-reversed format, which means that URLs from similar hosts and domains are 
close together. This is beneficial, because when we read data from 
HostDBReader we will read records that are close together, thus avoiding 
seeks. We can also cache the retrieved per-host data in DbUpdateReducer.
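The locality argument rests on the host-reversed key format; a small sketch of that transformation shows why sorted URL keys cluster by domain (illustrative code, not Nutch's actual key-building utility):

```java
// Sketch of host reversal: writing host components in reverse order
// makes URLs from the same host and domain sort adjacently, so
// sequential reads in the reducer touch nearby host records.
public class ReversedHostKey {

    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }
}
```

For example, "www.example.com" and "news.example.com" both reverse to keys starting with "com.example.", so they sort next to each other.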

> Design a Host table in GORA
> ---------------------------
>
>                 Key: NUTCH-882
>                 URL: https://issues.apache.org/jira/browse/NUTCH-882
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
