MetaWrapper decorator
---------------------

                 Key: NUTCH-378
                 URL: http://issues.apache.org/jira/browse/NUTCH-378
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 
         Attachments: MetaWrapper.java

First, a bit of background.

Currently some tools (Indexer, SegmentMerger, CrawlDbReducer) use 
ObjectWritable to pass data from different parts of segment(s) to map-reduce 
methods. However, there is a high risk that this data is processed incorrectly, 
because map-reduce methods no longer know the exact source of any given data 
item.

Example: Indexer may process many segments at the same time. In its reduce() 
method it receives a set of values coming from different parts of the segment, 
but found at the same key (url). However, if the same page is fetched multiple 
times, Indexer will receive multiple sets of values from different segments 
(e.g. multiple fetchDatum, parseData, etc). Some of the data items it picks up 
for further processing may then belong to one set and some to another, so the 
final set is a hodge-podge of partial data coming from different segments. 
This could be avoided if each value had metadata to mark it as belonging to a 
particular segment. Indexer could then collect every complete set, and select 
the most recent one for further processing.
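
A minimal sketch of that selection step, to make the idea concrete. It is not 
taken from Indexer or from the attached class: TaggedValue, SegmentSetSelector 
and newestCompleteSet are hypothetical names, the part names are only 
illustrative, and it assumes that segment names are timestamps, so that sorting 
them as strings puts them in chronological order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical value tagged with the segment and the segment part it came from.
final class TaggedValue {
    final String segment; // e.g. "20060925120000"; assumed to sort chronologically
    final String part;    // illustrative part name, e.g. "crawl_fetch"
    final Object value;

    TaggedValue(String segment, String part, Object value) {
        this.segment = segment;
        this.part = part;
        this.value = value;
    }
}

final class SegmentSetSelector {
    // Returns the values (keyed by part) of the newest segment that supplied
    // every required part, or null if no segment supplied a complete set.
    static Map<String, Object> newestCompleteSet(List<TaggedValue> values,
                                                 Set<String> requiredParts) {
        // Group the values by segment; TreeMap keeps the segment names sorted.
        TreeMap<String, Map<String, Object>> bySegment = new TreeMap<>();
        for (TaggedValue v : values) {
            bySegment.computeIfAbsent(v.segment, s -> new HashMap<>())
                     .put(v.part, v.value);
        }
        // Walk from the newest segment to the oldest and stop at the first
        // one whose set is complete.
        for (Map<String, Object> set : bySegment.descendingMap().values()) {
            if (set.keySet().containsAll(requiredParts)) {
                return set;
            }
        }
        return null;
    }
}
```

With this shape a reduce() call would never mix a fetchDatum from one segment 
with parseData from another: an incomplete newer set is skipped in favour of 
the most recent complete one.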

A similar situation occurs in SegmentMerger, where data coming from different 
segments is tagged with its source. However, the ParseText class doesn't support 
any metadata, so its text has to be changed to contain the tag. This is 
unwieldy and far from elegant.

A different problem occurs in CrawlDbReducer - we have instances of the same 
class, but it's sometimes difficult to determine where they originally came 
from. This also limits us to updating CrawlDb from one segment at a time; 
otherwise CrawlDatum instances from earlier segments would be indistinguishable 
from those from newer segments. In short, the functionality and internal logic 
here could be vastly improved if we knew where any CrawlDatum instance came 
from.

The attached class provides this functionality - instead of using 
ObjectWritable (or plain CrawlDatum) we can wrap instances of input data in 
MetaWrapper, and add whatever metadata the processing at hand needs. Then in 
map-reduce methods we can unpack the original values, and use the additional 
metadata.
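
To illustrate the wrap/unwrap idea, here is a minimal sketch. It is not the 
attached MetaWrapper.java: the real class has to be serializable by Hadoop, 
which this sketch deliberately leaves out. MetaWrapperSketch, SEGMENT_KEY and 
the method names are assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the attached class: a decorator that carries a
// value together with string metadata describing where the value came from.
final class MetaWrapperSketch<T> {
    // Assumed metadata key, not taken from the attachment.
    static final String SEGMENT_KEY = "segment";

    private final T value;
    private final Map<String, String> meta = new HashMap<>();

    MetaWrapperSketch(T value) {
        this.value = value;
    }

    // Adds or replaces one metadata entry; returns this to allow chaining.
    MetaWrapperSketch<T> setMeta(String key, String val) {
        meta.put(key, val);
        return this;
    }

    // Returns the metadata value for the key, or null if it was never set.
    String getMeta(String key) {
        return meta.get(key);
    }

    // Unwraps the original value, which is left unchanged by the wrapper.
    T get() {
        return value;
    }
}
```

A map() method would then emit something like 
new MetaWrapperSketch<>(datum).setMeta(MetaWrapperSketch.SEGMENT_KEY, segmentName), 
and reduce() would read getMeta(SEGMENT_KEY) before calling get(). Since the 
tag lives beside the value, a class without metadata support, such as 
ParseText, needs no changes to its own content.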

Note: this wrapping/unwrapping is applied only during map-reduce jobs - data 
stored in DBs and segments would remain the same.


        
