[ http://issues.apache.org/jira/browse/NUTCH-378?page=all ]

Andrzej Bialecki  updated NUTCH-378:
------------------------------------

    Attachment: MetaWrapper.java

> MetaWrapper decorator
> ---------------------
>
>                 Key: NUTCH-378
>                 URL: http://issues.apache.org/jira/browse/NUTCH-378
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>         Attachments: MetaWrapper.java
>
>
> First, a bit of background.
> Currently some tools (Indexer, SegmentMerger, CrawlDbReducer) use 
> ObjectWritable to pass data from different parts of segment(s) to map-reduce 
> methods. However, there is a high risk that this data is processed 
> incorrectly, because map-reduce methods no longer know the exact source of 
> any given data item.
> Example: Indexer may process many segments at the same time. In its reduce() 
> method it receives a set of values coming from different parts of the 
> segment, but found at the same key (url). However, if the same page is 
> fetched multiple times, Indexer will receive multiple sets of values from 
> different segments (e.g. multiple fetchDatum, parseData, etc). It may happen 
> that some of the data items it picks up for further processing belong to one 
> set and some to another, resulting in a final set that is a 
> hodge-podge of partial data coming from different segments. This could be 
> avoided if each value had metadata to mark it as belonging to a particular 
> segment. Indexer could then collect all the complete sets, and then 
> select the most recent one for further processing.
> A similar situation occurs in SegmentMerger, where data coming from different 
> segments is tagged with its source. However, the ParseText class doesn't 
> support any metadata, so its text has to be changed to contain the tag. This 
> is unwieldy and far from elegant.
> A different problem occurs in CrawlDbReducer - we have instances of the same 
> class, but it's sometimes difficult to determine where they originally came 
> from. This also limits us to updating CrawlDb from one segment at a time; 
> otherwise CrawlDatum instances from earlier segments would be 
> indistinguishable from those from newer segments... In short, the 
> functionality and internal logic here could be vastly improved if we knew 
> where any CrawlDatum instance came from.
> The attached class provides this functionality - instead of using 
> ObjectWritable (or plain CrawlDatum) we can wrap instances of input data in 
> MetaWrapper, and add necessary metadata that will support the processing at 
> hand. Then in map-reduce methods we can unpack original values, and use 
> additional metadata.
> Note: this wrapping/unwrapping is applied only during map-reduce jobs - data 
> stored in DBs and segments would remain the same.
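
The wrapping idea described above can be sketched in plain Java. This is only an illustration and not the attached MetaWrapper.java: the class name MetaWrapperSketch, the nested Wrapped type, the "segment" metadata key and the latestSegmentValues helper are all hypothetical, and no Hadoop or Nutch classes are used. It also assumes that segment names are timestamps that sort lexicographically, and it does not check whether a set is complete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MetaWrapperSketch {

    /** Hypothetical stand-in for the wrapper: a value plus string metadata. */
    static final class Wrapped<T> {
        private final T value;
        private final Map<String, String> meta = new HashMap<>();

        Wrapped(T value) { this.value = value; }

        Wrapped<T> setMeta(String key, String val) { meta.put(key, val); return this; }
        String getMeta(String key) { return meta.get(key); }
        T get() { return value; }
    }

    // Assumed metadata key naming the segment a value came from.
    static final String SEGMENT_KEY = "segment";

    /**
     * Groups the values received for one key (url) by source segment and
     * returns the unwrapped values of the most recent segment. Values
     * without a segment tag are grouped under the empty string.
     */
    static <T> List<T> latestSegmentValues(List<Wrapped<T>> values) {
        TreeMap<String, List<T>> bySegment = new TreeMap<>();
        for (Wrapped<T> w : values) {
            String segment = w.getMeta(SEGMENT_KEY);
            if (segment == null) {
                segment = "";
            }
            bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(w.get());
        }
        // TreeMap keeps keys sorted, so the last entry is the newest segment.
        return bySegment.isEmpty() ? new ArrayList<>() : bySegment.lastEntry().getValue();
    }

    public static void main(String[] args) {
        List<Wrapped<String>> values = new ArrayList<>();
        values.add(new Wrapped<String>("fetchDatum-old").setMeta(SEGMENT_KEY, "20060901000000"));
        values.add(new Wrapped<String>("fetchDatum-new").setMeta(SEGMENT_KEY, "20061001000000"));
        values.add(new Wrapped<String>("parseData-old").setMeta(SEGMENT_KEY, "20060901000000"));
        values.add(new Wrapped<String>("parseData-new").setMeta(SEGMENT_KEY, "20061001000000"));

        // Prints: [fetchDatum-new, parseData-new]
        System.out.println(latestSegmentValues(values));
    }
}
```

Because every value carries its segment tag, a reduce-style method can keep the sets apart instead of mixing values from different segments.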

