Hi d_k,

On Tue, Jan 21, 2014 at 11:20 AM, <dev-digest-h...@nutch.apache.org> wrote:

>
> I'm working on porting NUTCH-1622 to Nutch 2
>

Excellent


> and the path I took was to add a MapWritable field to the Outlink class to
> hold the metadata.
>

So this is identical to the approach implemented by Julien in NUTCH-1622?


>
> In order to store the metadata in the WebPage so it will be passed along
> the mappers and reducers I used the metadata field of the WebPage class.
>

Yes this sounds right. AFAIK, this is the only place we can access it...


>
> Because the putToMetadata method of the WebPage accepts a ByteBuffer, in
> order to convert the MapWritable to a ByteBuffer i'm using something along
> the lines of:
>
>
...snip


> And I would be happy to get some input on:
> 1) Is it the correct way to convert the MapWritable to a ByteBuffer to be
> stored in the WebPage's metadata?
>

There are many places where we already convert values to ByteBuffer before
putting them into the WebPage metadata map field. You can try grep'ing the
codebase for 'putToMetadata'. Generally speaking I think the code is OK.
Where are you thinking of adding it, though?
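
As a rough illustration of that conversion, here is a minimal sketch of a
round trip between a map and a ByteBuffer. It makes some assumptions: the
class name OutlinkMetadataCodec is hypothetical, and a plain
Map<String, String> stands in for MapWritable so the example runs without
Hadoop on the classpath. With a real MapWritable, the loop bodies would be
replaced by its write(DataOutput) and readFields(DataInput) methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
public class OutlinkMetadataCodec {

  // Writes the entry count, then each key and value as a length-prefixed string.
  public static ByteBuffer toByteBuffer(Map<String, String> metadata) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(metadata.size());
      for (Map.Entry<String, String> entry : metadata.entrySet()) {
        out.writeUTF(entry.getKey());
        out.writeUTF(entry.getValue());
      }
    }
    return ByteBuffer.wrap(bytes.toByteArray());
  }

  // Reads the map back; works on a duplicate so the caller's position is untouched.
  public static Map<String, String> fromByteBuffer(ByteBuffer buffer) throws IOException {
    ByteBuffer view = buffer.duplicate();
    byte[] raw = new byte[view.remaining()];
    view.get(raw);
    Map<String, String> metadata = new LinkedHashMap<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
      int size = in.readInt();
      for (int i = 0; i < size; i++) {
        String key = in.readUTF();
        String value = in.readUTF();
        metadata.put(key, value);
      }
    }
    return metadata;
  }
}
```

The resulting ByteBuffer is the kind of value that would then be handed to
putToMetadata.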

> 2) Should the metadata be stored in the metadata field as a ByteBuffer or
> is there a better way to pass along the metadata?

AFAIK ByteBuffer is the way we want to do it. However, we have been caught
out before by ByteBuffer-to-String conversions (see NUTCH-1591), so we want
to make this both consistent and correct.
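
To show the kind of ByteBuffer-to-String mistake that is easy to make, here
is a minimal sketch. It is a general illustration and not a description of
the exact bug in NUTCH-1591; the class name ByteBufferStrings is
hypothetical. The point is that the backing array can be larger than the
readable region of the buffer, so decoding should respect position and
limit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch.
public class ByteBufferStrings {

  // Decodes only the bytes between position and limit as UTF-8.
  // A duplicate is decoded so the caller's position is left unchanged.
  public static String toUtf8String(ByteBuffer buffer) {
    return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
  }

  public static void main(String[] args) {
    byte[] raw = "xxhelloyy".getBytes(StandardCharsets.UTF_8);
    // The buffer exposes only the 5 bytes starting at offset 2.
    ByteBuffer buffer = ByteBuffer.wrap(raw, 2, 5);

    // Risky: array() returns the whole backing array, ignoring position and limit.
    System.out.println(new String(buffer.array(), StandardCharsets.UTF_8)); // xxhelloyy

    // Safer: decode just the readable region.
    System.out.println(toUtf8String(buffer)); // hello
  }
}
```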


> 3) Did I waste my time working with MapWritable and could of used any java
> collection as long as the target JVM could of deserialized it considering
> that all that is passed is an array of bytes and Outlink is never passed as
> it is. Outlinks are passed as a map between url and anchor (utf8, utf8).
>
> ... my next change was to make the Utf8 allocation static... :-P
>

I honestly couldn't tell you what we gain by using MapWritable. However, it
would be good practice for us to keep the implementations for trunk and 2.x
as similar as possible, especially with regard to how Outlinks are
represented.

Feel free to add your comments to the actual Jira issue.
Lewis
