[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364699 ] 

Stefan Groschupf commented on NUTCH-192:
----------------------------------------

* plus whatever it takes to put the class name->id mapping in the MapWritable 
header (the mapping table): let's assume 40 bytes. 

I do not write the mapping table to the output stream in any form. For now the 
id is calculated as a hash of the class name. 
I will change this so the ids are part of the class, where I will manually 
assign LongWritable id = (byte)1, UTF8 id = (byte)2, etc.
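
To illustrate the manual assignment, here is a minimal sketch of a fixed class-name-to-id table that lives in the code and is never serialized. The class name FixedClassIds and the methods register, idFor and classFor are hypothetical and are not taken from the patch. Plain strings stand in for the real Writable classes so the example is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: ids are assigned by hand and compiled into the class,
// so reader and writer agree on them without any table in the stream.
final class FixedClassIds {
    private static final Map<String, Byte> NAME_TO_ID = new HashMap<>();
    private static final Map<Byte, String> ID_TO_NAME = new HashMap<>();

    // Registers one class name under a fixed, manually chosen id.
    private static void register(String className, byte id) {
        NAME_TO_ID.put(className, id);
        ID_TO_NAME.put(id, className);
    }

    static {
        register("LongWritable", (byte) 1);
        register("UTF8", (byte) 2);
        // further types would be added here with the next free id
    }

    // Returns the one-byte id written in front of a value of this type.
    static byte idFor(String className) {
        Byte id = NAME_TO_ID.get(className);
        if (id == null) {
            throw new IllegalArgumentException("unregistered class: " + className);
        }
        return id;
    }

    // Returns the class name to instantiate when this id is read back.
    static String classFor(byte id) {
        String name = ID_TO_NAME.get(id);
        if (name == null) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
        return name;
    }
}
```

Unlike a hash of the class name, hand-assigned ids cannot collide as long as each one is used only once, but an id must never be reassigned once data has been written with it.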

For example, writing a long (e.g. a timestamp) as UTF8 takes 15 bytes, 
while writing it as a LongWritable takes 8 bytes.
8 bytes plus 1 byte for the class type is 9 bytes, which is 60% of the space a 
String requires. 
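
The byte counts can be checked with a small sketch that uses only java.io. It assumes that DataOutputStream.writeLong and writeUTF are reasonable stand-ins for LongWritable and UTF8 (a fixed 8-byte long, and a 2-byte length prefix followed by the characters). The class TimestampSizes and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch comparing the serialized size of a timestamp
// written as a type-tagged long versus as a length-prefixed string.
final class TimestampSizes {

    // One byte for the class id, then the 8-byte long.
    static int sizeAsTaggedLong(long value, byte typeId) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(typeId);
            out.writeLong(value);
            out.flush();
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The decimal digits as a string: 2-byte length prefix plus one byte per digit.
    static int sizeAsString(long value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(Long.toString(value));
            out.flush();
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a 13-digit millisecond timestamp such as 1138600000000L, sizeAsTaggedLong gives 9 and sizeAsString gives 15, which matches the 9 / 15 = 60% figure above.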

I guess the main misunderstanding is that I do not write the class-to-id map 
into the stream at any time.
Does that make sense?
 


> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features and would make a lot possible for smaller, specialized search 
> engines.
