[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364791 ]
Andrzej Bialecki commented on NUTCH-192:
-----------------------------------------
We could take a middle ground - write out only the non-standard parts of the
dictionary. In the vast majority of cases this is equivalent to not writing the
dictionary at all, and in the rare cases that need it we keep the flexibility.
First we would need to encode the standard dictionary inside WritableName (I
think it's a better place than MapWritable), but using a separate API so that
it's clear you cannot extend it accidentally just by calling setName. I.e.
something like this, in WritableName.<clinit>:
WritableName.setName(NullWritable.class, "null");
WritableName.setID(NullWritable.class, 0);
WritableName.setName(LongWritable.class, "long");
WritableName.setID(LongWritable.class, 1);
...
* in WritableName.setID, complain loudly if a call would overwrite an already
existing ID.
* then in MapWritable, use these 1-byte "standard IDs" as before. However:
* inside write(), first check that all types the MapWritable uses for keys
and values are registered in WritableName. For any non-registered types, create
a private additional dictionary, with IDs starting above the highest
"standard ID". We have to write this dictionary to the output if it is not
empty.
* then inside write(), write out all keys and values as before, using both the
standard IDs and the non-standard ones from the private dictionary.
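
To make the steps above concrete, here is a rough, self-contained sketch. It is
not the real WritableName or MapWritable code: the class StandardIdSketch, the
method privateDictionary and the on-disk layout (an entry count followed by
ID/class-name pairs, with a zero count when the dictionary is empty) are all
assumptions made for illustration, and Void and Long merely stand in for
NullWritable and LongWritable so that the example has no Hadoop dependency.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposal; none of this is the real Hadoop or Nutch API.
final class StandardIdSketch {
    // Standard dictionary: type -> ID, filled once in the static initializer.
    private static final Map<Class<?>, Integer> STANDARD = new LinkedHashMap<>();
    private static int highestStandardId = -1;

    static {
        setID(Void.class, 0);  // stands in for NullWritable
        setID(Long.class, 1);  // stands in for LongWritable
    }

    // Registers a standard ID and complains loudly if the ID is already taken.
    static synchronized void setID(Class<?> type, int id) {
        if (id < 0 || id > 255) {
            throw new IllegalArgumentException("ID must fit in one byte: " + id);
        }
        if (STANDARD.containsValue(id)) {
            throw new IllegalStateException("ID " + id + " is already registered");
        }
        STANDARD.put(type, id);
        highestStandardId = Math.max(highestStandardId, id);
    }

    // Assigns private IDs, starting above the highest standard ID,
    // to every used type that is not in the standard dictionary.
    static synchronized Map<Class<?>, Integer> privateDictionary(List<Class<?>> usedTypes) {
        Map<Class<?>, Integer> extra = new LinkedHashMap<>();
        int next = highestStandardId + 1;
        for (Class<?> type : usedTypes) {
            if (!STANDARD.containsKey(type) && !extra.containsKey(type)) {
                extra.put(type, next++);
            }
        }
        return extra;
    }

    // Writes the private dictionary: an entry count, then (ID, class name) pairs.
    // Writing a zero count for an empty dictionary is an assumption of this sketch.
    static void writeDictionary(DataOutputStream out, Map<Class<?>, Integer> extra)
            throws IOException {
        out.writeByte(extra.size());
        for (Map.Entry<Class<?>, Integer> e : extra.entrySet()) {
            out.writeByte(e.getValue());
            out.writeUTF(e.getKey().getName());
        }
    }

    public static void main(String[] args) throws IOException {
        // Types used as keys and values; String is the only non-standard one.
        List<Class<?>> used = Arrays.asList(Long.class, String.class, Void.class, String.class);
        Map<Class<?>, Integer> extra = privateDictionary(used);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeDictionary(new DataOutputStream(bytes), extra);
        System.out.println(extra + " -> " + bytes.size() + " bytes");
    }
}
```

With only standard types in the map, the dictionary part of the output shrinks
to the single count byte in this sketch; the keys and values themselves would
then be written as before, each prefixed by its standard or private ID.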
> meta data support for CrawlDatum
> --------------------------------
>
> Key: NUTCH-192
> URL: http://issues.apache.org/jira/browse/NUTCH-192
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
> Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new Nutch
> features realized and would make a lot possible for smaller, specially
> focused search engines.