[ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Stefan Groschupf updated NUTCH-192:
-----------------------------------

    Attachment: metadata060206.patch

Doug, did you mean something like this?
Writing 1 million maps (each with one tuple [int key, long value]) into a 
sequence file that uses an int key takes around 5400 ms on my box.
Writing 1 million int keys with UTF8 values into a sequence file took pretty 
much the same time. 
However, reading UTF8 takes only 60% of the time I need to read the map. This 
may be because UTF8 just reads a byte array and converts it to a string only 
when toString is called. If I call toString in my test, then reading UTF8 is 
slower than reading the map. 
So another possible improvement could be to read just a byte array into the map 
and parse this byte array only when the first get method is called. 
This can save some time when processing CrawlDatum in situations where we do 
not need to access the meta data at all.  
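
A minimal sketch of the lazy-parsing idea, in plain Java with no Nutch or 
Hadoop classes. The class name LazyMetaData, its methods, and the 
length-prefixed byte layout are assumptions made for illustration; this is not 
what the attached patch does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a metadata map that keeps raw bytes until first access. */
public class LazyMetaData {
    private byte[] raw;                 // serialized entries, exactly as read
    private Map<Integer, Long> parsed;  // stays null until the first get()

    /** Reads a length-prefixed byte array without decoding any entries. */
    public void readFields(DataInputStream in) throws IOException {
        int len = in.readInt();
        raw = new byte[len];
        in.readFully(raw);
        parsed = null;                  // drop any previously decoded state
    }

    /** True once the raw bytes have been decoded into the map. */
    public boolean isParsed() {
        return parsed != null;
    }

    /** Decodes the raw bytes on first use, then looks up the key. */
    public Long get(int key) throws IOException {
        if (parsed == null) {
            parse();
        }
        return parsed.get(key);
    }

    private void parse() throws IOException {
        parsed = new HashMap<>();
        if (raw == null) {
            return;                     // nothing was read yet
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            int k = in.readInt();
            long v = in.readLong();
            parsed.put(k, v);
        }
    }

    /** Writes the assumed layout: [int length][int count][(int key, long value)...]. */
    public static byte[] serialize(Map<Integer, Long> map) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(body);
        out.writeInt(map.size());
        for (Map.Entry<Integer, Long> e : map.entrySet()) {
            out.writeInt(e.getKey());
            out.writeLong(e.getValue());
        }
        out.flush();

        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        DataOutputStream frameOut = new DataOutputStream(framed);
        frameOut.writeInt(body.size());
        frameOut.write(body.toByteArray());
        frameOut.flush();
        return framed.toByteArray();
    }
}
```

With this layout, a reader that never calls get() pays only for one readInt and 
one readFully per record, which is roughly the cost profile UTF8 has when 
toString is never called.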
However, reading and writing 10 million maps with one key-value tuple each can 
be done in less than a minute on my desktop box.  

> meta data support for CrawlDatum
> --------------------------------
>
>          Key: NUTCH-192
>          URL: http://issues.apache.org/jira/browse/NUTCH-192
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>      Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to realize a set of new Nutch 
> features and would make a lot possible for smaller, specially focused search 
> engines.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
