[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508861 ]

Doğacan Güney commented on NUTCH-392:
-------------------------------------

After changing ParseText to not do any internal compression, the segment 
directory looks like this:

828M    crawl/segments/20070626163143/content
35M     crawl/segments/20070626163143/crawl_fetch
23M     crawl/segments/20070626163143/crawl_generate
44M     crawl/segments/20070626163143/crawl_parse # BLOCK compression
218M    crawl/segments/20070626163143/parse_data
524M    crawl/segments/20070626163143/parse_text
192M    crawl/segments/20070626163143/parse_text_block
242M    crawl/segments/20070626163143/parse_text_record

As you can see, parse_text_block is around 20% smaller than parse_text_record.
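
For reference, here is a rough sketch (an illustration, not the actual patch) of 
what a ParseText without internal compression could look like: the text is 
written as a plain string and the surrounding SequenceFile/MapFile does the 
compression. The VERSION value and the exact backward-compatibility handling 
for old, internally-compressed records are assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class ParseText implements Writable {
  private static final byte VERSION = 2; // assumed version bump for the uncompressed format
  private String text = "";

  public ParseText() {}
  public ParseText(String text) { this.text = text; }

  public void write(DataOutput out) throws IOException {
    out.writeByte(VERSION);
    Text.writeString(out, text);          // plain string; compression is left to Hadoop
  }

  public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version == 1) {
      // assumed: old records compressed the text themselves
      text = WritableUtils.readCompressedString(in);
    } else {
      text = Text.readString(in);
    }
  }

  public String getText() { return text; }
}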

I also wrote a simple benchmark that randomly requests n urls from each parse 
text sequentially (it requests the same urls in the same order from all parse 
texts). Each parse text contains a single part with ~250K urls. Here are the 
results (Trial 0 is NONE, Trial 1 is RECORD, Trial 2 is BLOCK):

for n = 1000:
Trial 0 has taken 9947 ms.
Trial 1 has taken 6794 ms.
Trial 2 has taken 9717 ms.

for n = 5000:
Trial 0 has taken 40918 ms.
Trial 1 has taken 19969 ms.
Trial 2 has taken 52622 ms.

for n = 10000:
Trial 0 has taken 57622 ms.
Trial 1 has taken 24291 ms.
Trial 2 has taken 96292 ms.

Overall RECORD compression is the fastest and BLOCK compression is the slowest 
(by a large margin).
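
For completeness, this is roughly what the benchmark does (the real code is in 
the attached ParseTextBenchmark.java; class, argument and variable names below 
are illustrative, not the attachment itself): it samples n urls from one 
parse_text, then looks up the same keys, in the same order, against each 
variant and times the lookups.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

public class ParseTextBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int n = Integer.parseInt(args[0]);
    // args[1..3]: parse_text dirs written with NONE, RECORD and BLOCK compression
    Path[] dirs = { new Path(args[1]), new Path(args[2]), new Path(args[3]) };

    // Sample n urls from the first parse_text; the same keys, in the same
    // order, are then requested from every variant.
    List<Text> keys = new ArrayList<Text>();
    MapFile.Reader sampler =
        new MapFile.Reader(fs, new Path(dirs[0], "part-00000").toString(), conf);
    Text url = new Text();
    ParseText value = new ParseText();
    while (sampler.next(url, value)) keys.add(new Text(url));
    sampler.close();
    Collections.shuffle(keys);
    keys = keys.subList(0, Math.min(n, keys.size()));

    for (int trial = 0; trial < dirs.length; trial++) {
      MapFile.Reader reader =
          new MapFile.Reader(fs, new Path(dirs[trial], "part-00000").toString(), conf);
      long start = System.currentTimeMillis();
      for (Text key : keys) {
        reader.get(key, value);           // random lookup by url
      }
      System.out.println("Trial " + trial + " has taken "
          + (System.currentTimeMillis() - start) + " ms.");
      reader.close();
    }
  }
}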

Assuming my benchmark code is correct (feel free to show me where it is wrong), 
these are my conclusions:

* I don't know what others think, but to me it still looks like we can use 
BLOCK compression for structures like content, linkdb, etc. Even though it is 
much slower than RECORD, it can still serve ~100 parse texts per second. While 
this is certainly not good enough for parse text, it probably is good enough 
for the others.

* We should definitely enable RECORD compression for parse text and BLOCK 
compression for crawl_* (see the sketch after this list). For some reason, 
RECORD compression on parse text scales better than linearly in n (which makes 
me think that something is wrong with my benchmark code).

* Nutch should not do any compression internally; Hadoop can do this better 
with its native compression. Content and ParseText currently compress their 
data on their own (and they can be converted to Hadoop's compression in a 
backward-compatible way). I don't know if anything else does compression.
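
To illustrate the last two points, here is a minimal sketch (assumptions, not 
the actual Nutch patch) of what letting Hadoop handle compression could look 
like in something like ParseOutputFormat: the compression type is chosen on the 
MapFile/SequenceFile writer rather than inside the Writable, and the 
Progressable from this issue is passed through. conf, fs, out and progress are 
assumed to come from the surrounding getRecordWriter():

// parse_text: RECORD compression, per the benchmark above
MapFile.Writer textOut =
    new MapFile.Writer(conf, fs, new Path(out, ParseText.DIR_NAME).toString(),
                       Text.class, ParseText.class,
                       SequenceFile.CompressionType.RECORD, progress);

// crawl_parse: BLOCK compression
SequenceFile.Writer crawlOut =
    SequenceFile.createWriter(fs, conf, new Path(out, CrawlDatum.PARSE_DIR_NAME),
                              Text.class, CrawlDatum.class,
                              SequenceFile.CompressionType.BLOCK, progress);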

PS: The native Hadoop library is loaded. I haven't specified which compression 
codec to use, so I guess it uses zlib. LZO results would probably have been 
better.
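
In case someone wants to try it, the codec can also be picked explicitly on the 
writer instead of falling back to the zlib-backed DefaultCodec. This is only an 
illustration (LzoCodec ships with Hadoop and needs the native library; fs, 
conf, path and progress are assumed from the surrounding code):

CompressionCodec codec =
    (CompressionCodec) ReflectionUtils.newInstance(LzoCodec.class, conf);
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, path, Text.class, ParseText.class,
                              SequenceFile.CompressionType.RECORD, codec, progress);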

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki 
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-392.patch, ParseTextBenchmark.java
>
>
> OutputFormat implementations should pass the Progressable they are passed to 
> underlying SequenceFile implementations.  This will keep reduce tasks from 
> timing out when block writes are slow.  This issue depends on 
> http://issues.apache.org/jira/browse/HADOOP-636.
