[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508861 ]
Doğacan Güney commented on NUTCH-392:
-------------------------------------

After changing ParseText to not do any internal compression, the segment directory looks like this:

828M  crawl/segments/20070626163143/content
 35M  crawl/segments/20070626163143/crawl_fetch
 23M  crawl/segments/20070626163143/crawl_generate
 44M  crawl/segments/20070626163143/crawl_parse  # BLOCK compression
218M  crawl/segments/20070626163143/parse_data
524M  crawl/segments/20070626163143/parse_text
192M  crawl/segments/20070626163143/parse_text_block
242M  crawl/segments/20070626163143/parse_text_record

As you can see, parse_text_block is around 20% smaller than parse_text_record.

I also wrote a simple benchmark that randomly requests n urls from each parse text sequentially (it requests the same urls in the same order from all parse texts); a rough sketch of the lookup loop appears at the end of this message. All parse texts contain a single part with ~250K urls. Here are the results (Trial 0 is NONE, Trial 1 is RECORD, Trial 2 is BLOCK):

for n = 1000:
Trial 0 has taken 9947 ms.
Trial 1 has taken 6794 ms.
Trial 2 has taken 9717 ms.

for n = 5000:
Trial 0 has taken 40918 ms.
Trial 1 has taken 19969 ms.
Trial 2 has taken 52622 ms.

for n = 10000:
Trial 0 has taken 57622 ms.
Trial 1 has taken 24291 ms.
Trial 2 has taken 96292 ms.

Overall, RECORD compression is the fastest and BLOCK compression is the slowest (by a large margin). Assuming my benchmark code is correct (feel free to show me where it is wrong), these are my conclusions:

* I don't know what others think, but to me it still looks like we can use BLOCK compression for structures like content, linkdb, etc. Even though it is much slower than RECORD, it can still serve ~100 parse texts per second. While this is certainly not good enough for parse text, it is probably good enough for the others.

* We should definitely enable RECORD compression for parse text and BLOCK compression for crawl_* (a writer-side sketch also appears at the end of this message). For some reason, RECORD compression scales better than O(n) for parse text, which makes me think that something is wrong with my benchmark code.

* Nutch should not do any compression internally. Hadoop can do this better with its native compression. Content and ParseText currently compress their data on their own (and they can be converted to Hadoop's compression in a backward-compatible way). I don't know if anything else does compression.

PS: The native hadoop library is loaded. I haven't specified which compression codec to use, so I guess it uses zlib. Lzo results would probably have been better.

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-392.patch, ParseTextBenchmark.java
>
>
> OutputFormat implementations should pass the Progressable they are passed to
> underlying SequenceFile implementations. This will keep reduce tasks from
> timing out when block writes are slow. This issue depends on
> http://issues.apache.org/jira/browse/HADOOP-636.
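
Sketch of the random-lookup loop referenced above. This is not the attached ParseTextBenchmark.java, just a minimal illustration that assumes each parse_text part is a Nutch MapFile keyed by url (Text) with ParseText values; the class and method names are placeholders.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

/** Illustrative only; not the attached ParseTextBenchmark.java. */
public class ParseTextLookupTimer {

  /**
   * Times random lookups against a single parse_text part (a MapFile keyed
   * by url, with ParseText values). The same pre-built url sample is meant
   * to be replayed against the NONE, RECORD and BLOCK variants.
   */
  public static long timeLookups(Configuration conf, String partDir,
                                 List<String> urls) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, partDir, conf);
    Text key = new Text();
    ParseText value = new ParseText();
    long start = System.currentTimeMillis();
    for (String url : urls) {
      key.set(url);
      reader.get(key, value);  // random access: a block-compressed file has to
                               // decompress a whole block to serve one record
    }
    long elapsed = System.currentTimeMillis() - start;
    reader.close();
    return elapsed;
  }
}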
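
Writer-side sketch of what "let Hadoop do the compression" could look like for parse_text, i.e. passing a SequenceFile.CompressionType to the MapFile writer instead of ParseText compressing its own bytes. Again an illustration only: the path and class names here are hypothetical, not actual Nutch output-format code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.ParseText;

/** Illustrative only; paths and class names are placeholders. */
public class ParseTextWriterSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // RECORD compression keeps random lookups cheap for parse_text ...
    MapFile.Writer writer = new MapFile.Writer(conf, fs, "parse_text/part-00000",
        Text.class, ParseText.class, CompressionType.RECORD);
    // ... real code would append (url, ParseText) pairs here, in sorted key order.
    writer.close();

    // Sequentially-read outputs such as crawl_* could instead be written with
    // CompressionType.BLOCK, e.g. via SequenceFile.createWriter(fs, conf, path,
    // keyClass, valClass, CompressionType.BLOCK).
  }
}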