[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508818 ]
Doğacan Güney commented on NUTCH-392:
-------------------------------------

> Re: Content versioning - we can use negative int values as version numbers.
> I'm still not sure what is the impact of BLOCK compression on MapFile random access.

Good idea! (Btw, I still believe that BLOCK compression's performance hit is irrelevant for anything but parse_text. That's why I am trying to do the second test: I want to measure how fast random access on parse_text is under different compressions. BLOCK compression will probably not be fast enough for parse_text. But if the impact is minor, it can be used for everything else.)

> Regarding the sizes: parse_text_record size is larger, because for small chunks of data the compression overhead may far outweigh the compression gains.
> Re: the large size of crawl_parse - is this related to your patch? It could be simply related to the fact that there are many outlinks in those pages ...
> Or is crawl_parse using BLOCK compression too?

OK, I understand why parse_text_record is larger, thanks for the explanation. But why is parse_text_block's size so close to parse_text? (Why is content so different from parse_text? BLOCK works wonders on content but does not even give a 10% reduction on parse_text.)

The feed plugin wasn't enabled, so my patch shouldn't matter. Also, crawl_parse is using NONE compression.

> OutputFormat implementations should pass on Progressable
> --------------------------------------------------------
>
>                 Key: NUTCH-392
>                 URL: https://issues.apache.org/jira/browse/NUTCH-392
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Doug Cutting
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-392.patch
>
>
> OutputFormat implementations should pass the Progressable they are passed to underlying SequenceFile implementations. This will keep reduce tasks from timing out when block writes are slow. This issue depends on http://issues.apache.org/jira/browse/HADOOP-636.
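An aside on the size question raised in the comment above (compressing small chunks one record at a time versus compressing them together as a block): the effect can be reproduced outside Hadoop with a rough sketch. This does not use SequenceFile or MapFile at all. It uses java.util.zip.Deflater as a stand-in compressor, the records are synthetic, and the class and method names (CompressionOverheadSketch, sampleRecords, perRecordSize, blockSize) are made up for illustration. It only shows that many tiny records deflated individually can end up larger than the raw data, while the same bytes deflated as one block shrink.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical sketch: not Nutch or Hadoop code.
class CompressionOverheadSketch {

    /** Deflates the whole input and returns the number of compressed bytes. */
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds small synthetic records (assumption: real parse_text entries differ). */
    public static List<byte[]> sampleRecords(int count) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add(("page " + i + " title").getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Total size of the records with no compression. */
    public static int rawSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += record.length;
        }
        return total;
    }

    /** Total size when every record is deflated on its own. */
    public static int perRecordSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += deflatedSize(record);
        }
        return total;
    }

    /** Size when all records are concatenated and deflated as one block. */
    public static int blockSize(List<byte[]> records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] record : records) {
            block.write(record, 0, record.length);
        }
        return deflatedSize(block.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(1000);
        System.out.println("raw bytes:        " + rawSize(records));
        System.out.println("per-record bytes: " + perRecordSize(records));
        System.out.println("one-block bytes:  " + blockSize(records));
    }
}
```

How much a block gains depends on how much redundancy there is across records, so numbers from this synthetic input say nothing about why parse_text and content behave differently in the actual segment.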
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.