[ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema updated NUTCH-1340:
--------------------------------

    Attachment: NUTCH-1340-v2.txt

v2 of patch, including javadoc. This patch increases performance, but when updating huge crawls it can still be a bit troublesome to process the huge number of deletes. However, this is something that needs to be solved in Gora.

Committed! Thanks Lewis.

> Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
> -------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1340
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1340
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt
>
>
> After applying GORA-120 (which is already a huge performance boost by
> itself), one of the major bottlenecks of the DbUpdaterReducer is the
> deletion of the markers. The update reducer simply sets every row to delete
> its markers. A lot of rows do not actually have the markers, but the deletes
> are fired away in any case. Because the markers are already always on the
> input, a simple check to see if they exist greatly improves performance.
> In particular it is very expensive in HBase, because every single Delete
> immediately triggers a connection to the regionservers. (They ignore the
> "autoflush=false" directive.) Although deletes can be done in batch, this is
> currently not supported by Gora. For one, it is very difficult to implement
> in the current HBaseStore with regard to multithreading, and secondly I
> noticed performance did not increase significantly.
> By performance debugging on a real-life cluster, this currently seems to be
> the biggest bottleneck of the DbUpdaterReducer. (Remember, only after
> applying GORA-120.)
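
For readers who want to see the idea behind the change in isolation, here is a minimal sketch of "only delete a marker when it exists on the input". It is not the attached patch and does not use the Nutch, Gora or HBase APIs: `Row`, `RecordingStore`, `removeAlways`, `removeIfExists` and the marker names `generate`, `fetch` and `parse` are all hypothetical stand-ins chosen for illustration. The store only records the deletes it is asked to issue, so the two strategies can be compared by count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every type and name here is hypothetical,
// not the Nutch WebPage, Gora DataStore or HBase client API.
class MarkerCleanupSketch {

    /** Hypothetical row: a key plus the markers that were present on the reducer input. */
    static final class Row {
        final String key;
        final Map<String, String> markers = new HashMap<>();

        Row(String key) {
            this.key = key;
        }
    }

    /** Hypothetical store that records each delete instead of contacting a server. */
    static final class RecordingStore {
        final List<String> deletes = new ArrayList<>();

        void deleteMarker(Row row, String marker) {
            deletes.add(row.key + "/" + marker);
        }
    }

    // Assumed marker names, for illustration only.
    static final String[] MARKERS = {"generate", "fetch", "parse"};

    /** Behaviour described as the bottleneck: one delete per marker per row, present or not. */
    static void removeAlways(Row row, RecordingStore store) {
        for (String marker : MARKERS) {
            row.markers.remove(marker);
            store.deleteMarker(row, marker);
        }
    }

    /** Behaviour described as the fix: issue a delete only if the marker was on the input. */
    static void removeIfExists(Row row, RecordingStore store) {
        for (String marker : MARKERS) {
            // Map.remove returns the previous value, or null when the key was absent.
            if (row.markers.remove(marker) != null) {
                store.deleteMarker(row, marker);
            }
        }
    }

    /** Three rows, of which only one carries a marker. */
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("com.example/a"));
        Row fetched = new Row("com.example/b");
        fetched.markers.put("fetch", "batch-1");
        rows.add(fetched);
        rows.add(new Row("com.example/c"));
        return rows;
    }

    public static void main(String[] args) {
        RecordingStore always = new RecordingStore();
        for (Row row : sampleRows()) {
            removeAlways(row, always);
        }
        RecordingStore checked = new RecordingStore();
        for (Row row : sampleRows()) {
            removeIfExists(row, checked);
        }
        System.out.println("deletes without check: " + always.deletes.size());
        System.out.println("deletes with check:    " + checked.deletes.size());
    }
}
```

In this toy input the unconditional version issues a delete for every marker on every row, while the checked version issues one only for the single marker that is actually present. How much that saves on a real cluster depends on how many rows carry markers.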