Increase scalability by only removing markers when they actually exist for 
DbUpdaterReducer
-------------------------------------------------------------------------------------------

                 Key: NUTCH-1340
                 URL: https://issues.apache.org/jira/browse/NUTCH-1340
             Project: Nutch
          Issue Type: Improvement
            Reporter: Ferdy Galema
             Fix For: nutchgora


After applying GORA-120 (which is already a huge performance boost by itself), 
one of the major bottlenecks of the DbUpdaterReducer is the deletion of the 
markers. The update reducer simply sets every row to delete its markers. Many 
rows do not actually have the markers, but the deletes are fired off in any 
case. Because the markers are always already present on the input, a simple 
check to see whether they exist greatly improves performance.
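
The idea can be sketched roughly as follows. This is only an illustration and 
not the actual patch: Row, CountingStore, MarkerCleanup and the marker names 
are hypothetical stand-ins, not Nutch or Gora classes. The store simply counts 
how many deletes it receives, so the two strategies can be compared.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a stored row; not the Nutch WebPage class.
class Row {
    final Map<String, String> markers = new HashMap<>();
}

// Hypothetical stand-in for the data store; it counts the deletes it receives.
class CountingStore {
    int deletesIssued = 0;

    void deleteMarker(Row row, String name) {
        deletesIssued++;            // with HBase, each delete would be a round trip
        row.markers.remove(name);
    }
}

class MarkerCleanup {
    // Marker names are placeholders chosen for this example.
    static final List<String> MARKERS = Arrays.asList("gen", "fetch", "parse");

    // Current behaviour as described above: delete every marker on every row.
    static void removeAlways(CountingStore store, Row row) {
        for (String name : MARKERS) {
            store.deleteMarker(row, name);
        }
    }

    // Proposed behaviour: delete only the markers present on the input row.
    static void removeIfExist(CountingStore store, Row row) {
        for (String name : MARKERS) {
            if (row.markers.containsKey(name)) {
                store.deleteMarker(row, name);
            }
        }
    }

    public static void main(String[] args) {
        Row withMarker = new Row();
        withMarker.markers.put("fetch", "1");
        Row withoutMarkers = new Row();

        CountingStore store = new CountingStore();
        removeIfExist(store, withMarker);
        removeIfExist(store, withoutMarkers);
        System.out.println("deletes issued with check: " + store.deletesIssued);
    }
}
```

The check itself costs nothing extra in this sketch because the markers are 
read from the row that is already in memory; only rows that really carry a 
marker cause a delete to be sent.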

This is particularly expensive in HBase, because every single Delete 
immediately triggers a connection to the regionservers. (Deletes ignore the 
"autoflush=false" directive.) Although deletes can be done in batch, this is 
currently not supported by Gora. For one, it is very difficult to implement in 
the current HBaseStore with regard to multithreading; secondly, I noticed that 
performance did not increase significantly.

Performance debugging on a real-life cluster indicates that this is currently 
the biggest bottleneck of the DbUpdaterReducer. (Remember: only after applying 
GORA-120.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
