[ https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515884 ]
Hadoop QA commented on HADOOP-1468:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12362641/patch.txt applied and successfully tested against trunk revision r559886.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/476/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/476/console

> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
>                 Key: HADOOP-1468
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1468
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>             Fix For: 0.15.0
>
>         Attachments: patch.txt, patch.txt
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> > Hi,
> >
> > I'm noticing that since the HClient/HRegionServer interface only allows
> > for a per-column put(), there is a lot of RPC and some lease management
> > overhead when writing large amounts of data. For example:
> >
> > for (int i = 0; i < 10000; i++) {
> >     Text rowKey = new Text(i+"");
> >     long lock = client.startUpdate(rowKey);
> >     client.put(lock, COL1, rowKey.getBytes());
> >     client.put(lock, COL2, someValue.getBytes());
> >     client.commit(lock);
> > }
> >
> > This code takes my machine (using a single HMaster/HRegionServer on
> > local filesystem) approximately 13 seconds to execute. When I measure
> > the execution time within HRegionServer.put(), I get total time spent in
> > put() < 2 seconds. So it looks like there's definitely overhead in the
> > RPC communication and serialization/deserialization between client and
> > server.
> >
> > Writing 10000 rows takes 10000 x (startUpdate=1 + #cols=2 + commit=1) =
> > 40000 RPC operations.
> >
> > What I'm thinking, and please tell me if I'm wrong or if this is already
> > in the works, is that if I create a row-level put() method that submits
> > a map of column values at once, I would reduce the 2 + (#cols) RPC
> > operations to a single atomic row-write RPC as well as eliminate the
> > small but noticeable overhead in lease creation, renewal, and cancellation.
> >
> > It's not clear exactly what the performance improvement would be. The
> > same amount of serialization/deserialization must occur, but YourKit
> > profiling tells me that the serialization overhead is negligible.
> >
> > Any thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
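
The RPC arithmetic in the quoted message (2 + #cols calls per row versus one call for a row-level put) can be illustrated with a small stand-alone sketch. Everything below is hypothetical: CountingServer, putRow, writePerColumn and writeBatched are made-up names for illustration and are not the HClient/HRegionServer API or the contents of patch.txt. The "server" is an in-memory object that only counts the calls it receives, so the sketch shows call counts, not timings.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these types are NOT the real HBase client or server classes.
class BatchUpdateSketch {

    /** In-memory stand-in for a region server; each method call counts as one RPC. */
    static class CountingServer {
        long rpcCount = 0;
        private long nextLock = 0;
        private final Map<Long, String> lockToRow = new HashMap<>();
        private final Map<Long, Map<String, byte[]>> pending = new HashMap<>();
        final Map<String, Map<String, byte[]>> committed = new HashMap<>();

        // Per-column protocol, modelled on the loop in the quoted message.
        long startUpdate(String row) {
            rpcCount++;
            long lock = nextLock++;
            lockToRow.put(lock, row);
            pending.put(lock, new HashMap<>());
            return lock;
        }

        void put(long lock, String column, byte[] value) {
            rpcCount++;
            pending.get(lock).put(column, value);
        }

        void commit(long lock) {
            rpcCount++;
            String row = lockToRow.remove(lock);
            committed.computeIfAbsent(row, k -> new HashMap<>()).putAll(pending.remove(lock));
        }

        // Assumed row-level call: all column values for one row arrive in a single RPC.
        void putRow(String row, Map<String, byte[]> columns) {
            rpcCount++;
            committed.computeIfAbsent(row, k -> new HashMap<>()).putAll(columns);
        }
    }

    /** Writes two columns per row with startUpdate/put/put/commit; returns the RPC count. */
    static long writePerColumn(CountingServer server, int rows) {
        for (int i = 0; i < rows; i++) {
            String rowKey = Integer.toString(i);
            byte[] value = rowKey.getBytes(StandardCharsets.UTF_8);
            long lock = server.startUpdate(rowKey);
            server.put(lock, "col1:", value);
            server.put(lock, "col2:", value);
            server.commit(lock);
        }
        return server.rpcCount;
    }

    /** Writes the same two columns per row with one putRow call; returns the RPC count. */
    static long writeBatched(CountingServer server, int rows) {
        for (int i = 0; i < rows; i++) {
            String rowKey = Integer.toString(i);
            byte[] value = rowKey.getBytes(StandardCharsets.UTF_8);
            Map<String, byte[]> columns = new HashMap<>();
            columns.put("col1:", value);
            columns.put("col2:", value);
            server.putRow(rowKey, columns);
        }
        return server.rpcCount;
    }

    public static void main(String[] args) {
        System.out.println("per-column RPCs: " + writePerColumn(new CountingServer(), 10000)); // 40000
        System.out.println("batched RPCs:    " + writeBatched(new CountingServer(), 10000));   // 10000
    }
}
```

For 10000 rows with two columns each, the per-column path issues 10000 x 4 = 40000 calls and the batched path issues 10000. How much wall-clock time that saves depends on the real RPC and lease costs, which this sketch does not model.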