[ https://issues.apache.org/jira/browse/HBASE-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217094#comment-13217094 ]
Phabricator commented on HBASE-5074: ------------------------------------ dhruba has commented on the revision "[jira] [HBASE-5074] Support checksums in HBase block cache". INLINE COMMENTS src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:451-452 I think it is better to not add another 4 bytes to the HFileBlock (increases heapSize), instead just compute it when needed, especially since this method is used only for debugging. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:529-530 shall we avoid increasing the HeapSize vs computing headerSize? It should be really cheap to compute headerSize(), especially since it is likely to be inlined. src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1636 I think we should always print this. This follows the precedence in other parts of the HBase code. And this code path is the exception and not the norm src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:1642-1644 I am pretty sure that it is better to construct this message only if there is a checksum mismatch. src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:3610-3612 The secret is to pass in a HFileSystem to HRegion.newHRegion(). This HFileSystem is extracted from the RegionServerServices, if it is not-null. Otherwise, a default file system object is created and passed into HRegion.newHRegion src/main/java/org/apache/hadoop/hbase/util/ChecksumType.java:57-60 getName() is better because it allows annotating the name differently from what Java does vi toString (especially if we add new crc algorithms in the future) src/main/java/org/apache/hadoop/hbase/util/ChecksumType.java:143-144 I would like to keep getName() because it allows us to not change the API if we decide to override java's toString convention, especially if we add new checksum algorithms in the future. (Similar to why there are two separate methods Enum.name and Enum.toString) src/main/java/org/apache/hadoop/hbase/util/ChecksumType.java:179 That's right. But the existence of this API allows us to do own own names in the future. (Also, when there are only two or three values, this might be better than looking into a map) src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockCompatibility.java:1 I am not planning to change that, this code is what was there in HFileBlock, so it is good to carry it over in a unit test to be able to generate files in the older format. This is used by unit tests alone. JUst replacing it with a pre-created file(s) is not very cool, especially because the pre-created file(s) will test only that file whereas if we keep this code here, we can write more and more unit tests in the future that can generate different files in the older format and test backward compatibility. REVISION DETAIL https://reviews.facebook.net/D1521 > support checksums in HBase block cache > -------------------------------------- > > Key: HBASE-5074 > URL: https://issues.apache.org/jira/browse/HBASE-5074 > Project: HBase > Issue Type: Improvement > Components: regionserver > Reporter: dhruba borthakur > Assignee: dhruba borthakur > Attachments: D1521.1.patch, D1521.1.patch, D1521.10.patch, > D1521.10.patch, D1521.2.patch, D1521.2.patch, D1521.3.patch, D1521.3.patch, > D1521.4.patch, D1521.4.patch, D1521.5.patch, D1521.5.patch, D1521.6.patch, > D1521.6.patch, D1521.7.patch, D1521.7.patch, D1521.8.patch, D1521.8.patch, > D1521.9.patch, D1521.9.patch > > > The current implementation of HDFS stores the data in one block file and the > metadata(checksum) in another block file. This means that every read into the > HBase block cache actually consumes two disk iops, one to the datafile and > one to the checksum file. This is a major problem for scaling HBase, because > HBase is usually bottlenecked on the number of random disk iops that the > storage-hardware offers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira