[ https://issues.apache.org/jira/browse/HBASE-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-2478.
-----------------------------------

    Resolution: Not a Problem
      Assignee:     (was: Hairong Kuang)

Didn't happen

> Experiment with alternate settings for io.bytes.per.checksum for HFiles
> -----------------------------------------------------------------------
>
>                 Key: HBASE-2478
>                 URL: https://issues.apache.org/jira/browse/HBASE-2478
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance
>            Reporter: Kannan Muthukkaruppan
>
> HDFS keeps a separate "checksum" file for every block. By default,
> io.bytes.per.checksum is set to 512 and each checksum is 4 bytes, i.e.,
> for every 512 bytes of data in the block we maintain a 4-byte checksum. For
> 4TB of data, for instance, that's about 31GB of checksum data.
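> (A quick back-of-the-envelope check of that figure, in decimal units: 4 TB /
> 512 bytes per checksum chunk is roughly 7.8 billion chunks, and 7.8 billion
> chunks x 4 bytes per checksum is roughly 31.25 GB of checksum data.)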
> A read that needs to read a small section (such as a 64k HFile block) from an
> HDFS block, especially on a cold access, is likely to end up doing two random
> disk reads: one from the data file for the block and one from the checksum
> file.
> A thought was that, instead of keeping a checksum for every 512 bytes, given
> that HBase interacts with HDFS on reads at the granularity of the HBase block
> size (typically 64k, but smaller if compressed), should we consider keeping
> checksums at a coarser granularity (e.g., every 8k bytes) for HFiles? The
> advantage would be that the checksum files would be much smaller (in
> proportion to the data) and the hot working set for "checksum data" should
> fit better in the OS buffer cache (thus eliminating a good majority of the
> disk seeks for checksum data).
> The intent of this JIRA is to experiment with different settings of
> "io.bytes.per.checksum" for HFiles.
> Note: For the previous example of 4TB of data, with an io.bytes.per.checksum
> setting of 8k, the size of the checksum data would drop to about 2GB.
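> As a minimal sketch of the kind of experiment intended, assuming the
> client-side Hadoop Configuration is the knob (the HDFS client picks up
> io.bytes.per.checksum from its Configuration when creating files); the class
> name, path, and write size below are illustrative only:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class ChecksumGranularityExperiment {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         // Try a coarser checksum chunk (8k) instead of the 512-byte default.
>         conf.setInt("io.bytes.per.checksum", 8192);
>         FileSystem fs = FileSystem.get(conf);
>         // Files created through this client should pick up the setting.
>         Path p = new Path("/hbase/experiment/hfile-8k-checksums");
>         FSDataOutputStream out = fs.create(p);
>         out.write(new byte[64 * 1024]);  // one HBase-block-sized write
>         out.close();
>         fs.close();
>       }
>     }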
> Making io.bytes.per.checksum too big might reduce the effectiveness of the
> checksum, so that trade-off also needs to be taken into account when
> determining a good value.
> [For HLog files, on the other hand, I suspect we would want to leave the
> checksum at a finer granularity, because my understanding is that if we are
> doing lots of small writes/syncs (as we do to HLogs), finer-grained checksums
> are better (the code currently doesn't do a rolling checksum, and needs to
> rewind to the nearest checksum block boundary and recompute the checksum on
> every edit).]



--
This message was sent by Atlassian JIRA
(v6.2#6252)
