[
https://issues.apache.org/jira/browse/HBASE-11927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540905#comment-14540905
]
Apekshit Sharma commented on HBASE-11927:
-----------------------------------------
There were a couple of options: NHL (native hadoop library) and
[Circe|https://github.com/trevorr/circe].
We decided to go with NHL, despite the fact that it introduces a dependency on
Hadoop, because the HFile checksum requires an interface that takes two
streams, data and checksums, and verifies/calculates checksums for fixed-size
chunks of data. NHL already supports this while Circe doesn't. (More
differences in this
[doc|https://docs.google.com/document/d/1NCB3h8YU86mGFjK_uWA7KMDmu288nrCZvwRTr30zX-s/edit].)
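As a rough illustration of the interface shape being described (this is a self-contained sketch, not Hadoop's actual org.apache.hadoop.util.DataChecksum API): one buffer holds the data, a second holds one 4-byte checksum per fixed-size chunk, and verification recomputes each chunk's CRC and compares it against the stored value.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

// Illustrative sketch of a chunked two-stream checksum interface: one data
// stream plus one checksum stream, with a fixed bytesPerChecksum chunk size.
// Class and method names here are hypothetical, chosen only to mirror the
// interface described in the comment.
public class ChunkedChecksum {

    // Compute one 4-byte CRC32C per bytesPerChecksum-sized chunk of data.
    static byte[] computeChunkedSums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        ByteBuffer sums = ByteBuffer.allocate(chunks * 4);
        CRC32C crc = new CRC32C();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums.putInt((int) crc.getValue());
        }
        return sums.array();
    }

    // Verify data against the checksum stream; returns true only when every
    // chunk's recomputed CRC matches the stored one.
    static boolean verifyChunkedSums(byte[] data, byte[] checksums,
                                     int bytesPerChecksum) {
        ByteBuffer sums = ByteBuffer.wrap(checksums);
        CRC32C crc = new CRC32C();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            if ((int) crc.getValue() != sums.getInt()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] sums = computeChunkedSums(data, 512);
        System.out.println(verifyChunkedSums(data, sums, 512)); // true
        data[42] ^= 1;                                          // corrupt one byte
        System.out.println(verifyChunkedSums(data, sums, 512)); // false
    }
}
```

Circe, by contrast, exposes only a plain CRC over a single buffer, which is why it would need extra plumbing to support this chunked layout.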
We switched the default from CRC32 to CRC32C because:
- crc32c has better error detection properties
- crc32c has the advantage of a dedicated instruction on newer Intel processors
(I couldn't profile this case because the machines I used for testing weren't
new enough, i.e. they didn't support
[sse4.2|http://en.wikipedia.org/wiki/SSE4#SSE4.2] instructions)
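For reference, the two polynomials produce different values for the same input; the standard "123456789" test vector gives CBF43926 for CRC-32 and E3069283 for CRC-32C. The sketch below uses java.util.zip.CRC32C, which only arrived later in JDK 9 (the native Hadoop library was the way to get hardware-accelerated CRC32C at the time of this comment); on SSE4.2-capable CPUs, HotSpot intrinsifies the CRC32C update with the dedicated crc32 instruction.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;

public class CrcCompare {
    public static void main(String[] args) {
        byte[] input = "123456789".getBytes(StandardCharsets.US_ASCII);

        CRC32 crc32 = new CRC32();     // older default (zlib polynomial)
        crc32.update(input);

        CRC32C crc32c = new CRC32C();  // Castagnoli polynomial (JDK 9+)
        crc32c.update(input);

        System.out.printf("CRC32  = %08X%n", crc32.getValue());  // CBF43926
        System.out.printf("CRC32C = %08X%n", crc32c.getValue()); // E3069283
    }
}
```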
Profiling was done using lightweight-java-profiler.
> Use Native Hadoop Library for HFile checksum
> --------------------------------------------
>
> Key: HBASE-11927
> URL: https://issues.apache.org/jira/browse/HBASE-11927
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: Apekshit Sharma
> Attachments: HBASE-11927-v1.patch, HBASE-11927.patch, c2021.crc2.svg,
> c2021.write.2.svg, c2021.zip.svg, compact-with-native.svg,
> compact-without-native.svg, crc32ct.svg
>
>
> Up in hadoop they have this change. Let me publish some graphs to show that
> it makes a difference (CRC is a massive amount of our CPU usage in my
> profiling of an upload because of compacting, flushing, etc.). We should
> also make use of native CRCings -- especially the 2.6 HDFS-6865 and ilk -- in
> hbase but that is another issue for now.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)