[
https://issues.apache.org/jira/browse/HDFS-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13534501#comment-13534501
]
Todd Lipcon commented on HDFS-3429:
-----------------------------------
Hi Liang. In order to see a better improvement from this patch, you'd need a
dataset per node which is on the order of 100x bigger than the available buffer
cache -- i.e., so that the checksums themselves do not fit in cache. Talking with
folks at Facebook, where they have a similar improvement in place, they saw a
~30-40% improvement in random read performance due to a similar reduction in
IOPS. I believe they have TBs of data per node in this cluster.
> DataNode reads checksums even if client does not need them
> ----------------------------------------------------------
>
> Key: HDFS-3429
> URL: https://issues.apache.org/jira/browse/HDFS-3429
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Affects Versions: 2.0.0-alpha
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-3429-0.20.2.patch, hdfs-3429-0.20.2.patch,
> hdfs-3429.txt, hdfs-3429.txt, hdfs-3429.txt
>
>
> Currently, even if the client does not want to verify checksums, the datanode
> reads them anyway and sends them over the wire. This means that performance
> improvements like HBase's application-level checksums don't have much benefit
> when reading through the datanode, since the DN is still causing seeks into
> the checksum file.
> (Credit goes to Dhruba for discovering this - filing on his behalf)
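The gist of the fix can be sketched as follows. This is an illustrative sketch only, not the actual HDFS-3429 patch: the class and field names (`BlockSenderSketch`, `metaIn`, the checksum-type constants) are hypothetical stand-ins for the DataNode's real block-sender code. The idea is that when the client opts out of checksum verification, the sender never opens the on-disk checksum (`.meta`) file at all, avoiding the extra seek per read, and advertises a "null" checksum type so the client knows no checksum data will follow.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical sketch of the idea behind HDFS-3429; names are illustrative.
public class BlockSenderSketch {
    static final byte CHECKSUM_NULL = 0;   // placeholder: "no checksums sent"
    static final byte CHECKSUM_CRC32 = 1;  // placeholder: normal CRC32 path

    final InputStream blockIn;  // block data file
    final InputStream metaIn;   // checksum (.meta) file, or null if skipped
    final byte checksumType;

    BlockSenderSketch(InputStream blockIn, InputStream metaFile,
                      boolean clientWantsChecksums) {
        this.blockIn = blockIn;
        if (clientWantsChecksums) {
            // Old behavior for everyone: open the meta file, which costs an
            // extra seek into a second file on every read.
            this.metaIn = metaFile;
            this.checksumType = CHECKSUM_CRC32;
        } else {
            // Fixed behavior: never touch the meta file when the client
            // (e.g. HBase with application-level checksums) opted out.
            this.metaIn = null;
            this.checksumType = CHECKSUM_NULL;
        }
    }

    boolean readsChecksums() {
        return metaIn != null;
    }

    public static void main(String[] args) {
        InputStream data = new ByteArrayInputStream(new byte[]{1, 2, 3});
        InputStream meta = new ByteArrayInputStream(new byte[]{0, 0, 0, 0});

        BlockSenderSketch verifying = new BlockSenderSketch(data, meta, true);
        BlockSenderSketch skipping  = new BlockSenderSketch(data, meta, false);

        System.out.println(verifying.readsChecksums());
        System.out.println(skipping.readsChecksums());
    }
}
```

With a dataset large enough that the checksum files fall out of the buffer cache, skipping `metaIn` entirely is what turns each random read from two disk seeks into one.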
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira