[
https://issues.apache.org/jira/browse/HDFS-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086541#comment-13086541
]
Aaron T. Myers commented on HDFS-2269:
--------------------------------------
Hey Dave, should this JIRA perhaps be moved to Common instead of HDFS?
> Need for Integrity Validation of RPC
> ------------------------------------
>
> Key: HDFS-2269
> URL: https://issues.apache.org/jira/browse/HDFS-2269
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node, name-node
> Reporter: Dave Thompson
>
> Some recent investigation of network packet corruption has shown a need for
> Hadoop RPC integrity validation beyond the assurances already provided by
> the 802.3 link-layer CRC and the 16-bit TCP checksum.
> During an unusual occurrence on a 4k-node cluster, we've seen as many as 4
> TCP anomalies per second on a single node, sustained over an hour (about 14k
> per hour). A TCP anomaly here is a corrupted packet that escaped link-layer
> detection and resulted in a TCP checksum failure, an out-of-sequence TCP
> packet, or a TCP packet size error.
> According to this paper[*]: http://tinyurl.com/3aue72r
> TCP's 16-bit checksum has an effective miss rate of about 1 in 2^10: roughly
> 1 in 1024 errors may escape detection. In fact, what originally alerted us
> to this issue was seeing failures due to bit errors in Hadoop traffic.
> Extrapolating from that paper, one might expect about 14 escaped packet
> errors per hour for that single node of a 4k-node cluster. While the above
> error rate was unusually high due to a broadband aggregate switch issue, the
> lack of an integrity check on Hadoop RPC makes it difficult to discover
> corruption and to limit any potential data damage caused by acting on a
> corrupt RPC message.
> ------
> [*] In case this JIRA outlives that tinyurl, the IEEE paper cited is:
> "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, Michael
> Greenwald, Craig Partridge, Jim Hughes.
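
The extrapolation in the quoted description can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and it assumes, as the description does, that the roughly 1-in-1024 escape rate applies uniformly to every observed anomaly.

```python
# Figures taken from the issue description; names are illustrative only.
anomalies_per_second = 4      # peak TCP anomalies observed on a single node
seconds_per_hour = 3600
escape_rate = 1 / 2**10       # about 1 in 1024 errors escape detection (per the cited paper)

# Anomalies seen in one hour at the sustained peak rate.
anomalies_per_hour = anomalies_per_second * seconds_per_hour

# Assumption: each anomaly independently has the same chance of escaping detection.
escaped_per_hour = anomalies_per_hour * escape_rate

print(f"anomalies per hour: {anomalies_per_hour}")
print(f"expected escaped errors per hour: {escaped_per_hour:.1f}")
```

At 4 anomalies per second this gives 14,400 anomalies per hour, which matches the "14k per hour" and "14 escaped packet errors per hour" figures quoted above.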
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira