[
https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036525#comment-13036525
]
Todd Lipcon commented on HDFS-1057:
-----------------------------------
Aha! I think I understand what's going on here!
The test has a thread that continually re-opens the file being written to.
Since the file is still in the middle of being written, each open makes an RPC
to the DataNode to determine the file's visible length. This RPC is
authenticated using the block token that came back in the LocatedBlocks object,
which serves as the security ticket.
When this RPC hits the IPC layer, the client looks at its existing connections
and finds none it can reuse, since the block token differs from the one attached
to any existing connection. Hence, it opens a new connection, and we end up with
hundreds or thousands of IPC connections to the DataNode.
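To illustrate the reuse problem, here is a minimal sketch of a client-side connection cache, assuming (as the comment implies) that a cached connection is reused only when the server address, protocol and security ticket all match. The class, record and method names, the address string and the protocol label are all hypothetical; this is not Hadoop's IPC code, just a model of why a fresh block token per re-open defeats connection reuse.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical model of an IPC client connection cache. Assumption: a
 * connection is reusable only if (address, protocol, ticket) all match.
 * Not Hadoop's actual implementation.
 */
public class ConnectionCacheSketch {

    /** Cache key; records give value-based equals/hashCode for free. */
    record ConnectionKey(String address, String protocol, String ticket) {}

    /** Stand-in for an open socket to the DataNode. */
    static final class Connection {
        final ConnectionKey key;
        Connection(ConnectionKey key) { this.key = key; }
    }

    private final Map<ConnectionKey, Connection> connections = new HashMap<>();

    /** Returns the cached connection for this key, opening a new one on a miss. */
    public Connection getConnection(String address, String protocol, String ticket) {
        ConnectionKey key = new ConnectionKey(address, protocol, ticket);
        return connections.computeIfAbsent(key, Connection::new);
    }

    /** Number of distinct connections opened so far. */
    public int openConnections() {
        return connections.size();
    }

    public static void main(String[] args) {
        // Each re-open returns a different block token, so every
        // visible-length RPC carries a new ticket and misses the cache.
        ConnectionCacheSketch withTokens = new ConnectionCacheSketch();
        for (int reopen = 0; reopen < 1000; reopen++) {
            withTokens.getConnection("datanode-1:50020", "visible-length-rpc", "blockToken-" + reopen);
        }
        System.out.println(withTokens.openConnections()); // 1000

        // Without block tokens every RPC has the same key, so one connection is reused.
        ConnectionCacheSketch noTokens = new ConnectionCacheSketch();
        for (int reopen = 0; reopen < 1000; reopen++) {
            noTokens.getConnection("datanode-1:50020", "visible-length-rpc", "no-token");
        }
        System.out.println(noTokens.openConnections()); // 1
    }
}
```

Keying the cache on the ticket is what keeps differently-authenticated callers from sharing a connection; the side effect modeled here is that a per-open token turns every RPC into a cache miss.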
This also explains why Sam doesn't see it on his 0.20-append branch: there are
no block tokens there, so the RPC connection is reused properly.
I'll file another JIRA about this issue.
> Concurrent readers hit ChecksumExceptions if following a writer to very end of file
> -----------------------------------------------------------------------------------
>
> Key: HDFS-1057
> URL: https://issues.apache.org/jira/browse/HDFS-1057
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: data-node
> Affects Versions: 0.20-append, 0.21.0, 0.22.0
> Reporter: Todd Lipcon
> Assignee: sam rash
> Priority: Blocker
> Fix For: 0.20-append, 0.21.0, 0.22.0
>
> Attachments: HDFS-1057-0.20-append.patch,
> conurrent-reader-patch-1.txt, conurrent-reader-patch-2.txt,
> conurrent-reader-patch-3.txt, hdfs-1057-trunk-1.txt, hdfs-1057-trunk-2.txt,
> hdfs-1057-trunk-3.txt, hdfs-1057-trunk-4.txt, hdfs-1057-trunk-5.txt,
> hdfs-1057-trunk-6.txt
>
>
> BlockReceiver.receivePacket calls replicaInfo.setBytesOnDisk before calling
> flush(). A concurrent reader can therefore race with the writer: it may see
> the new length while those bytes are still in BlockReceiver's buffers, so the
> client may see checksum errors or EOFs. Additionally, the last checksum chunk
> of the file is made accessible to readers even though it is not yet stable.
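The ordering problem in the issue description can be modeled with a small sketch, assuming a single writer thread, a buffered output stream standing in for BlockReceiver's buffers, and a volatile field standing in for the length set by setBytesOnDisk. All names here are hypothetical, not HDFS's. A callback runs in the window between the two steps to simulate a reader scheduled at exactly that moment, which makes the race deterministic.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical model of publishing a replica length before vs. after flush(). */
public class PacketWriterSketch {

    /** Bytes that have actually reached the underlying "disk" stream. */
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream();
    /** Writer-side buffer, analogous to BlockReceiver's output buffers. */
    private final OutputStream out = new BufferedOutputStream(disk, 64 * 1024);
    /** Length advertised to readers; volatile so readers see the latest value. */
    private volatile long bytesOnDisk = 0;

    /**
     * Writes one packet. If publishBeforeFlush is true this mirrors the buggy
     * order (set length, then flush); otherwise it flushes first. The callback
     * simulates a reader that runs in the window between the two steps.
     * Assumes a single writer thread, so the non-atomic += is safe here.
     */
    public void receivePacket(byte[] data, boolean publishBeforeFlush, Runnable readerInWindow) {
        try {
            out.write(data);
            if (publishBeforeFlush) {
                bytesOnDisk += data.length; // BUG: length advertised while bytes are still buffered
                readerInWindow.run();
                out.flush();
            } else {
                out.flush();                // bytes reach the underlying stream first
                readerInWindow.run();
                bytesOnDisk += data.length; // only then is the new length published
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if every advertised byte is actually readable from the underlying stream. */
    public boolean readerSeesOnlyFlushedBytes() {
        long advertised = bytesOnDisk;      // read the published length first
        return advertised <= disk.size();   // then check what has really been written
    }

    public static void main(String[] args) {
        PacketWriterSketch buggy = new PacketWriterSketch();
        boolean[] buggyOk = {true};
        buggy.receivePacket(new byte[512], true, () -> buggyOk[0] = buggy.readerSeesOnlyFlushedBytes());
        System.out.println(buggyOk[0]); // false: reader was told about 512 unflushed bytes

        PacketWriterSketch safe = new PacketWriterSketch();
        boolean[] safeOk = {false};
        safe.receivePacket(new byte[512], false, () -> safeOk[0] = safe.readerSeesOnlyFlushedBytes());
        System.out.println(safeOk[0]); // true: reader still sees the old length
    }
}
```

This only covers the flush-before-publish ordering; it does not model the separate point about the last, still-changing checksum chunk.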