[
https://issues.apache.org/jira/browse/HDFS-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ramtin updated HDFS-1950:
-------------------------
Assignee: (was: ramtin)
> Blocks that are under construction are not getting read if the blocks are
> more than 10. Only complete blocks are read properly.
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-1950
> URL: https://issues.apache.org/jira/browse/HDFS-1950
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client, namenode
> Affects Versions: 0.20.205.0
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Attachments: HDFS-1950-2.patch, HDFS-1950.1.patch,
> hdfs-1950-0.20-append-tests.txt, hdfs-1950-trunk-test.txt,
> hdfs-1950-trunk-test.txt
>
>
> Before going to the root cause, let's look at the read behavior for a file
> having more than 10 blocks in the append case.
> Logic:
> ====
> There is a prefetch size, dfs.read.prefetch.size, for the DFSInputStream,
> which defaults to 10 blocks' worth of data.
> This prefetch size is the number of blocks whose locations the client fetches
> from the namenode in a single call when reading a file.
> For example, assume a file X with 22 blocks resides in HDFS.
> The reader first fetches the first 10 block locations from the namenode and
> starts reading.
> After the above step, the reader fetches the next 10 blocks from the NN and
> continues reading.
> Then the reader fetches the remaining 2 blocks from the NN and completes the
> read.
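> A rough sketch of this batched fetching (getBlockLocations is the real
> ClientProtocol call; the loop is only an illustration of the behavior
> described above, not the actual DFSInputStream code):
>
>     // assumes namenode (ClientProtocol), src, fileLength and
>     // prefetchSize are in scope, as they are inside DFSInputStream
>     long offset = 0;
>     while (offset < fileLength) {
>       // one RPC returns the locations for the next batch of blocks
>       LocatedBlocks batch =
>           namenode.getBlockLocations(src, offset, prefetchSize);
>       for (LocatedBlock blk : batch.getLocatedBlocks()) {
>         // ... read this block from one of its datanodes ...
>         offset += blk.getBlockSize();
>       }
>     }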
> Cause:
> =======
> Now let's look at the cause of this issue.
> The failing scenario is: "The writer wrote 10+ blocks plus a partial block
> and called sync. A reader trying to read the file will not get the last
> partial block."
> The client first gets the first 10 block locations from the NN. It then
> checks whether the file is under construction; if so, it gets the size of the
> last partial block from a datanode and can read the full file.
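> In code, the first fetch does roughly the following (simplified;
> readBlockLength is the DFSInputStream helper that asks a datanode for the
> replica's visible length, and exact details vary across versions):
>
>     LocatedBlocks located = namenode.getBlockLocations(src, 0, prefetchSize);
>     if (located.isUnderConstruction()) {
>       // refresh the length of the last, still-growing block
>       LocatedBlock last = located.get(located.locatedBlockCount() - 1);
>       long visibleLength = readBlockLength(last);
>       last.getBlock().setNumBytes(visibleLength);
>     }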
> However, when the number of blocks is more than 10, the last block will not
> be in the first fetch; it arrives only in a later fetch (the last block comes
> in the (number of blocks / 10)th fetch).
> The problem is that, for every fetch other than the first one, the DFSClient
> has no logic to get the size of the last partial block (as it does in the
> first-fetch case above), so the reader cannot read all of the data that was
> synced.
> Also, the InputStream.available API iterates using the file size computed
> from the first fetch; ideally this size has to be increased as well.
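> One way to picture the fix implied above (a sketch of the idea only, not
> the attached patch; isLastBlockIncluded is a hypothetical helper): apply the
> same under-construction handling to every fetch that covers the last block,
> and grow the length that available() works against:
>
>     LocatedBlocks batch =
>         namenode.getBlockLocations(src, offset, prefetchSize);
>     if (batch.isUnderConstruction() && isLastBlockIncluded(batch)) {
>       LocatedBlock last = batch.get(batch.locatedBlockCount() - 1);
>       long visibleLength = readBlockLength(last);  // ask a datanode
>       last.getBlock().setNumBytes(visibleLength);
>       // update the stream's file length so available() sees the synced data
>       fileLength = last.getStartOffset() + visibleLength;
>     }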
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)