[
https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HADOOP-3205:
--------------------------------
Attachment: hadoop-3205.txt
Here's a patch that fixes the bugs that caused the unit test failures.
There's still one TODO in the code: figuring out a good setting for MAX_CHUNKS
(i.e., the max number of checksum chunks that should be read in one call to
the underlying stream).
This is still a TODO because I made an odd discovery here. The logic we were
going on was that the performance improvement came from eliminating a buffer
copy when the size of the read is >= the size of the buffer in the underlying
BufferedInputStream. That would mean the correct value for MAX_CHUNKS is
ceil(io.file.buffer.size / 512) (i.e., 256 for the 128KB buffer I was testing
with). If MAX_CHUNKS were any smaller, reads against the BufferedInputStream
would be smaller than its buffer, so you'd still incur a copy.
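For concreteness, here's a minimal, self-contained sketch of that sizing
arithmetic. The class and method names (MaxChunksSizing, maxChunksFor) are
illustrative stand-ins, not taken from the patch:
{code:java}
// Illustrative sketch only -- names are hypothetical, not from the patch.
// Sizes MAX_CHUNKS so that one multi-chunk read exactly fills the buffer
// of the underlying BufferedInputStream, per the reasoning above.
public class MaxChunksSizing {
  private static final int BYTES_PER_CHECKSUM = 512;

  // ceil(bufferSize / BYTES_PER_CHECKSUM) in integer arithmetic
  static int maxChunksFor(int bufferSize) {
    return (bufferSize + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
  }

  public static void main(String[] args) {
    System.out.println(maxChunksFor(128 * 1024)); // 256 for a 128KB buffer
    System.out.println(maxChunksFor(64 * 1024));  // 128 for a 64K buffer
  }
}
{code}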
However, my benchmarking shows that this *isn't* the source of the
performance gain. Even with MAX_CHUNKS set to 4, there's a significant
improvement over MAX_CHUNKS set to 1. And there is no significant difference
between MAX_CHUNKS=127 and MAX_CHUNKS=128 for a 64K buffer, even though the
reasoning above says 128 should eliminate a copy while 127 should not.
So I think this is actually improving performance because of some other
effect, such as better cache locality from operating on larger chunks.
Admittedly, cache locality is always the fallback excuse for a performance
increase, but I don't have a better explanation yet. Anyone care to hazard a
guess?
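For reviewers, the shape of the multi-chunk read path is roughly the
following. This is a simplified, self-contained sketch of the idea under
discussion, not the attached patch; the class and method names are stand-ins
for the FSInputChecker internals, and the checksum verification is elided:
{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified sketch: read up to MAX_CHUNKS checksum-sized chunks from the
// underlying stream directly into the caller's buffer in a single call,
// rather than one 512-byte chunk at a time.
public class MultiChunkReadSketch {
  static final int BYTES_PER_CHECKSUM = 512;
  static final int MAX_CHUNKS = 4; // one of the values benchmarked above

  static int readChunks(InputStream in, byte[] userBuf, int off, int len)
      throws IOException {
    // Round the request down to whole chunks, capped at MAX_CHUNKS, but
    // always ask for at least one chunk.
    int chunks = Math.max(1, Math.min(len / BYTES_PER_CHECKSUM, MAX_CHUNKS));
    int toRead = Math.min(len, chunks * BYTES_PER_CHECKSUM);
    int nRead = in.read(userBuf, off, toRead);
    // ...checksum verification over userBuf[off, off + nRead) goes here...
    return nRead;
  }

  public static void main(String[] args) throws IOException {
    byte[] userBuf = new byte[4096];
    InputStream in = new ByteArrayInputStream(new byte[8192]);
    int n = readChunks(in, userBuf, 0, userBuf.length);
    System.out.println("read " + n + " bytes in one call"); // 2048
  }
}
{code}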
> Read multiple chunks directly from FSInputChecker subclass into user buffers
> ----------------------------------------------------------------------------
>
> Key: HADOOP-3205
> URL: https://issues.apache.org/jira/browse/HADOOP-3205
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Reporter: Raghu Angadi
> Assignee: Todd Lipcon
> Attachments: hadoop-3205.txt, hadoop-3205.txt
>
>
> Implementations of FSInputChecker and FSOutputSummer, like DFS, do not have
> access to the full user buffer. At any time DFS can access only up to 512
> bytes, even though the user usually reads with a much larger buffer (often
> controlled by io.file.buffer.size). This forces an implementation to
> double-buffer data if it wants to read or write larger chunks of data from
> the underlying storage.
> We could split the changes for FSInputChecker and FSOutputSummer into two
> separate jiras.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.