[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764428#comment-13764428 ]
Owen O'Malley commented on HDFS-4953:
-------------------------------------

{quote}
You can't know ahead of time whether your call to mmap will succeed. As I said, mmap can fail, for dozens of reasons. And of course blocks move over time. There is a fundamental "time of check, time of use" (TOCTOU) race condition in this kind of API.
{quote}

Ok, I guess I'm fine with the exception, assuming the user passed in a null factory. It will be expensive in terms of time, but it won't affect the vast majority of users.

{quote}
is it necessary to read 200 MB at a time to decode the ORC file format?
{quote}

Actually, yes. The set of rows that are written together is large (typically ~200MB) so that reading them is efficient. For a 100-column table, that means you have all of the values for column 1 in the first ~2MB, followed by all of the values for column 2 in the next ~2MB, and so on. To read the first row, you need all 100 of the 2MB sections. Obviously mmapping this is much more efficient, because the pages of the file can be brought in as needed.

{quote}
There is already a method named FSDataInputStream#read(ByteBuffer buf) in FSDataInputStream. If we create a new method named FSDataInputStream#readByteBuffer, I would expect there to be some confusion between the two. That's why I proposed FSDataInputStream#readZero for the new name. Does that make sense?
{quote}

I see your point, but readZero, which sounds like it just fills zeros into a byte buffer, doesn't convey the right meaning. The fundamental action the user is taking is in fact a read. I'd propose that we overload it with the other read and document that this read supports zero copy while the other doesn't. How does this look?

{code}
/**
 * Read a byte buffer from the stream using zero copy if possible. Typically the read will
 * return maxLength bytes, but it may return fewer at the end of the file system block or
 * the end of the file.
 * @param factory a factory that creates ByteBuffers for the read if the region of the file
 *                can't be mmapped
 * @param maxLength the maximum number of bytes that will be returned
 * @return a ByteBuffer with between 1 and maxLength bytes from the file. The buffer should
 *         be released to the stream when the user is done with it.
 */
public ByteBuffer read(ByteBufferFactory factory, int maxLength) throws IOException;
{code}

{quote}
I'd like to get some other prospective zero-copy API users to comment on whether they like the wrapper object or the DFSInputStream#releaseByteBuffer approach better...
{quote}

Uh, that is exactly what is happening. I'm a user who is trying to use this interface for a very typical use case of quickly reading bytes that may or may not be on the local machine. I also care a lot about APIs and have been working on Hadoop for 7.75 years.

{quote}
If, instead of returning a ByteBuffer from the readByteBuffer call, we returned a ZeroBuffer object wrapping the ByteBuffer, we could simply call ZeroBuffer#close()
{quote}

Users don't want to code against some Hadoop-specific type named ZeroBuffer. They want a ByteBuffer because it is a standard Java type. To make this concrete and crystal clear: I have to make Hive and ORC work with both Hadoop 1.x and Hadoop 2.x. Therefore, if you use a non-standard type, I need to wrap it in a shim. That sucks, especially if it is in the inner loop, which this absolutely would be.
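To illustrate, this is roughly the kind of reflection-based shim I'd end up writing (a sketch only; readZeroCopy is a placeholder for whatever method name and signature this JIRA settles on, and the only real APIs used are the existing FSDataInputStream read calls):

{code}
// Sketch only: "readZeroCopy" does not exist in any released Hadoop; it stands in for the
// zero-copy read proposed in this JIRA. Callers of the shim only ever see java.nio.ByteBuffer.
import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

public class ByteBufferShim {
  private static final Method ZERO_COPY_READ = findZeroCopyRead();

  private static Method findZeroCopyRead() {
    try {
      // Placeholder name and signature for whatever zero-copy read HDFS-4953 exposes.
      return FSDataInputStream.class.getMethod("readZeroCopy", int.class);
    } catch (NoSuchMethodException e) {
      return null; // Hadoop 1.x (or an older 2.x): no zero-copy read available.
    }
  }

  /** Always hands back a plain ByteBuffer, whichever Hadoop version is on the classpath. */
  public static ByteBuffer read(FSDataInputStream in, int maxLength) throws IOException {
    if (ZERO_COPY_READ != null) {
      try {
        return (ByteBuffer) ZERO_COPY_READ.invoke(in, maxLength);
      } catch (Exception e) {
        throw new IOException("zero-copy read failed", e);
      }
    }
    // Fallback for older Hadoop: an ordinary copying read into a heap buffer.
    byte[] data = new byte[maxLength];
    int n = in.read(data, 0, maxLength);
    return (n < 0) ? null : ByteBuffer.wrap(data, 0, n);
  }
}
{code}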
I *need* a ByteBuffer because I can make a shim that always returns a ByteBuffer that works regardless of which version of Hadoop the user is using.

> enable HDFS local reads via mmap
> --------------------------------
>
> Key: HDFS-4953
> URL: https://issues.apache.org/jira/browse/HDFS-4953
> Project: Hadoop HDFS
> Issue Type: New Feature
> Affects Versions: 2.3.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Fix For: HDFS-4949
>
> Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access files directly without going through the DataNode. However, all of these reads involve a copy at the operating system level, since they rely on the read() / pread() / etc. family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap. This would enable truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when checksums are disabled. Later, we can use the DataNode's cache awareness to only perform zero-copy reads when we know that the checksum has already been verified.