[ https://issues.apache.org/jira/browse/HDFS-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764428#comment-13764428 ]
Owen O'Malley commented on HDFS-4953:
-------------------------------------

{quote}
You can't know ahead of time whether your call to mmap will succeed. As I said, mmap can fail, for dozens of reasons. And of course blocks move over time. There is a fundamental "time of check, time of use" (TOCTOU) race condition in this kind of API.
{quote}

Ok, I guess I'm fine with the exception, assuming the user passed in a null factory. It will be expensive in terms of time, but it won't affect the vast majority of users.

{quote}
is it necessary to read 200 MB at a time to decode the ORC file format?
{quote}

Actually, yes. The set of rows that are written together is large (typically ~200MB) so that reading them is efficient. For a 100-column table, that means you have all of the values for column 1 in the first ~2MB, followed by all of the values for column 2 in the next ~2MB, and so on. To read the first row, you need all 100 of the 2MB sections. Obviously mmapping this is much more efficient, because the pages of the file can be brought in as needed.

{quote}
There is already a method named FSDataInputStream#read(ByteBuffer buf) in FSDataInputStream. If we create a new method named FSDataInputStream#readByteBuffer, I would expect there to be some confusion between the two. That's why I proposed FSDataInputStream#readZero for the new name. Does that make sense?
{quote}

I see your point, but readZero, which sounds like it just fills zeros into a byte buffer, doesn't convey the right meaning. The fundamental action the user is taking is in fact a read. I'd propose that we overload it with the other read and document that this read supports zero copy while the other doesn't. How does this look?

{code}
/**
 * Read a byte buffer from the stream using zero copy if possible. Typically the read will
 * return maxLength bytes, but it may return fewer at the end of the file system block or
 * the end of the file.
 * @param factory a factory that creates ByteBuffers for the read if the region of the file
 *                can't be mmapped
 * @param maxLength the maximum number of bytes that will be returned
 * @return a ByteBuffer with between 1 and maxLength bytes from the file. The buffer should
 *         be released to the stream when the user is done with it.
 */
public ByteBuffer read(ByteBufferFactory factory, int maxLength) throws IOException;
{code}

{quote}
I'd like to get some other prospective zero-copy API users to comment on whether they like the wrapper object or the DFSInputStream#releaseByteBuffer approach better...
{quote}

Uh, that is exactly what is happening. I'm a user who is trying to use this interface for a very typical use case of quickly reading bytes that may or may not be on the local machine. I also care a lot about APIs and have been working on Hadoop for 7.75 years.

{quote}
If, instead of returning a ByteBuffer from the readByteBuffer call, we returned a ZeroBuffer object wrapping the ByteBuffer, we could simply call ZeroBuffer#close()
{quote}

Users don't want to code against some Hadoop-specific type named ZeroBuffer. They want a ByteBuffer because it is a standard Java type. To make this concrete and crystal clear: I have to make Hive and ORC work with both Hadoop 1.x and Hadoop 2.x. Therefore, if you use a non-standard type, I need to wrap it in a shim. That sucks, especially if it is in the inner loop, which this absolutely would be.
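To illustrate, this is roughly the kind of reflection-based shim I'd end up writing (a sketch only; readZeroCopy is a placeholder for whatever method name and signature this JIRA settles on, and the only real APIs used are the existing FSDataInputStream read calls):

{code}
// Sketch only: "readZeroCopy" does not exist in any released Hadoop; it stands in for the
// zero-copy read proposed in this JIRA. Callers of the shim only ever see java.nio.ByteBuffer.
import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

public class ByteBufferShim {
  private static final Method ZERO_COPY_READ = findZeroCopyRead();

  private static Method findZeroCopyRead() {
    try {
      // Placeholder name and signature for whatever zero-copy read HDFS-4953 exposes.
      return FSDataInputStream.class.getMethod("readZeroCopy", int.class);
    } catch (NoSuchMethodException e) {
      return null; // Hadoop 1.x (or an older 2.x): no zero-copy read available.
    }
  }

  /** Always hands back a plain ByteBuffer, whichever Hadoop version is on the classpath. */
  public static ByteBuffer read(FSDataInputStream in, int maxLength) throws IOException {
    if (ZERO_COPY_READ != null) {
      try {
        return (ByteBuffer) ZERO_COPY_READ.invoke(in, maxLength);
      } catch (Exception e) {
        throw new IOException("zero-copy read failed", e);
      }
    }
    // Fallback for older Hadoop: an ordinary copying read into a heap buffer.
    byte[] data = new byte[maxLength];
    int n = in.read(data, 0, maxLength);
    return (n < 0) ? null : ByteBuffer.wrap(data, 0, n);
  }
}
{code}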
I *need* a ByteBuffer because I can make a shim that always returns a ByteBuffer that works regardless of which version of Hadoop the user is using.

> enable HDFS local reads via mmap
> --------------------------------
>
> Key: HDFS-4953
> URL: https://issues.apache.org/jira/browse/HDFS-4953
> Project: Hadoop HDFS
> Issue Type: New Feature
> Affects Versions: 2.3.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Fix For: HDFS-4949
>
> Attachments: benchmark.png, HDFS-4953.001.patch, HDFS-4953.002.patch, HDFS-4953.003.patch, HDFS-4953.004.patch, HDFS-4953.005.patch, HDFS-4953.006.patch, HDFS-4953.007.patch, HDFS-4953.008.patch
>
>
> Currently, the short-circuit local read pathway allows HDFS clients to access files directly without going through the DataNode. However, all of these reads involve a copy at the operating system level, since they rely on the read() / pread() / etc. family of kernel interfaces.
> We would like to enable HDFS to read local files via mmap. This would enable truly zero-copy reads.
> In the initial implementation, zero-copy reads will only be performed when checksums are disabled. Later, we can use the DataNode's cache awareness to only perform zero-copy reads when we know that the checksum has already been verified.