[
https://issues.apache.org/jira/browse/HDFS-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838159#comment-13838159
]
Colin Patrick McCabe commented on HDFS-5182:
--------------------------------------------
So, previously we discussed a few different ways for the {{DataNode}} to notify
the {{DFSClient}} about a change in the block's mlock status.
One way (let's call this choice #1) was using a shared memory segment. This
would take the form of a third file descriptor passed from the {{DataNode}} to
the {{DFSClient}}. On Linux, this would simply be a 4 KB file on the
{{/dev/shm}} filesystem, which is a {{tmpfs}} filesystem. That filesystem is
the best choice because it will not cause the file to be written back to disk
every {{dirty_writeback_centisecs}}.
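Hadoop's actual implementation would be Java/JNI; as an OS-level sketch, the following Python shows the mechanism choice #1 relies on: a small tmpfs-backed file, mapped by both processes, where a flag written by one side is immediately visible to the other. The segment size and the flag-at-offset-0 convention are assumptions for illustration only.

```python
import mmap
import os
import tempfile

def create_shm_segment(size=4096):
    # Create an unlinked 4 KB file backed by tmpfs (falls back to the
    # default temp dir if /dev/shm is unavailable on this system).
    shm_dir = "/dev/shm" if os.path.isdir("/dev/shm") else None
    fd, path = tempfile.mkstemp(dir=shm_dir)
    os.unlink(path)  # name is gone; the segment lives while an fd stays open
    os.ftruncate(fd, size)
    return fd

def map_segment(fd, size=4096):
    # Each process mmaps its own copy of the fd; the pages are shared.
    return mmap.mmap(fd, size)

# The DataNode would create the segment and pass the fd to the client over
# the existing domain socket (via sendmsg with SCM_RIGHTS); here we simulate
# the client's side with a dup'd descriptor in the same process.
fd = create_shm_segment()
dn_view = map_segment(fd)
client_view = map_segment(os.dup(fd))
dn_view[0] = 1  # DataNode flips a hypothetical "block is mlocked" flag
```

Note that this is exactly the weakness described below: once the fd has been handed over, the creator cannot tell when (or whether) the peer unmaps and closes it.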
However, on looking into this further, I found some issues with this method.
There is no way for the {{DataNode}} to know when the {{DFSClient}} has closed
the file descriptor for the shared memory area. We could add some kind of
keepalive protocol where the client periodically writes to an agreed-upon
location, but that would add a fair amount of complexity, and a long garbage
collection pause on the {{DFSClient}} or {{DataNode}} could cause a keepalive
deadline to be missed accidentally.
Another issue is that there is no way for the {{DataNode}} to revoke access to
this shared memory segment. If the {{DFSClient}} wants to hold on to it
forever, leaking memory, it can do that. This opens a security hole: the
client might not have the UNIX permissions to create files in {{/dev/shm}}
itself, but through this mechanism it could consume an arbitrary amount of
space there.
The other way (let's call this choice #2) is for the client to keep open the
UNIX domain socket it used to request the two file descriptors. If we listen
for messages sent on this socket, we get a truly edge-triggered notification
method. The messages can be as short as a single byte, since our message needs
are very simple. This requires adding an epoll loop on the client side so
these notifications can be handled without consuming a whole thread per
socket.
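A minimal sketch of choice #2, again in Python rather than Hadoop's Java/JNI: one selector loop (epoll on Linux) watches many client-side sockets, and the DataNode signals a status change with a single byte. The byte values are hypothetical, not from the actual DataTransferProtocol.

```python
import selectors
import socket

# Hypothetical one-byte protocol: not the actual HDFS wire format.
CACHED = b"\x01"    # block is now mlocked; zero-copy reads are safe
UNCACHED = b"\x00"  # block is no longer mlocked

# socketpair stands in for the UNIX domain socket kept open after
# REQUEST_SHORT_CIRCUIT_FDS.
dn_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
client_end.setblocking(False)

# One selector (epoll-backed on Linux) can watch every DataNode socket,
# so no thread is dedicated to any single connection.
sel = selectors.DefaultSelector()
sel.register(client_end, selectors.EVENT_READ)

dn_end.send(CACHED)  # DataNode: the mapped region is now locked in memory

notifications = []
for key, _ in sel.select(timeout=1.0):
    notifications.append(key.fileobj.recv(1))
```

The edge-triggered character comes from the message itself: the client learns of a transition exactly once, when the byte arrives, rather than polling the DataNode for current status.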
Regardless of whether we go with choice #1 or #2, there are some other things
that need to be done.
* Right now, we don't allow {{BlockReaderLocal}} instances to share file
descriptors with each other. However, this would be advisable, to avoid
creating 100 pipes/shm areas when someone re-opens the same file 100 times.
Doing this is actually an easy change (I wrote and tested the patch already).
* We need to revise {{FileInputStreamCache}} to store the communication channel
(pipe or shared memory area) that will deliver notifications. This cache also
needs support for dealing with mmap regions, and for {{BlockReaderLocal}}
instances sharing FDs and mmaps. I have a patch which reworks this cache, but
it's not quite done yet.
* {{BlockReaderLocal}} needs support for switching back and forth between
honoring checksums and skipping them. I have a patch which substantially
reworks {{BlockReaderLocal}} to add this capability, which I'm considering
posting as a separate JIRA.
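The fd-sharing point above can be sketched as a reference-counted cache. All names here ({{SharedFd}}, {{FdCache}}) are hypothetical, not taken from the Hadoop source; the idea is only that re-opening the same block must hand back the existing descriptor rather than creating a new pipe/shm area per open.

```python
import os

class SharedFd:
    """Hypothetical ref-counted wrapper around one block's descriptor."""
    def __init__(self, fd):
        self.fd = fd
        self.refcount = 1

    def addref(self):
        self.refcount += 1
        return self

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            os.close(self.fd)  # last reader gone: close the real fd
        return self.refcount

class FdCache:
    """Maps block id -> shared descriptor, so 100 opens != 100 fds."""
    def __init__(self):
        self._by_block = {}

    def get(self, block_id, opener):
        entry = self._by_block.get(block_id)
        if entry is not None:
            return entry.addref()  # re-open of the same block shares the fd
        entry = SharedFd(opener())
        self._by_block[block_id] = entry
        return entry

    def release(self, block_id):
        entry = self._by_block[block_id]
        if entry.unref() == 0:
            del self._by_block[block_id]

# Demo: two opens of the same (made-up) block share one descriptor.
cache = FdCache()
opens = []
def open_block():
    fd = os.open(os.devnull, os.O_RDONLY)
    opens.append(fd)
    return fd

first = cache.get("blk_1001", open_block)
second = cache.get("blk_1001", open_block)
```

With notifications in the picture, the same sharing applies to the notification channel: one pipe or shm area per block per DataNode, however many readers are attached.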
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid
> ---------------------------------------------------------------------------------
>
> Key: HDFS-5182
> URL: https://issues.apache.org/jira/browse/HDFS-5182
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Affects Versions: 3.0.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
>
> BlockReaderLocal must allow zero-copy reads only when the DN believes it's
> valid. This implies adding a new field to the response to
> REQUEST_SHORT_CIRCUIT_FDS. We also need some kind of heartbeat from the
> client to the DN, so that the DN can inform the client when the mapped region
> is no longer locked into memory.
--
This message was sent by Atlassian JIRA
(v6.1#6144)