[
https://issues.apache.org/jira/browse/HDFS-6803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093266#comment-14093266
]
Colin Patrick McCabe commented on HDFS-6803:
--------------------------------------------
[[email protected]]: Hmm. I don't see much advantage in making
non-positional reads concurrent. When two threads do non-positional reads,
they inherently interfere with each other by modifying the position. So
concurrent non-positional reads would not be very useful for most programmers,
since you would basically not know what offset your read was starting at. It
would depend on the peculiarities of thread timing.
Concurrent positional reads (preads) are useful precisely because they don't
have this problem. You're not sharing a stream position with any other thread,
so you know what you're getting with your pread.
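As a rough illustration of that guarantee (this is not HDFS code), here is a sketch using {{java.nio.channels.FileChannel}} as a stand-in, since its {{read(ByteBuffer, long)}} is also a positional read. The class name {{PreadPositionDemo}} and the temp-file contents are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; FileChannel stands in for an HDFS input stream.
class PreadPositionDemo {
    /** Returns {position before the pread, position after it, bytes the pread returned}. */
    static long[] run() {
        try {
            Path file = Files.createTempFile("pread-demo", ".txt");
            try {
                Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    ch.position(2);                  // the shared stream position
                    long before = ch.position();
                    ByteBuffer buf = ByteBuffer.allocate(3);
                    int n = ch.read(buf, 5);         // positional read at offset 5
                    long after = ch.position();      // unchanged by the positional read
                    return new long[] {before, after, n};
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = run();
        System.out.println("position before=" + r[0] + ", after=" + r[1] + ", bytes read=" + r[2]);
    }
}
```

A caller doing the pread never has to care where some other thread left the stream position, which is the property that makes concurrent preads usable.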
I think if we do allow concurrent non-positional reads, we should also document
that this is optional, and that the stream never reads the same byte offset
more than once.
bq. getPos() may block for an arbitrary amount of time if another thread is
attempting to perform a positioned read and is having some problem
communicating with the far end. Is that something we really want? Is it
something people expect?
HDFS has had this behavior for a long time. I checked back in Hadoop 0.20 and
the {{synchronized}} is there on getPos and read. I would be ok with getPos
returning the position locklessly (perhaps from an AtomicLong?) but to my
knowledge, nobody has ever requested that we change this.
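To make the AtomicLong idea concrete, here is a minimal sketch over an in-memory byte array. It is not the DFSInputStream implementation; the class {{LockFreePosStream}} and the helper {{getPosWhileLocked}} are hypothetical names. Reads and seeks still take the stream lock, but {{getPos}} only reads the atomic and so does not wait behind a slow read.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stream; the byte array stands in for remote data.
class LockFreePosStream {
    private final byte[] data;
    private final AtomicLong pos = new AtomicLong();

    LockFreePosStream(byte[] data) { this.data = data; }

    // Stateful read: holds the stream lock and advances the position.
    synchronized int read(byte[] buf, int off, int len) {
        long p = pos.get();
        if (p >= data.length) return -1;
        int n = (int) Math.min(len, data.length - p);
        System.arraycopy(data, (int) p, buf, off, n);
        pos.set(p + n);
        return n;
    }

    synchronized void seek(long newPos) {
        if (newPos < 0 || newPos > data.length) {
            throw new IllegalArgumentException("bad position " + newPos);
        }
        pos.set(newPos);
    }

    // Lock-free: never waits for a thread that holds the stream lock.
    long getPos() { return pos.get(); }

    /** Calls getPos() while another thread owns the stream's monitor. */
    static long getPosWhileLocked(LockFreePosStream s) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (s) {           // simulates a read stuck on the far end
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.start();
        try {
            held.await();
            return s.getPos();           // returns although the monitor is held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
        }
    }

    public static void main(String[] args) {
        LockFreePosStream s = new LockFreePosStream(new byte[10]);
        s.read(new byte[4], 0, 4);
        System.out.println("getPos while locked: " + getPosWhileLocked(s));
    }
}
```

The trade-off in this sketch is that a caller may see the position from just before or just after a concurrent read completes, which seems acceptable since the synchronized version gives no stronger ordering to a thread that is not itself reading.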
{{pread}} should never affect the output of {{getPos}}. That would go against
the basic guarantee of positional read: that it doesn't alter the current
stream position. It doesn't really help FSes that implement pread as
seek+read+seek, either. Those filesystems have a basic problem-- the inability
to do concurrent preads-- that weakening the {{getPos}} guarantee can't
possibly solve. The real solution is to add a better pread implementation to
those filesystems.
(I do not think that concurrent pread should be required of all hadoop FSes,
but it should be highly encouraged for all implementors. And implementing
pread as seek+read+seek should be highly discouraged)
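For comparison, a sketch of the two approaches over an in-memory byte array. The class {{SeekableBytes}} and the method {{preadViaSeek}} are hypothetical and not taken from any Hadoop filesystem. The seek+read+seek variant has to hold the stream lock for the whole operation so that nobody observes the temporary position, which is exactly what rules out concurrent preads.

```java
// Hypothetical seekable stream; the byte array stands in for the backing store.
class SeekableBytes {
    private final byte[] data;
    private long pos;

    SeekableBytes(byte[] data) { this.data = data; }

    synchronized long getPos() { return pos; }

    synchronized void seek(long newPos) { pos = newPos; }

    synchronized int read(byte[] buf, int off, int len) {
        if (pos >= data.length) return -1;
        int n = (int) Math.min(len, data.length - pos);
        System.arraycopy(data, (int) pos, buf, off, n);
        pos += n;
        return n;
    }

    // Discouraged: borrows the shared position, so every pread serializes
    // behind the stream lock and behind every other read, seek, and getPos.
    synchronized int preadViaSeek(long position, byte[] buf, int off, int len) {
        long saved = pos;
        try {
            seek(position);
            return read(buf, off, len);
        } finally {
            seek(saved);                 // restore the caller-visible position
        }
    }

    // Preferred: touches no shared mutable state, so callers can run it
    // concurrently and the stream position is never involved.
    int pread(long position, byte[] buf, int off, int len) {
        if (position < 0) throw new IllegalArgumentException("negative position");
        if (position >= data.length) return -1;
        int n = (int) Math.min(len, data.length - position);
        System.arraycopy(data, (int) position, buf, off, n);
        return n;
    }

    public static void main(String[] args) {
        SeekableBytes s = new SeekableBytes(new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        s.seek(2);
        byte[] buf = new byte[3];
        s.pread(5, buf, 0, 3);
        System.out.println("first byte=" + buf[0] + ", pos=" + s.getPos());
    }
}
```

Both variants return the same bytes and leave {{getPos}} unchanged for a single caller; the difference only shows up under contention.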
I like the idea of saying that operations in "group P" (read, seek, skip,
zero-copy read, releaseBuffer) can block each other, and every other operation
is asynchronous. I think that fits the needs of HBase, MR, and other clients
very well; what do you think?
> Documenting DFSClient#DFSInputStream expectations reading and preading in
> concurrent context
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-6803
> URL: https://issues.apache.org/jira/browse/HDFS-6803
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Affects Versions: 2.4.1
> Reporter: stack
> Attachments: DocumentingDFSClientDFSInputStream (1).pdf
>
>
> Reviews of the patch posted on the parent task suggest that we be more explicit
> about how DFSIS is expected to behave when being read by contending threads.
> It is also suggested that presumptions made internally be made explicit by
> documenting expectations.
> Before we put up a patch we've made a document of assertions we'd like to
> make into tenets of DFSInputStream. If there is agreement, we'll attach to this issue
> a patch that weaves the assumptions into DFSIS as javadoc and class comments.
--
This message was sent by Atlassian JIRA
(v6.2#6252)