[
https://issues.apache.org/jira/browse/HDFS-14111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703069#comment-16703069
]
Steve Loughran commented on HDFS-14111:
---------------------------------------
Thanks for pointing me at this.
#1 Have a look at HADOOP-15229 to see some work I'm doing on a new openFile
operation where
* we can add extensible optional/mandatory config options
* the return value is a completable future, so you may be able to code around
  delays
And in HADOOP-15691 I've been exploring a path capabilities API, so you can
query up front whether a path has a specific option.
If we get HADOOP-15691 in, you could maybe do some check first.
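A rough sketch of what "do some check first" could look like: ask the
filesystem whether the path supports bytebuffer reads rather than probing the
stream with a zero-length read. Everything here is hypothetical and only models
the idea; the CapabilitySource interface and the capability name are made up
for illustration and are not the HADOOP-15691 API.

```java
/** Sketch only: choose a read strategy from a capability query, not from a probe read. */
public class CapabilityProbe {

    /** Hypothetical stand-in for a filesystem that can answer capability queries. */
    interface CapabilitySource {
        boolean hasCapability(String path, String capability);
    }

    /** Hypothetical capability name, invented for this example. */
    static final String BYTEBUFFER_READ = "example.read.bytebuffer";

    /** Decide how to read without touching the stream, so no block reader is opened. */
    static String chooseReadPath(CapabilitySource fs, String path) {
        return fs.hasCapability(path, BYTEBUFFER_READ) ? "direct" : "copying";
    }

    public static void main(String[] args) {
        CapabilitySource withDirect = (path, cap) -> BYTEBUFFER_READ.equals(cap);
        CapabilitySource without = (path, cap) -> false;
        System.out.println(chooseReadPath(withDirect, "/data/file")); // prints "direct"
        System.out.println(chooseReadPath(without, "/data/file"));    // prints "copying"
    }
}
```

The point of the shape is that the decision costs a metadata query at most, and
never any data IO.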
With HADOOP-15229, it'd be nice to add an option which any FS could implement
as their "first read offset": a hint you could use if you wanted to open the
stream immediately, but already knew where you'd start reading from.
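To make the idea concrete, here is a self-contained model of a builder with
optional and mandatory options that returns a future. It is a sketch under
assumptions, not the HADOOP-15229 code: the class names, the option key and the
rule that an unsupported mandatory key is rejected in build() are all
illustrative.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/** Sketch only: open-file builder with opt()/must() and a "first read offset" hint. */
public class OpenFileSketch {

    /** Hypothetical option key, invented for this example. */
    static final String FIRST_READ_OFFSET = "example.read.first.offset";

    /** Minimal stand-in for an opened stream: it only remembers where it starts. */
    static final class Stream {
        final String path;
        final long position;
        Stream(String path, long position) { this.path = path; this.position = position; }
    }

    static final class Builder {
        private final String path;
        private final Set<String> supported;   // keys this "filesystem" understands
        private final Map<String, String> optional = new HashMap<>();
        private final Map<String, String> mandatory = new HashMap<>();

        Builder(String path, Set<String> supported) {
            this.path = path;
            this.supported = supported;
        }

        Builder opt(String key, String value) { optional.put(key, value); return this; }
        Builder must(String key, String value) { mandatory.put(key, value); return this; }

        CompletableFuture<Stream> build() {
            // Mandatory options the filesystem does not know are an error.
            for (String key : mandatory.keySet()) {
                if (!supported.contains(key)) {
                    throw new IllegalArgumentException("unsupported mandatory option: " + key);
                }
            }
            // Optional options the filesystem does not know are silently ignored.
            String hint = null;
            if (supported.contains(FIRST_READ_OFFSET)) {
                hint = mandatory.getOrDefault(FIRST_READ_OFFSET, optional.get(FIRST_READ_OFFSET));
            }
            long offset = (hint == null) ? 0L : Long.parseLong(hint);
            // The open itself runs asynchronously; the caller joins when it needs the stream.
            return CompletableFuture.supplyAsync(() -> new Stream(path, offset));
        }
    }

    public static void main(String[] args) {
        Stream s = new Builder("/data/file", Collections.singleton(FIRST_READ_OFFSET))
                .opt(FIRST_READ_OFFSET, "4096")
                .build()
                .join();
        System.out.println(s.position); // prints 4096
    }
}
```

A filesystem that ignores the hint still works, it just starts at offset 0,
which is why the hint is passed with opt() rather than must() in the usage
above.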
I'd like to see those path capabilities in, as bytebuffers, flush & stuff are
things which it's impossible for all filesystems to support, and even HDFS can
vary things with features like encryption and erasure coding. And the new
openFile() call I really want to get in ASAP. If you can help with that to make
sure it suits this, then that'd be nice.
W.r.t. read(0), I've not thought about it. Would it trigger an EOF exception if
you've already done a seek past EOF? If so, that's a change in the observable
semantics. If not, prebuffering and the like are one of those
implicit-semantics things which are impossible to guarantee preservation of
(or to know how people use them).
All the object stores are now moving to lazy-open, though they do all currently
do some HEAD to get the length of the file, and to check that it's actually
there and that it's not a directory. The async stuff in HADOOP-15229 is to help
deal with the fact that this can be slow, and as there's invariably a thread
pool in the stores, they can do it async (the base impl does it blocking in the
build() call, but delays any IO failures until get()).
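The "blocking in build(), failures delayed until get()" behaviour can be
sketched as below. This is a minimal model, not the Hadoop implementation:
eval() and probeLength() are hypothetical names, and probeLength() merely
stands in for the HEAD-style metadata check.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Sketch only: run the open eagerly, but surface any failure through the future. */
public class DeferredOpen {

    /** Runs the operation on the calling thread and captures the outcome in a future. */
    static <T> CompletableFuture<T> eval(Callable<T> operation) {
        CompletableFuture<T> result = new CompletableFuture<>();
        try {
            result.complete(operation.call());
        } catch (Exception e) {
            // Nothing is thrown here; the caller sees the failure on get()/join().
            result.completeExceptionally(e);
        }
        return result;
    }

    /** Hypothetical metadata probe standing in for a HEAD request. */
    static long probeLength(String path) throws IOException {
        if (path.endsWith("/missing")) {
            throw new FileNotFoundException(path);
        }
        return 1024L;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Long> ok = eval(() -> probeLength("/bucket/present"));
        System.out.println(ok.get()); // prints 1024

        CompletableFuture<Long> failed = eval(() -> probeLength("/bucket/missing"));
        try {
            failed.get();
        } catch (ExecutionException e) {
            System.out.println(e.getCause().getClass().getSimpleName()); // prints FileNotFoundException
        }
    }
}
```

A store with its own thread pool could hand the same Callable to
CompletableFuture.supplyAsync() instead, and callers written against the future
would not need to change.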
> hdfsOpenFile on HDFS causes unnecessary IO from file offset 0
> -------------------------------------------------------------
>
> Key: HDFS-14111
> URL: https://issues.apache.org/jira/browse/HDFS-14111
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client, libhdfs
> Affects Versions: 3.2.0
> Reporter: Todd Lipcon
> Priority: Major
>
> hdfsOpenFile() calls readDirect() with a 0-length argument in order to check
> whether the underlying stream supports bytebuffer reads. With DFSInputStream,
> the read(0) isn't short circuited, and results in the DFSClient opening a
> block reader. In the case of a remote block, the block reader will actually
> issue a read of the whole block, causing the datanode to perform unnecessary
> IO and network transfers in order to fill up the client's TCP buffers. This
> causes performance degradation.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)