[ 
https://issues.apache.org/jira/browse/HDFS-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786998#comment-16786998
 ] 

Gopal V commented on HDFS-14345:
--------------------------------

bq. HDFS-6803, HDFS-6735, HADOOP-15557, and HADOOP-11708 spring to mind. We 
don't dare weaken the thread safety of the default DFS streams.

All of those are about positional reads. I'd rather not touch that part, because 
it is doing the right thing (and ORC is using readFully etc.).

This ticket is specifically about the non-positional read (i.e. the read which 
advances the stream's read offset and isn't really synchronized).

bq. which of these are places where switching to a new version would deliver 
speedups without doing anything to HDFS or other FSDataInputStream-wrapped 
connections?

Most of MapReduce would benefit; the place where this is significant right now 
is Shuffle output/input.

Here are the Tez and Hive fixes I've made (and HADOOP-10694, related to Writable):

https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/common/io/NonSyncByteArrayInputStream.java
+
https://github.com/apache/tez/blob/master/tez-common/src/main/java/org/apache/tez/common/io/NonSyncByteArrayInputStream.java
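
For anyone who doesn't want to click through, the idea behind those classes can 
be sketched roughly as below. This is an illustrative sketch only, not the 
actual Hive or Tez source: the class name NonSyncByteArrayInputStreamSketch is 
made up, and it assumes a single-threaded reader, which is how these streams 
are used in practice. It overrides the synchronized methods of 
java.io.ByteArrayInputStream with unsynchronized versions that work on the 
inherited buf/pos/count fields.

```java
import java.io.ByteArrayInputStream;

/**
 * Hypothetical sketch of a ByteArrayInputStream without monitor locking.
 * NOT thread-safe: assumes exactly one thread reads from the stream.
 */
class NonSyncByteArrayInputStreamSketch extends ByteArrayInputStream {

    NonSyncByteArrayInputStreamSketch(byte[] data) {
        super(data);
    }

    /** Same contract as the parent's read(), minus the synchronized keyword. */
    @Override
    public int read() {
        return (pos < count) ? (buf[pos++] & 0xff) : -1;
    }

    /** Bulk read straight out of the backing array, with no lock taken. */
    @Override
    public int read(byte[] dst, int off, int len) {
        if (dst == null) {
            throw new NullPointerException();
        }
        if (off < 0 || len < 0 || len > dst.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (pos >= count) {
            return -1; // end of stream
        }
        int n = Math.min(len, count - pos);
        if (n <= 0) {
            return 0;
        }
        System.arraycopy(buf, pos, dst, off, n);
        pos += n;
        return n;
    }

    /** Skips up to n bytes and returns how many were actually skipped. */
    @Override
    public long skip(long n) {
        long k = Math.min(n, (long) count - pos);
        if (k < 0) {
            k = 0;
        }
        pos += (int) k;
        return k;
    }

    /** Bytes remaining in the backing array. */
    @Override
    public int available() {
        return count - pos;
    }
}
```

The trade-off is the one discussed above: callers give up thread safety they 
were not relying on for stateful, offset-advancing reads, and in return each 
read() no longer has to acquire a lock.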

> fs.BufferedFSInputStream::read is synchronized
> ----------------------------------------------
>
>                 Key: HDFS-14345
>                 URL: https://issues.apache.org/jira/browse/HDFS-14345
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Gopal V
>            Priority: Major
>         Attachments: bufferedinputstream-read-sync.png
>
>
> BufferedInputStream::read() has performance issues - this can be fixed by 
> wrapping the stream in another non-synchronized buffered inputstream, but 
> that incurs memory copy overheads and is sub-optimal.
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/io/BufferedInputStream.java#L269
> Hadoop fs streams aren't thread-safe (except for ReadFully) and are stateful 
> for position, so this synchronization is purely a tax without benefit.
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java#L35
> The readFully skips the BufferedInputStream super classes.


