[ 
https://issues.apache.org/jira/browse/HADOOP-17038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176942#comment-17176942
 ] 

Steve Loughran commented on HADOOP-17038:
-----------------------------------------

You have to show some impressive speedups here, on your particular codepath.

I can see at a glance that the patch will be pathologically bad for 
ORC/Parquet code, where high-performance back-to-back reads are critical. If 
someone were to set this property in hadoop-site, all analytics queries would 
suffer significantly.

In HBase, you will note that there are two different stream uses, and that for 
anything going end to end through the file, readahead is critical for 
performance. If those full scans don't suffer when you enable the pread, then 
it is going to be down to luck in how they happen to use the APIs -which is 
something at risk of being very, very brittle.


The good news is: there is now a way to specify options like this on a 
file-by-file basis, specifically the openFile() API.

The default implementation of that simply falls back to open(path) -if abfs 
overrode it, there would be opportunities for speedups on opening any file 
(the ability to skip the HEAD check or do the HEAD asynchronously).

That API call, with its ability to set per-open-request 
options/mandatory switches, would let HBase choose a short/zero readahead 
policy for those streams where there is no need for readahead.

I'd argue, then, that the better solution here is not to deliver performance 
in what is quite a dangerous/brittle way, but to allow applications to 
explicitly choose readahead policies on a per-file basis.
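
To make the "options vs. mandatory switches" distinction concrete, here is a 
minimal, self-contained sketch of the semantics. It is not Hadoop code: the 
OpenRequest class and the "example.readahead.bytes" key are hypothetical 
stand-ins. In Hadoop itself the call has the shape 
fs.openFile(path).opt(key, value).build(), which returns a future of the 
stream; which keys abfs would honour is exactly what would need to be defined.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Simplified stand-in for a per-open-request options builder.
 * Hypothetical names throughout; this only models opt() vs must().
 */
final class OpenRequest {
    // The only option this hypothetical stream implementation understands.
    private static final Set<String> KNOWN =
            Collections.singleton("example.readahead.bytes");

    private final Map<String, String> optional = new HashMap<>();
    private final Map<String, String> mandatory = new HashMap<>();

    /** Optional hint: ignored if the implementation does not know the key. */
    OpenRequest opt(String key, String value) {
        optional.put(key, value);
        return this;
    }

    /** Mandatory switch: build() fails if the implementation does not know the key. */
    OpenRequest must(String key, String value) {
        mandatory.put(key, value);
        return this;
    }

    /** Resolves the settings that apply to this one open request only. */
    Map<String, String> build() {
        Map<String, String> effective = new HashMap<>();
        for (Map.Entry<String, String> e : optional.entrySet()) {
            if (KNOWN.contains(e.getKey())) {
                effective.put(e.getKey(), e.getValue());
            }
        }
        for (Map.Entry<String, String> e : mandatory.entrySet()) {
            if (!KNOWN.contains(e.getKey())) {
                throw new IllegalArgumentException(
                        "Unsupported mandatory option: " + e.getKey());
            }
            effective.put(e.getKey(), e.getValue());
        }
        return effective;
    }
}
```

With this shape, a short-pread stream would be opened with something like 
new OpenRequest().opt("example.readahead.bytes", "0").build(), while a full 
scan would be opened with no options and keep the default readahead. Nothing 
is set globally, so other applications reading the same store are unaffected.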


If you don't want to do that (yet), and instead put this option in purely for 
HBase, I propose:

* The configuration has to be explicit about what it does. 
"fs.azure.enable.pread" sounds like it is enabling something when really it 
means "disable readahead on positioned read calls". 

* You make sure everything is thread safe. HBase has assumptions there which 
have caused surprises in the past.

* It gets documentation. I propose a new "abfs-performance.md" file in the 
hadoop-azure site; this would be the first section, "optimising abfs for short 
reads through the PositionedReadable API", discussing what, why, how *and when 
not to use*. ABFS lacks anything here, unlike s3a: 
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html.
Your tables will be very relevant there.

* Do you think it should be logged in ABFS FS instantiation? Absolutely during 
init @ debug, but maybe at INFO, so that if people wonder why things are slow 
the cause will become visible. The option should be going into hbase-site, not 
hadoop-site, but I fear people doing the tuning may miss that.






> Support positional read in AbfsInputStream
> ------------------------------------------
>
>                 Key: HADOOP-17038
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17038
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Anoop Sam John
>            Assignee: Anoop Sam John
>            Priority: Major
>              Labels: HBase, abfsactive
>         Attachments: HBase Perf Test Report.xlsx, screenshot-1.png
>
>
> Right now it will do a seek to the position , read and then seek back to the 
> old position.  (As per the impl in the super class)
> In HBase kind of workloads we rely mostly on short preads. (like 64 KB size 
> by default).  So would be ideal to support a pure pos read API which will not 
> even keep the data in a buffer but will only read the required data as what 
> is asked for by the caller. (Not reading ahead more data as per the read size 
> config)
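
The difference between the two read styles described in the issue can be 
sketched with plain JDK file APIs. This is an illustration only: a local file 
stands in for the remote stream, PreadSketch is a hypothetical class, and none 
of it is the AbfsInputStream implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class PreadSketch {
    /**
     * Emulated positional read: seek to the position, read, then seek back.
     * The shared file pointer moves during the call, so concurrent callers
     * would need external synchronization.
     */
    static int seekReadSeekBack(RandomAccessFile file, long position, byte[] buf)
            throws IOException {
        long oldPosition = file.getFilePointer();
        try {
            file.seek(position);
            return file.read(buf, 0, buf.length);
        } finally {
            file.seek(oldPosition);
        }
    }

    /**
     * Pure positional read: reads at the given offset without touching the
     * channel's own position. May return fewer bytes than buf.length.
     */
    static int positionalRead(FileChannel channel, long position, byte[] buf)
            throws IOException {
        return channel.read(ByteBuffer.wrap(buf), position);
    }
}
```

Both methods leave the caller-visible position unchanged afterwards; the 
second never moves it in the first place, which is what makes a pure 
positional read easier to keep thread safe.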



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
