[
https://issues.apache.org/jira/browse/HADOOP-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843852#comment-16843852
]
Steve Loughran commented on HADOOP-16317:
-----------------------------------------
Be happy to talk about this some time, with the lessons of the S3A work. FWIW,
I think we need to revisit some of those assumptions of the S3A connector, and
do so based on more recent trace data from Hive/Spark/Impala queries of both
ORC and Parquet.
* HADOOP-13203 was driven by ORC + Hive data from 2016; Hive optimisations may
have obsoleted those assumptions
* Parquet seems to use different read APIs and I don't have trace data there
* [~stakiar] is looking at Impala perf against stores; again there's a new
* HADOOP-15229 adds an openFile() call where you can pass down config options.
S3A accepts its fs.s3a.experimental.fadvise policy that way -- if you were to add
something similar to ABFS then we could declare a standard option for
cross-store use. And you can provide an async HEAD probe for faster opening.
* HADOOP-11867 looks at a vector read API; there's a dependent ABFS issue. If we
can move ORC and Parquet to that API, then it will line you up for the ability
to make decisions in your connector for how best to do the reads (reorder,
merge, submit as parallel GETs, use HTTP/2, etc).
I'm not doing any work on HADOOP-11867, and I don't know anyone else who is,
though I know people who would like it. If you were willing to go that way
(working up the stack rather than just in the connector, where you deal with
the minimal sequential information coming from the apps today), you'd have an
opportunity to do profound things.
> ABFS: improve random read performance
> -------------------------------------
>
> Key: HADOOP-16317
> URL: https://issues.apache.org/jira/browse/HADOOP-16317
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.2.0
> Reporter: Da Zhou
> Priority: Major
>
> Improving random read performance is an interesting topic. ABFS doesn't
> perform well when reading columnar format files, as the process involves
> many seek operations which make readAhead useless, and if readAhead is
> used unwisely it leads to unnecessary data requests.
> Hence creating this Jira as a reminder to track the investigation and
> progress of the work.