[
https://issues.apache.org/jira/browse/HADOOP-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048500#comment-18048500
]
ASF GitHub Bot commented on HADOOP-19767:
-----------------------------------------
anujmodi2021 opened a new pull request, #8153:
URL: https://github.com/apache/hadoop/pull/8153
### Description of PR
Since the inception of the ABFS driver, there has been a single implementation
of AbfsInputStream. Different kinds of workloads need different heuristics to
perform at their best. For example:
1. Sequential read workloads such as DFSIO and DistCp gain performance from
prefetching.
2. Random read workloads, on the other hand, do not need prefetches; enabling
prefetches for them adds overhead and is TPS heavy.
3. Query workloads involving Parquet/ORC files benefit from improvements such
as footer reads and small-file reads.
To accommodate this, we need to determine the read pattern and create an
input stream implemented for that particular pattern.
[Image: https://github.com/user-attachments/assets/5b7a3db9-ab04-43cf-b44e-5e7a6582205f]
Going forward, more policies and specialized implementations of
AbfsInputStream can be added.
This PR only refactors the way input streams are created; it introduces no
logical change. By default we will continue to use AbfsAdaptiveInputStream,
as we do today, which can cater to all kinds of workloads.
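To illustrate the idea of picking a stream implementation from a read policy, here is a minimal, self-contained sketch. It is not the code in this PR: `InputPolicySketch`, `ReadPolicy`, `StreamSketch`, `parsePolicy`, and `createStream` are hypothetical names, and the per-policy settings (in particular those for the columnar and adaptive cases) are assumptions made only for illustration.

```java
import java.util.Locale;

/** Hypothetical sketch of policy-based stream selection; not the PR's classes. */
public class InputPolicySketch {

    /** Read patterns mentioned in the description, plus an adaptive default. */
    enum ReadPolicy { SEQUENTIAL, RANDOM, COLUMNAR, ADAPTIVE }

    /** Stand-in for a specialized input stream; it only records which heuristics are on. */
    static final class StreamSketch {
        final ReadPolicy policy;
        final boolean prefetch;
        final boolean footerOptimization;

        StreamSketch(ReadPolicy policy, boolean prefetch, boolean footerOptimization) {
            this.policy = policy;
            this.prefetch = prefetch;
            this.footerOptimization = footerOptimization;
        }

        @Override
        public String toString() {
            return policy + " [prefetch=" + prefetch
                + ", footerOptimization=" + footerOptimization + "]";
        }
    }

    /** Maps a configured string to a policy; unknown or empty values fall back to ADAPTIVE. */
    static ReadPolicy parsePolicy(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return ReadPolicy.ADAPTIVE;
        }
        try {
            return ReadPolicy.valueOf(configured.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return ReadPolicy.ADAPTIVE;
        }
    }

    /** Creates the stream variant for a policy. */
    static StreamSketch createStream(ReadPolicy policy) {
        switch (policy) {
            case SEQUENTIAL:
                // Sequential reads benefit from prefetching.
                return new StreamSketch(policy, true, false);
            case RANDOM:
                // Prefetching is pure overhead for random reads.
                return new StreamSketch(policy, false, false);
            case COLUMNAR:
                // Parquet/ORC readers benefit from footer reads.
                // Disabling prefetch here is an assumption of this sketch.
                return new StreamSketch(policy, false, true);
            default:
                // Assumption: the adaptive default keeps every heuristic available.
                return new StreamSketch(ReadPolicy.ADAPTIVE, true, true);
        }
    }

    public static void main(String[] args) {
        for (String configured : new String[] {"sequential", "random", "columnar", ""}) {
            System.out.println(createStream(parsePolicy(configured)));
        }
    }
}
```

Keeping the selection in one factory method means a new policy only needs a new enum constant and a new case, which matches the intent of adding further specialized streams later.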
### How was this patch tested?
New tests were added.
> ABFS: [Read] Introduce Abfs Input Policy for detecting read patterns
> --------------------------------------------------------------------
>
> Key: HADOOP-19767
> URL: https://issues.apache.org/jira/browse/HADOOP-19767
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Affects Versions: 3.4.2
> Reporter: Anuj Modi
> Assignee: Anuj Modi
> Priority: Major
>
> Since the inception of the ABFS driver, there has been a single implementation
> of AbfsInputStream. Different kinds of workloads need different heuristics to
> perform at their best. For example:
> # Sequential read workloads such as DFSIO and DistCp gain performance from
> prefetching.
> # Random read workloads, on the other hand, do not need prefetches; enabling
> prefetches for them adds overhead and is TPS heavy.
> # Query workloads involving Parquet/ORC files benefit from improvements such
> as footer reads and small-file reads.
> To accommodate this, we need to determine the read pattern and create an
> input stream implemented for that particular pattern.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)