[ 
https://issues.apache.org/jira/browse/HADOOP-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050609#comment-18050609
 ] 

ASF GitHub Bot commented on HADOOP-19767:
-----------------------------------------

anujmodi2021 commented on PR #8153:
URL: https://github.com/apache/hadoop/pull/8153#issuecomment-3723523348

   > @anujmodi2021 I am trying to propose a single optimised implementation of 
an input stream across cloud implementations, as I think we all need this kind 
of logic. Ideally I want to get to a place where 80% of the logic is shared in 
a common layer, and then we only implement cloud specific clients to actually 
make the requests separately.
   > 
   > There is some consensus to move the shared logic into the parquet-java 
repo: https://lists.apache.org/thread/nbksq32cs8h1ldj8762y6wh9zzp8gqx6 , and 
some buy-in from the team at Google. I'll be following up on this in the new 
year.
   > 
   > Would be great to get your thoughts and if your team would also like to 
collaborate on this.
   
   Thanks for the heads-up, @ahmarsuhail.
   This sounds like a good plan to me as well. We will keep a close eye on 
updates to this thread and try to contribute in the best way possible to make 
things better.
   
   With this change we are not changing how ABFS handles Parquet files, though. 
This just improves the infra and adds the capability for future improvements to 
be plugged in seamlessly. We will help address any gaps in ABFS to support the 
common layer you are working towards.
   
   




> ABFS: [Read] Introduce Abfs Input Policy for detecting read patterns
> --------------------------------------------------------------------
>
>                 Key: HADOOP-19767
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19767
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/azure
>    Affects Versions: 3.4.2
>            Reporter: Anuj Modi
>            Assignee: Anuj Modi
>            Priority: Major
>              Labels: pull-request-available
>
> Since the inception of the ABFS driver, there has been a single 
> implementation of AbfsInputStream. Different kinds of workloads require 
> different heuristics to give the best performance for that type of workload. 
> For example: 
>  # Sequential read workloads like DFSIO and DistCp gain performance from 
> prefetching 
>  # Random read workloads, on the other hand, do not need prefetches; enabling 
> prefetches for them is an overhead and TPS heavy 
>  # Query workloads involving Parquet/ORC files benefit from improvements like 
> footer reads and small-file reads
> To accommodate this, we need to determine the read pattern and accordingly 
> create input streams implemented for that particular pattern.
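
As a rough illustration of the idea in the description above, the sketch below maps a read-policy string to one of the three workload patterns and then to per-pattern stream settings. Everything here is hypothetical: the class, enum, and method names, the policy strings, and the choice to disable prefetch for columnar reads are assumptions made for the example, not the ABFS driver's actual API or behaviour.

```java
import java.util.Locale;

/** Hypothetical sketch; none of these names come from the ABFS driver. */
public class ReadPolicySketch {

    /** The three workload patterns listed in the issue description. */
    enum ReadPattern { SEQUENTIAL, RANDOM, COLUMNAR }

    /** Per-pattern tuning flags (illustrative only). */
    static final class StreamSettings {
        final boolean prefetchEnabled;
        final boolean footerReadOptimized;

        StreamSettings(boolean prefetchEnabled, boolean footerReadOptimized) {
            this.prefetchEnabled = prefetchEnabled;
            this.footerReadOptimized = footerReadOptimized;
        }
    }

    /** Maps a policy string to a pattern; null or unknown values fall back to SEQUENTIAL. */
    static ReadPattern fromPolicy(String policy) {
        if (policy == null) {
            return ReadPattern.SEQUENTIAL;
        }
        switch (policy.trim().toLowerCase(Locale.ROOT)) {
            case "random":
                return ReadPattern.RANDOM;
            case "parquet":
            case "orc":
            case "columnar":
                return ReadPattern.COLUMNAR;
            default:
                return ReadPattern.SEQUENTIAL;
        }
    }

    /** Chooses settings for a pattern, following the three cases in the description. */
    static StreamSettings settingsFor(ReadPattern pattern) {
        switch (pattern) {
            case RANDOM:
                // Prefetching is pure overhead for random reads.
                return new StreamSettings(false, false);
            case COLUMNAR:
                // Assumption: footer optimization on, prefetch off.
                return new StreamSettings(false, true);
            default:
                // Sequential reads gain from prefetching.
                return new StreamSettings(true, false);
        }
    }

    public static void main(String[] args) {
        for (String p : new String[] {"sequential", "random", "parquet"}) {
            ReadPattern pattern = fromPolicy(p);
            StreamSettings s = settingsFor(pattern);
            System.out.println(p + " -> " + pattern
                + " prefetch=" + s.prefetchEnabled
                + " footer=" + s.footerReadOptimized);
        }
    }
}
```

A real implementation would presumably return a different input stream subclass per pattern rather than a settings object; the flags are used here only to keep the example short.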



