Hi Fang Yong,
Thanks for your valuable comments. Here are some of my thoughts on your
questions:
(1) The distributed cache and the local file cache work in different
locations, and their functions are orthogonal, so I believe the two can be
used together. This proposal therefore focuses mainly on the local cache.
(2) In our design, the scheduler uses a consistent hashing strategy to
assign DataSplits to computing nodes, which enables cache colocation
scheduling.
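
To illustrate point (2), here is a minimal sketch of consistent-hash split
assignment. It is only an illustration under my own assumptions, not the
code from the design doc: the class name ConsistentHashRing, the node names,
the split keys (file paths) and the number of virtual nodes are all
hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it cannot be used for placement that must survive restarts.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical helper: maps split keys to computing nodes."""

    def __init__(self, nodes, virtual_nodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring many times (virtual nodes) so that
        # splits spread more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, split_key: str) -> str:
        # Walk clockwise to the first ring point after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, _hash(split_key))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
# The same split key always resolves to the same node, so repeated scans
# of that split can hit the node-local cache.
assigned_node = ring.node_for("bucket-0/data-0.orc")
```

The property that matters for cache colocation is that when a node leaves,
only the splits that were assigned to that node move to other nodes; all
other splits keep their assignment and therefore their warm local cache.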
I have also reposted the doc on the wiki page:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
Thanks,
Aitozi.
Yong Fang <[email protected]> wrote on Tue, Jul 16, 2024 at 14:37:
> Thanks Aitozi for initiating this discussion. For the data cache, I have
> some questions:
>
> 1. In the design document, the focus is mainly on block cache. In a
> complete cache system, it is usually divided into distributed cache, local
> file cache, block cache, and key-value cache. Compared with block cache,
> would it be more effective to introduce a distributed cache such as
> Alluxio?
>
> 2. For the computing engine: What interfaces should Paimon's cache provide
> so that the computing engine can be aware of which computing nodes cache
> which data, and facilitate the deployment of computing tasks to the
> appropriate computing nodes at the scheduling layer?
>
> Best,
> FangYong
>
> On Tue, Jul 16, 2024 at 10:45 AM Aitozi <[email protected]> wrote:
>
> > Hi devs:
> > I want to initiate a discussion on the ability to support data cache in
> > the Paimon reader, aiming to accelerate the performance of scan operators
> > in analytical scenarios. The detailed design document is as follows [1].
> > Looking forward to your feedback.
> >
> >
> > [1]:
> >
> >
> > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> >
> > Thanks
> > Aitozi.
> >
>