Re: [DISCUSS] Introduce data cache in Paimon reader

Aitozi Tue, 16 Jul 2024 02:59:43 -0700

Hi, wj
    Thanks for your comments.
(1) In an OLAP system, the same query may be executed multiple times, and
different snapshots may share the same data file.
Therefore, caching can help reduce the need to fetch data from remote
storage.
(2) Both CachedSeekableInputStream and BlockCache will be used: the
CachedSeekableInputStream will use BlockCache to find the target block.
(3) A BlockQueue holds the list of available blocks that can be used to
store data.
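
As a rough illustration of points (2) and (3), here is a minimal sketch of a
block cache backed by a queue of free blocks. It is not the PIP-24 code: the
class name BlockCacheSketch, the Loader interface, the string cache key, and
the LRU eviction policy are all assumptions made for this example. Only the
idea that a queue holds the blocks available for storing data comes from the
design.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a block cache with a free-block queue. */
class BlockCacheSketch {
    /** Hypothetical callback that fills dest with file bytes starting at offset. */
    interface Loader {
        void load(long offset, byte[] dest);
    }

    private final int blockSize;
    // Plays the BlockQueue role: pre-allocated blocks that are free to hold data.
    private final ArrayDeque<byte[]> freeBlocks = new ArrayDeque<>();
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> cached =
            new LinkedHashMap<>(16, 0.75f, true);

    BlockCacheSketch(int blockCount, int blockSize) {
        if (blockCount < 1 || blockSize < 1) {
            throw new IllegalArgumentException("need at least one non-empty block");
        }
        this.blockSize = blockSize;
        for (int i = 0; i < blockCount; i++) {
            freeBlocks.add(new byte[blockSize]);
        }
    }

    /** Returns the cached block, loading it through the loader on a miss. */
    byte[] getBlock(String file, long blockIndex, Loader loader) {
        String key = file + "#" + blockIndex;
        byte[] block = cached.get(key);
        if (block != null) {
            return block; // cache hit, no remote read
        }
        block = freeBlocks.poll();
        if (block == null) {
            // No free block left: reuse the least recently used one (assumed policy).
            Iterator<Map.Entry<String, byte[]>> it = cached.entrySet().iterator();
            block = it.next().getValue();
            it.remove();
        }
        loader.load(blockIndex * blockSize, block);
        cached.put(key, block);
        return block;
    }
}
```

In this sketch a stream wrapper in the role of CachedSeekableInputStream would
translate a read position into a block index and call getBlock, so repeated
reads of the same data file are served from memory.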


Thanks,
Aitozi.

On Tue, Jul 16, 2024 at 5:42 PM wj wang <[email protected]> wrote:

> Thanks Aitozi for initiating this discussion.
> I have some questions:
>
> (1) Why is this cache needed in the analysis scenario? When scanning a
> snapshot, why would a data file be read multiple times?
> (2) CachedSeekableInputStream and BlockCache, which implementation do
> you prefer?
> (3) In BlockCache, why introduce a BlockQueue?
>
> Best,
> wangwj
>
> On Tue, Jul 16, 2024 at 3:07 PM Aitozi <[email protected]> wrote:
> >
> > Hi, Fang Yong
> >
> >     Thanks for your valuable comments. Here are some of my thoughts on
> > your questions.
> >
> > (1) The distributed cache and local file cache actually work in different
> > locations, and their functions are orthogonal.
> > Therefore, I believe that these two can be used together, and this
> > proposal mainly focuses on the local cache.
> > (2) In our design, the scheduler utilizes the consistent hash strategy to
> > assign DataSplits to computing nodes,
> > enabling cache colocation scheduling.
> >
> > I have reposted the doc on the wiki page:
> >
> > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> >
> > Thanks,
> > Aitozi.
> >
> > On Tue, Jul 16, 2024 at 2:37 PM Yong Fang <[email protected]> wrote:
> >
> > > Thanks Aitozi for initiating this discussion. For the data cache, I
> > > have some questions:
> > >
> > > 1. In the design document, the focus is mainly on block cache. A
> > > complete cache system is usually divided into distributed cache, local
> > > file cache, block cache, and key-value cache. Compared with block
> > > cache, would it be more effective to introduce a distributed cache
> > > such as Alluxio?
> > >
> > > 2. For the computing engine: What interfaces should Paimon's cache
> > > provide so that the computing engine can be aware of which computing
> > > nodes cache which data, and facilitate the deployment of computing
> > > tasks to the appropriate computing nodes at the scheduling layer?
> > >
> > > Best,
> > > FangYong
> > >
> > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <[email protected]> wrote:
> > >
> > > > Hi devs:
> > > >     I want to initiate a discussion on the ability to support data
> > > > cache in the Paimon reader, aiming to accelerate the performance of
> > > > scan operators in analytical scenarios. The detailed design document
> > > > is as follows [1].
> > > > Looking forward to your feedback.
> > > >
> > > >
> > > > [1]:
> > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > >
> > > > Thanks
> > > > Aitozi.
> > > >
> > >
>
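
The consistent-hash assignment of DataSplits to computing nodes mentioned in
the quoted reply can be sketched roughly as follows. This is only an
illustration under assumptions: the class SplitAssignerSketch, the use of a
data file path as the split key, the CRC32 hash, and the virtual-node count
are hypothetical and do not come from the proposal or from any scheduler.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch of consistent-hash split assignment. */
class SplitAssignerSketch {
    // Hash ring: position on the ring -> node that owns that position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    SplitAssignerSketch(List<String> nodes, int virtualNodes) {
        if (nodes.isEmpty() || virtualNodes < 1) {
            throw new IllegalArgumentException("need at least one node and one virtual node");
        }
        for (String node : nodes) {
            // Several ring positions per node smooth out the distribution.
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** Picks the first node at or after the key's ring position, wrapping around. */
    String nodeFor(String splitKey) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(splitKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // CRC32 is used only to keep the sketch dependency-free.
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Because the same split key always maps to the same node while the node set is
unchanged, repeated queries over the same data file land where its blocks are
already cached. When a node leaves, only the splits it owned move elsewhere.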
