Re: [DISCUSS] Introduce data cache in Paimon reader

Jingsong Li Tue, 23 Jul 2024 03:21:11 -0700

Thanks Aitozi for starting this discussion.

+1 to have a block cache.


I suggest adding to the proposal which parts of the code need to be modified
and what the core API is.

Best,
Jingsong

On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
>
> Hi, wj
>     Thanks for your comments.
> (1) In an OLAP system, the same query may be executed multiple times, and
> different snapshots may share the same data file.
> Therefore, caching can help reduce the need to fetch data from remote
> storage.
> (2) Both CachedSeekableInputStream and BlockCache will be used:
> CachedSeekableInputStream will use BlockCache to look up the target block.
> (3) A BlockQueue holds the list of available blocks that can be used to
> store data.
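A minimal sketch of how points (2) and (3) could fit together, under stated assumptions: a fixed pool of pre-allocated blocks sits in a free queue, and when the queue is empty the least recently used cached block is evicted and its buffer reused. The class name BlockCacheSketch, the BlockLoader callback, and the LRU policy are all hypothetical and are not taken from the design document or from Paimon's code.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch only; not Paimon's actual BlockCache. Not thread-safe. */
class BlockCacheSketch {

    /** Hypothetical callback that fills a block from remote storage. */
    interface BlockLoader {
        void load(String file, long blockIndex, byte[] target);
    }

    // Plays the role of the "BlockQueue": blocks that currently hold no data.
    private final ArrayDeque<byte[]> freeBlocks = new ArrayDeque<>();
    // Access-ordered map, so iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> cached =
            new LinkedHashMap<>(16, 0.75f, true);
    private final BlockLoader loader;
    int remoteReads = 0; // counts loader invocations, for illustration

    BlockCacheSketch(int blockCount, int blockSize, BlockLoader loader) {
        if (blockCount < 1) {
            throw new IllegalArgumentException("blockCount must be >= 1");
        }
        for (int i = 0; i < blockCount; i++) {
            freeBlocks.add(new byte[blockSize]);
        }
        this.loader = loader;
    }

    /** Returns the cached block, loading it on a miss. */
    byte[] getBlock(String file, long blockIndex) {
        String key = file + "#" + blockIndex;
        byte[] block = cached.get(key);
        if (block != null) {
            return block; // cache hit: no remote read
        }
        block = freeBlocks.poll();
        if (block == null) {
            // No free block left: evict the least recently used one and
            // reuse its buffer (callers must not hold on to evicted blocks).
            Iterator<Map.Entry<String, byte[]>> it = cached.entrySet().iterator();
            block = it.next().getValue();
            it.remove();
        }
        loader.load(file, blockIndex, block);
        remoteReads++;
        cached.put(key, block);
        return block;
    }
}
```

In this sketch a stream wrapper such as CachedSeekableInputStream would translate a read position into a block index and call getBlock, touching remote storage only on a miss.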
>
> Thanks,
> Aitozi.
>
> wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
>
> > Thanks Aitozi for initiating this discussion.
> > I have some questions:
> >
> > (1) Why is this cache needed in the analysis scenario? When scanning a
> > snapshot, why would a data file be read multiple times?
> > (2) CachedSeekableInputStream and BlockCache: which implementation do
> > you prefer?
> > (3) In BlockCache, why introduce a BlockQueue?
> >
> > Best,
> > wangwj
> >
> > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> > >
> > > Hi, Fang Yong
> > >
> > >     Thanks for your valuable comments. Here are some of my thoughts on
> > > your questions.
> > >
> > > (1) The distributed cache and local file cache actually work in different
> > > locations, and their functions are orthogonal.
> > > Therefore, I believe that these two can be used together. So this
> > > proposal mainly focuses on the local cache.
> > > (2) In our design, the scheduler utilizes the consistent hash strategy to
> > > assign DataSplits to computing nodes,
> > > enabling cache colocation scheduling.
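A minimal sketch of the consistent hashing idea in point (2), assuming each split is identified by a string key and each computing node by a name. The class ConsistentHashAssigner, the use of virtual nodes, and the CRC32 hash are illustrative assumptions and do not come from the design document.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch only; not the scheduler described in the proposal. */
class ConsistentHashAssigner {

    // Hash ring: position on the ring -> node owning that position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    ConsistentHashAssigner(List<String> nodes, int virtualNodes) {
        if (nodes.isEmpty() || virtualNodes < 1) {
            throw new IllegalArgumentException("need at least one node and one virtual node");
        }
        for (String node : nodes) {
            // Several ring positions per node smooth out the distribution.
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** Picks the first node clockwise from the split's hash, wrapping around. */
    String nodeFor(String splitKey) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(splitKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // CRC32 is an assumption made for brevity; any stable hash would do.
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With this scheme the same split key always maps to the same node while the node set is unchanged, and removing a node only remaps the splits that node owned, so caches on the remaining nodes stay warm.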
> > >
> > > I have reposted the doc on the wiki page:
> > >
> > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > >
> > > Thanks,
> > > Aitozi.
> > >
> > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> > >
> > > > Thanks Aitozi for initiating this discussion. For the data cache, I have
> > > > some questions:
> > > >
> > > > 1. In the design document, the focus is mainly on block cache. A
> > > > complete cache system is usually divided into distributed cache, local
> > > > file cache, block cache, and key-value cache. Compared with block cache,
> > > > would it be more effective to introduce a distributed cache such as
> > > > Alluxio?
> > > >
> > > > 2. For the computing engine: What interfaces should Paimon's cache provide
> > > > so that the computing engine can be aware of which computing nodes cache
> > > > which data, and facilitate the deployment of computing tasks to the
> > > > appropriate computing nodes at the scheduling layer?
> > > >
> > > > Best,
> > > > FangYong
> > > >
> > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> > > >
> > > > > Hi devs:
> > > > >     I want to initiate a discussion on supporting a data cache in
> > > > > the Paimon reader, aiming to accelerate the performance of scan
> > > > > operators in analytical scenarios. The detailed design document is
> > > > > linked below [1]. Looking forward to your feedback.
> > > > >
> > > > > [1]:
> > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > > >
> > > > > Thanks
> > > > > Aitozi.
> > > > >
> > > >
> >
