Hi Jingsong
     I have updated the wiki with the API section. Please review it again.

Thanks,
Aitozi

On Tue, Jul 23, 2024 at 6:20 PM Jingsong Li <[email protected]> wrote:

> Thanks Aitozi for starting this discussion.
>
> +1 to have a block cache.
>
> I suggest you add which parts of the code need to be modified and what the
> core API is.
>
> Best,
> Jingsong
>
> On Tue, Jul 16, 2024 at 5:59 PM Aitozi <[email protected]> wrote:
> >
> > Hi, wj
> >     Thanks for your comments.
> > (1) In an OLAP system, the same query may be executed multiple times, and
> > different snapshots may share the same data file.
> > Therefore, caching can help reduce the need to fetch data from remote
> > storage.
> > (2) Both CachedSeekableInputStream and BlockCache will be used: the
> > CachedSeekableInputStream will use BlockCache to find the target block.
> > (3) A BlockQueue holds the list of available blocks that can be used to
> > store data.
> >
> > Thanks,
> > Aitozi.
> >
> > On Tue, Jul 16, 2024 at 5:42 PM wj wang <[email protected]> wrote:
> >
> > > Thanks Aitozi for initiating this discussion.
> > > I have some questions:
> > >
> > > (1) Why is this cache needed in the analysis scenario? When scanning a
> > > snapshot, why would a data file be read multiple times?
> > > (2) Between CachedSeekableInputStream and BlockCache, which
> > > implementation do you prefer?
> > > (3) In BlockCache, why introduce a BlockQueue?
> > >
> > > Best,
> > > wangwj
> > >
> > > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <[email protected]> wrote:
> > > >
> > > > Hi, Fang Yong
> > > >
> > > > >     Thanks for your valuable comments. Here are some of my thoughts
> > > > > on your questions.
> > > > >
> > > > > (1) The distributed cache and the local file cache work in different
> > > > > locations, and their functions are orthogonal.
> > > > > Therefore, I believe that the two can be used together. This proposal
> > > > > mainly focuses on the local cache.
> > > > > (2) In our design, the scheduler uses a consistent hashing strategy to
> > > > > assign DataSplits to computing nodes,
> > > > > enabling cache colocation scheduling.
> > > >
> > > > > I have reposted the doc on the wiki page:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > > >
> > > > Thanks,
> > > > Aitozi.
> > > >
> > > > > On Tue, Jul 16, 2024 at 2:37 PM Yong Fang <[email protected]> wrote:
> > > >
> > > > > > Thanks Aitozi for initiating this discussion. For the data cache, I
> > > > > > have some questions:
> > > > > >
> > > > > > 1. In the design document, the focus is mainly on block cache. A
> > > > > > complete cache system is usually divided into distributed cache,
> > > > > > local file cache, block cache, and key-value cache. Compared with
> > > > > > block cache, would it be more effective to introduce a distributed
> > > > > > cache such as Alluxio?
> > > > > >
> > > > > > 2. For the computing engine: What interfaces should Paimon's cache
> > > > > > provide so that the computing engine can be aware of which computing
> > > > > > nodes cache which data, and facilitate the deployment of computing
> > > > > > tasks to the appropriate computing nodes at the scheduling layer?
> > > > >
> > > > > Best,
> > > > > FangYong
> > > > >
> > > > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <[email protected]> wrote:
> > > > >
> > > > > > > Hi devs:
> > > > > > >     I want to initiate a discussion on supporting a data cache in
> > > > > > > the Paimon reader, aiming to accelerate scan operators in
> > > > > > > analytical scenarios. The detailed design document is linked
> > > > > > > below [1]. Looking forward to your feedback.
> > > > > > >
> > > > > > > [1]:
> > > > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > > > >
> > > > > > Thanks
> > > > > > Aitozi.
> > > > > >
> > > > >
> > >
>
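
For readers following the thread, the consistent hashing assignment of DataSplits to computing nodes mentioned above could look roughly like the sketch below. It is only an illustration under assumptions and is not taken from the PIP or from Paimon's code: the class name `ConsistentHashRing`, the `node_for` method, the use of a data file path as the split key, and the virtual-node count are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical helper: maps split keys to nodes on a hash ring."""

    def __init__(self, nodes, virtual_nodes=100):
        # Each node gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_nodes)
        )
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, split_key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._hashes, _hash(split_key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    # Assumption: the split is keyed by its data file path, so repeated queries
    # and snapshots sharing the file land on the node that already cached it.
    print(ring.node_for("bucket-0/data-0001.orc"))
```

Because the key is hashed the same way on every query, a given data file keeps going to the same node, which is what makes a node-local cache useful. When a node leaves, only the keys it owned move to other nodes, so the remaining caches stay warm.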
