Hi Jingsong:

1. The scan.blockcache.enabled option decides whether the cache is enabled.
2. A static object (BlockCacheManager) maintains a singleton BlockCache.
3. The cache is currently not applied to the manifest.
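
To make points 1 and 2 concrete, here is a minimal Java sketch of a static manager that hands out one shared cache only when the option is on. Only scan.blockcache.enabled, BlockCacheManager and BlockCache come from this thread; the method names, the map-based option lookup, the default of "false" and the byte[] block store are assumptions for illustration and are not taken from the PR.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the real cache: maps a block key to its bytes.
final class BlockCache {
    private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();

    byte[] get(String blockKey) {
        return blocks.get(blockKey);
    }

    void put(String blockKey, byte[] data) {
        blocks.put(blockKey, data);
    }
}

// Static holder for a single, process-wide BlockCache.
final class BlockCacheManager {
    // Option name taken from the thread; the default below is an assumption.
    static final String ENABLED_KEY = "scan.blockcache.enabled";

    private static volatile BlockCache instance;

    private BlockCacheManager() {}

    // Returns the shared cache, or empty when the option is not set to true.
    static Optional<BlockCache> getCache(Map<String, String> options) {
        boolean enabled =
                Boolean.parseBoolean(options.getOrDefault(ENABLED_KEY, "false"));
        if (!enabled) {
            return Optional.empty();
        }
        BlockCache local = instance;
        if (local == null) {
            // Double-checked locking so the cache is created at most once.
            synchronized (BlockCacheManager.class) {
                local = instance;
                if (local == null) {
                    local = new BlockCache();
                    instance = local;
                }
            }
        }
        return Optional.of(local);
    }
}
```

With this shape, every reader in the same JVM that passes options with the flag enabled gets the same BlockCache, and readers with the flag off bypass the cache entirely.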

I just opened a PoC PR for a closer look:
https://github.com/apache/paimon/pull/3807

Thanks,
Aitozi

Jingsong Li <jingsongl...@gmail.com> wrote on Wed, Jul 24, 2024 at 16:23:

> Hi Aitozi,
>
> Can we clarify the following:
> 1. What is the configuration for enabling the cache?
> 2. Which object is responsible for maintaining the cache: the Table class,
> a static object, or something managed uniformly by the computing engine?
> 3. Can the cache be applied to the manifest?
>
> Best,
> Jingsong
>
> On Wed, Jul 24, 2024 at 10:30 AM Aitozi <gjying1...@gmail.com> wrote:
> >
> > Hi Jingsong
> >      I have updated the wiki with the API section. Please review it again.
> >
> > Thanks,
> > Aitozi
> >
> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Jul 23, 2024 at 18:20:
> >
> > > Thanks Aitozi for starting this discussion.
> > >
> > > +1 to have a block cache.
> > >
> > > I suggest you add which parts of the code need to change and what the
> > > core API is.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
> > > >
> > > > Hi, wj
> > > >     Thanks for your comments.
> > > > (1) In an OLAP system, the same query may be executed multiple times, and
> > > > different snapshots may share the same data file.
> > > > Therefore, caching can help reduce the need to fetch data from remote
> > > > storage.
> > > > (2) Both CachedSeekableInputStream and BlockCache will be used: the
> > > > CachedSeekableInputStream will use BlockCache to find the target block.
> > > > (3) A BlockQueue holds the list of available blocks that can be used to
> > > > store data.
> > > >
> > > > Thanks,
> > > > Aitozi.
> > > >
> > > > wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
> > > >
> > > > > Thanks Aitozi for initiating this discussion.
> > > > > I have some questions:
> > > > >
> > > > > (1) Why is this cache needed in the analysis scenario? When scanning a
> > > > > snapshot, why would a data file be read multiple times?
> > > > > (2) Between CachedSeekableInputStream and BlockCache, which
> > > > > implementation do you prefer?
> > > > > (3) In BlockCache, why introduce a BlockQueue?
> > > > >
> > > > > Best,
> > > > > wangwj
> > > > >
> > > > > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> > > > > >
> > > > > > Hi, Fang Yong
> > > > > >
> > > > > >     Thanks for your valuable comments. Here are some of my thoughts on
> > > > > > your questions.
> > > > > >
> > > > > > (1) The distributed cache and local file cache actually work in different
> > > > > > locations, and their functions are orthogonal.
> > > > > > Therefore, I believe that these two can be used together. This proposal
> > > > > > mainly focuses on the local cache.
> > > > > > (2) In our design, the scheduler uses a consistent hash strategy to
> > > > > > assign DataSplits to computing nodes,
> > > > > > enabling cache colocation scheduling.
> > > > > >
> > > > > > I have reposted the doc on the wiki page:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > > > > >
> > > > > > Thanks,
> > > > > > Aitozi.
> > > > > >
> > > > > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> > > > > >
> > > > > > > Thanks Aitozi for initiating this discussion. For the data cache, I have
> > > > > > > some questions:
> > > > > > >
> > > > > > > 1. In the design document, the focus is mainly on block cache. A
> > > > > > > complete cache system is usually divided into distributed cache, local
> > > > > > > file cache, block cache, and key-value cache. Compared with block cache,
> > > > > > > would it be more effective to introduce a distributed cache such as
> > > > > > > Alluxio?
> > > > > > >
> > > > > > > 2. For the computing engine: what interfaces should Paimon's cache provide
> > > > > > > so that the computing engine can be aware of which computing nodes cache
> > > > > > > which data, and so that the scheduling layer can deploy computing tasks to
> > > > > > > the appropriate computing nodes?
> > > > > > >
> > > > > > > Best,
> > > > > > > FangYong
> > > > > > >
> > > > > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi devs:
> > > > > > > >     I want to initiate a discussion on supporting a data cache in
> > > > > > > > the Paimon reader, aiming to accelerate the performance of scan operators
> > > > > > > > in analytical scenarios. The detailed design document is linked below [1].
> > > > > > > > Looking forward to your feedback.
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Aitozi.
> > > > > > > >
> > > > > > >
> > > > >
> > >
>
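
As a footnote to the consistent hash strategy mentioned earlier in the thread for assigning DataSplits to computing nodes, the idea can be sketched in Java as below. This is only an illustration under assumptions: ConsistentHashAssigner, the use of virtual nodes, the CRC32 hash and keying a split by its data file path are all hypothetical choices, not the scheduler described in the design doc.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical assigner: places nodes on a hash ring and maps each split key
// to the first node at or after the key's position on the ring.
final class ConsistentHashAssigner {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    ConsistentHashAssigner(List<String> nodes, int virtualNodes) {
        for (String node : nodes) {
            // Several ring positions per node spread the load more evenly.
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    // splitKey is assumed to identify the data file behind a split.
    String assign(String splitKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no computing nodes registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(splitKey));
        // Wrap around to the start of the ring when nothing follows the key.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

Because the same key always lands on the same node, repeated queries over a shared data file reach the node whose local cache already holds its blocks. When a node leaves, only the keys that were on that node move; all other assignments stay where they were.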
