Hi, I see the current implementation is using a static field through Format?
Maybe it is better to put the cache in FileIO? Through catalog options?

Best,
Jingsong

On Wed, Jul 31, 2024 at 2:49 PM Aitozi <gjying1...@gmail.com> wrote:
>
> Hi guys:
>     Are there any further comments on this proposal? If not, I would like
> to start a voting thread.
>
> Thanks,
> Aitozi.
>
> Aitozi <gjying1...@gmail.com> wrote on Wed, Jul 24, 2024 at 19:46:
>
> > Hi Jingsong:
> >
> > 1. The scan.blockcache.enabled option decides whether to enable the cache.
> > 2. A static object (BlockCacheManager) maintains a singleton BlockCache.
> > 3. It is currently not used for the manifest.
> >
> > I have just opened a PoC PR for a closer look:
> > https://github.com/apache/paimon/pull/3807
> >
> > Thanks,
> > Aitozi
> >
> > Jingsong Li <jingsongl...@gmail.com> wrote on Wed, Jul 24, 2024 at 16:23:
> >
> >> Hi Aitozi,
> >>
> >> Can we clarify the following:
> >> 1. What is the configuration for enabling the cache?
> >> 2. What object is responsible for maintaining the cache? The Table
> >> class? A static object? Unified management by computing engine objects?
> >> 3. Can the cache be applied to the manifest?
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Wed, Jul 24, 2024 at 10:30 AM Aitozi <gjying1...@gmail.com> wrote:
> >> >
> >> > Hi Jingsong,
> >> >     I have updated the wiki with the API section. Please review it
> >> > again.
> >> >
> >> > Thanks,
> >> > Aitozi
> >> >
> >> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Jul 23, 2024 at 18:20:
> >> >
> >> > > Thanks Aitozi for starting this discussion.
> >> > >
> >> > > +1 to have a block cache.
> >> > >
> >> > > I suggest you add a section on what we need to modify and what the
> >> > > core API is.
> >> > >
> >> > > Best,
> >> > > Jingsong
> >> > >
> >> > > On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
> >> > > >
> >> > > > Hi, wj
> >> > > >     Thanks for your comments.
> >> > > > (1) In an OLAP system, the same query may be executed multiple
> >> > > > times, and different snapshots may share the same data file.
> >> > > > Therefore, caching can help reduce the need to fetch data from
> >> > > > remote storage.
> >> > > > (2) Both CachedSeekableInputStream and BlockCache will be used; the
> >> > > > CachedSeekableInputStream will use the BlockCache to find the
> >> > > > target block.
> >> > > > (3) A BlockQueue holds the list of available blocks that can be
> >> > > > used to store data.
> >> > > >
> >> > > > Thanks,
> >> > > > Aitozi.
> >> > > >
> >> > > > wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
> >> > > >
> >> > > > > Thanks Aitozi for initiating this discussion.
> >> > > > > I have some questions:
> >> > > > >
> >> > > > > (1) Why is this cache needed in the analysis scenario? When
> >> > > > > scanning a snapshot, why would a data file be read multiple
> >> > > > > times?
> >> > > > > (2) Between CachedSeekableInputStream and BlockCache, which
> >> > > > > implementation do you prefer?
> >> > > > > (3) In BlockCache, why introduce a BlockQueue?
> >> > > > >
> >> > > > > Best,
> >> > > > > wangwj
> >> > > > >
> >> > > > > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > Hi, Fang Yong
> >> > > > > >
> >> > > > > > Thanks for your valuable comments. Here are some of my
> >> > > > > > thoughts on your questions:
> >> > > > > >
> >> > > > > > (1) The distributed cache and the local file cache work in
> >> > > > > > different locations, and their functions are orthogonal, so
> >> > > > > > the two can be used together. This proposal mainly focuses on
> >> > > > > > the local cache.
> >> > > > > > (2) In our design, the scheduler uses a consistent-hash
> >> > > > > > strategy to assign DataSplits to computing nodes, enabling
> >> > > > > > cache colocation scheduling.
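[The consistent-hash colocation described in answer (2) above can be sketched roughly as follows. This is an illustrative sketch only; `SplitAssignerSketch`, `nodeFor`, and the node names are assumptions for demonstration, not Paimon's actual scheduler API.]

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of consistent-hash DataSplit-to-node assignment for
// cache colocation. All names here are assumptions, not Paimon code.
public final class SplitAssignerSketch {

    // Hash ring: position -> node. Each node is inserted at several
    // virtual positions to smooth the distribution of splits.
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 16;

    public SplitAssignerSketch(String... nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    private static int hash(String s) {
        // Any stable non-negative hash works; stability is what keeps a
        // split pinned to the same node across queries, so that node's
        // cached blocks are reused.
        return s.hashCode() & 0x7fffffff;
    }

    // A split (identified here by its file path) maps to the first node
    // clockwise on the ring, so the same split always lands on the same
    // node while cluster membership is unchanged.
    public String nodeFor(String splitPath) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(splitPath));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        SplitAssignerSketch assigner =
                new SplitAssignerSketch("node-a", "node-b", "node-c");
        String first = assigner.nodeFor("bucket-0/data-1.orc");
        String second = assigner.nodeFor("bucket-0/data-1.orc");
        System.out.println(first.equals(second)); // same node both times
    }
}
```

[Because the mapping is stable, repeated queries touching the same split land on the node whose block cache is already warm, and only splits near a changed node move when membership changes.]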
> >> > > > > >
> >> > > > > > Reposting the doc on the wiki page:
> >> > > > > >
> >> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> >> > > > > >
> >> > > > > > Thanks,
> >> > > > > > Aitozi.
> >> > > > > >
> >> > > > > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> >> > > > > >
> >> > > > > > > Thanks Aitozi for initiating this discussion. For the data
> >> > > > > > > cache, I have some questions:
> >> > > > > > >
> >> > > > > > > 1. The design document focuses mainly on the block cache. A
> >> > > > > > > complete cache system is usually divided into a distributed
> >> > > > > > > cache, a local file cache, a block cache, and a key-value
> >> > > > > > > cache. Compared with a block cache, would it be more
> >> > > > > > > effective to introduce a distributed cache such as Alluxio?
> >> > > > > > >
> >> > > > > > > 2. For the computing engine: what interfaces should Paimon's
> >> > > > > > > cache provide so that the computing engine can be aware of
> >> > > > > > > which computing nodes cache which data, and the scheduling
> >> > > > > > > layer can deploy computing tasks to the appropriate
> >> > > > > > > computing nodes?
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > FangYong
> >> > > > > > >
> >> > > > > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Hi devs:
> >> > > > > > > >     I want to initiate a discussion on the ability to
> >> > > > > > > > support data cache in the Paimon reader, aiming to
> >> > > > > > > > accelerate the performance of scan operators in analytical
> >> > > > > > > > scenarios. The detailed design document is as follows [1].
> >> > > > > > > > Looking forward to your feedback.
> >> > > > > > > >
> >> > > > > > > > [1]:
> >> > > > > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Aitozi.
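[Taken together, the design discussed in this thread (a `scan.blockcache.enabled` switch, a static manager holding a singleton block cache, and reads served block-by-block) might look roughly like the following minimal sketch. All class and method names, the block size, and the LRU policy are illustrative assumptions, not Paimon's actual classes.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a process-wide singleton block cache, assuming an LRU
// policy keyed by (file path, block index). Names are illustrative only.
public final class BlockCacheSketch {

    static final int BLOCK_SIZE = 4 * 1024; // example block size, not a Paimon default

    // LRU cache of blocks, bounded by a maximum number of cached blocks.
    public static final class BlockCache extends LinkedHashMap<String, byte[]> {
        private final int maxBlocks;

        public BlockCache(int maxBlocks) {
            super(16, 0.75f, true); // access-order iteration gives LRU eviction
            this.maxBlocks = maxBlocks;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxBlocks; // evict the least-recently-used block
        }
    }

    // Mirrors the "static object (BlockCacheManager) maintains a singleton
    // BlockCache" answer from the thread.
    public static final class BlockCacheManager {
        private static final BlockCache INSTANCE = new BlockCache(1024);

        public static BlockCache get() {
            return INSTANCE;
        }
    }

    public static String blockKey(String path, long blockIndex) {
        return path + "#" + blockIndex;
    }

    public static void main(String[] args) {
        BlockCache cache = BlockCacheManager.get();
        cache.put(blockKey("s3://bucket/data-1.orc", 0), new byte[BLOCK_SIZE]);
        // A later read of the same block is served from the cache instead
        // of being fetched again from remote storage.
        System.out.println(cache.containsKey(blockKey("s3://bucket/data-1.orc", 0)));
    }
}
```

[In this sketch, a stream in the spirit of the thread's CachedSeekableInputStream would look up `blockKey(path, position / BLOCK_SIZE)` on each read and fall back to the remote stream on a miss.]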