Re: [DISCUSS] Introduce data cache in Paimon reader

Aitozi Thu, 01 Aug 2024 07:35:07 -0700

Hi,

Yes, currently it's enabled by a config that is read through the format.
Wrapping this cache in the FileIO looks more natural. I will follow this
suggestion.
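
To make the FileIO idea concrete, here is a minimal sketch, not the PoC code.
BlockReader and CachingBlockReader are hypothetical names standing in for the
read path of a FileIO; they are not Paimon's actual interfaces. The block
granularity and the LRU eviction policy are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the read path of a FileIO (not Paimon's real interface). */
interface BlockReader {
    byte[] readBlock(String path, long blockIndex);
}

/** Decorator that serves repeated block reads from a bounded in-memory LRU cache. */
final class CachingBlockReader implements BlockReader {
    private final BlockReader delegate;
    private final Map<String, byte[]> cache;
    private long misses = 0;

    CachingBlockReader(BlockReader delegate, int maxBlocks) {
        this.delegate = delegate;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict the least recently used block once the capacity is exceeded.
                return size() > maxBlocks;
            }
        };
    }

    @Override
    public synchronized byte[] readBlock(String path, long blockIndex) {
        String key = path + "#" + blockIndex;
        byte[] block = cache.get(key);
        if (block == null) {
            // Cache miss: fetch from the wrapped (for example, remote) reader.
            misses++;
            block = delegate.readBlock(path, blockIndex);
            cache.put(key, block);
        }
        return block;
    }

    synchronized long misses() {
        return misses;
    }
}
```

With this shape the format readers stay unaware of the cache: whoever builds
the reader decides whether to wrap it.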


BTW, this cache should only be enabled for analytical reads, such as a
session's scan node, so the cache config usually *should be a dynamic session
variable, not a default table option*. Otherwise, it may affect the scan
performed by compaction.
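
A rough sketch of the point above, under assumptions: the key
scan.blockcache.enabled is the one mentioned in this thread, while the class
BlockCacheOptions, its method names, and the default of "false" are
hypothetical and only illustrate how a session-only switch stays invisible to
compaction.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of Paimon. */
final class BlockCacheOptions {
    // Key taken from this thread; the "false" default is an assumption.
    static final String KEY = "scan.blockcache.enabled";

    /** Query scans: session-level dynamic options override the table options. */
    static boolean enabledForQueryScan(
            Map<String, String> tableOptions, Map<String, String> sessionOptions) {
        Map<String, String> effective = new HashMap<>(tableOptions);
        effective.putAll(sessionOptions);
        return Boolean.parseBoolean(effective.getOrDefault(KEY, "false"));
    }

    /** Compaction scans: only table options are consulted, session options are ignored. */
    static boolean enabledForCompactionScan(Map<String, String> tableOptions) {
        return Boolean.parseBoolean(tableOptions.getOrDefault(KEY, "false"));
    }
}
```

If the switch is set only in the session, the query scan sees it and the
compaction scan does not. If it were a default table option, both would.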

Best,
Aitozi.

Jingsong Li <jingsongl...@gmail.com> wrote on Wed, Jul 31, 2024 at 17:49:

> Hi,
>
> I see the current implementation is using a static field through Format?
>
> Maybe it is better to put the cache in FileIO and configure it through
> catalog options?
>
> Best,
> Jingsong
>
> On Wed, Jul 31, 2024 at 2:49 PM Aitozi <gjying1...@gmail.com> wrote:
> >
> > Hi guys:
> >     Are there any further comments on this proposal? If not, I would like
> > to start a voting thread.
> >
> > Thanks,
> > Aitozi.
> >
> > Aitozi <gjying1...@gmail.com> wrote on Wed, Jul 24, 2024 at 19:46:
> >
> > > Hi Jingsong:
> > >
> > > 1. scan.blockcache.enabled decides whether the cache is enabled.
> > > 2. The static object (BlockCacheManager) maintains a singleton
> > > BlockCache.
> > > 3. Currently it is not applied to the manifest.
> > >
> > > I just opened a PoC PR for a closer look:
> > > https://github.com/apache/paimon/pull/3807
> > >
> > > Thanks,
> > > Aitozi
> > >
> > > Jingsong Li <jingsongl...@gmail.com> wrote on Wed, Jul 24, 2024 at 16:23:
> > >
> > >> Hi Aitozi,
> > >>
> > >> Can we clarify the following:
> > >> 1. What is the configuration for enabling cache?
> > >> 2. What object is responsible for maintaining Cache? Table class?
> > >> Static object? Unified management of computing engine objects?
> > >> 3. Can Cache be applied to the manifest?
> > >>
> > >> Best,
> > >> Jingsong
> > >>
> > >> On Wed, Jul 24, 2024 at 10:30 AM Aitozi <gjying1...@gmail.com> wrote:
> > >> >
> > >> > Hi Jingsong
> > >> >      I have updated the wiki with the API section. Please review it
> > >> > again.
> > >> >
> > >> > Thanks,
> > >> > Aitozi
> > >> >
> > >> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Jul 23, 2024 at 18:20:
> > >> >
> > >> > > Thanks Aitozi for starting this discussion.
> > >> > >
> > >> > > +1 to have a block cache.
> > >> > >
> > >> > > I suggest you add which parts we need to modify and what the core
> > >> > > API is.
> > >> > >
> > >> > > Best,
> > >> > > Jingsong
> > >> > >
> > >> > > On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
> > >> > > >
> > >> > > > Hi, wj
> > >> > > >     Thanks for your comments.
> > >> > > > (1) In an OLAP system, the same query may be executed multiple
> > >> > > > times, and different snapshots may share the same data file.
> > >> > > > Therefore, caching reduces the need to fetch data from remote
> > >> > > > storage.
> > >> > > > (2) Both CachedSeekableInputStream and BlockCache will be used:
> > >> > > > the CachedSeekableInputStream will use the BlockCache to find the
> > >> > > > target block.
> > >> > > > (3) A BlockQueue holds the list of available blocks that can be
> > >> > > > used to store data.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Aitozi.
> > >> > > >
> > >> > > > wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
> > >> > > >
> > >> > > > > Thanks Aitozi for initiating this discussion.
> > >> > > > > I have some questions:
> > >> > > > >
> > >> > > > > (1) Why is this cache needed in the analysis scenario? When
> > >> > > > > scanning a snapshot, why would a data file be read multiple times?
> > >> > > > > (2) CachedSeekableInputStream and BlockCache: which
> > >> > > > > implementation do you prefer?
> > >> > > > > (3) In BlockCache, why introduce a BlockQueue?
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > wangwj
> > >> > > > >
> > >> > > > > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > Hi, Fang Yong
> > >> > > > > >
> > >> > > > > >     Thanks for your valuable comments. Here are some of my
> > >> > > > > > thoughts on your questions.
> > >> > > > > >
> > >> > > > > > (1) The distributed cache and the local file cache work in
> > >> > > > > > different locations, and their functions are orthogonal.
> > >> > > > > > Therefore, I believe the two can be used together. This
> > >> > > > > > proposal mainly focuses on the local cache.
> > >> > > > > > (2) In our design, the scheduler uses a consistent hash
> > >> > > > > > strategy to assign DataSplits to computing nodes, enabling
> > >> > > > > > cache colocation scheduling.
> > >> > > > > >
> > >> > > > > > I reposted the doc on the wiki page:
> > >> > > > > >
> > >> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Aitozi.
> > >> > > > > >
> > >> > > > > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> > >> > > > > >
> > >> > > > > > > Thanks Aitozi for initiating this discussion. For the data
> > >> > > > > > > cache, I have some questions:
> > >> > > > > > >
> > >> > > > > > > 1. In the design document, the focus is mainly on block
> > >> > > > > > > cache. In a complete cache system, it is usually divided into
> > >> > > > > > > distributed cache, local file cache, block cache, and
> > >> > > > > > > key-value cache. Compared with block cache, would it be more
> > >> > > > > > > effective to introduce a distributed cache such as Alluxio?
> > >> > > > > > >
> > >> > > > > > > 2. For the computing engine: What interfaces should Paimon's
> > >> > > > > > > cache provide so that the computing engine can be aware of
> > >> > > > > > > which computing nodes cache which data, and facilitate the
> > >> > > > > > > deployment of computing tasks to the appropriate computing
> > >> > > > > > > nodes at the scheduling layer?
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > FangYong
> > >> > > > > > >
> > >> > > > > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi devs:
> > >> > > > > > > >     I want to initiate a discussion on supporting a data
> > >> > > > > > > > cache in the Paimon reader, aiming to accelerate scan
> > >> > > > > > > > operators in analytical scenarios. The detailed design
> > >> > > > > > > > document is linked at [1].
> > >> > > > > > > > Looking forward to your feedback.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > [1]:
> > >> > > > > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > >> > > > > > > >
> > >> > > > > > > > Thanks
> > >> > > > > > > > Aitozi.
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > >
> > >> > >
> > >>
> > >
>
