Hi Jingsong:

1. The scan.blockcache.enabled option decides whether the cache is enabled.
2. A static object (BlockCacheManager) maintains a singleton BlockCache.
3. The cache is currently not applied to the manifest.
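
To make points 1 and 2 concrete, here is a rough sketch of how the option and the static manager could fit together. Only the names scan.blockcache.enabled, BlockCacheManager, and BlockCache come from this thread; the class shapes, the cacheFor helper, the "file@offset" key format, and the default of "false" are assumptions for illustration and may not match the PoC PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BlockCacheSketch {

    /** Hypothetical stand-in for BlockCache: maps a "file@offset" key to block bytes. */
    static final class BlockCache {
        private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();

        byte[] get(String key) {
            return blocks.get(key);
        }

        void put(String key, byte[] block) {
            blocks.put(key, block);
        }
    }

    /** Static holder: at most one BlockCache per JVM, created lazily on first use. */
    static final class BlockCacheManager {
        private static volatile BlockCache instance;

        private BlockCacheManager() {}

        static BlockCache getOrCreate() {
            BlockCache local = instance;
            if (local == null) {
                synchronized (BlockCacheManager.class) {
                    local = instance;
                    if (local == null) {
                        local = new BlockCache();
                        instance = local;
                    }
                }
            }
            return local;
        }
    }

    /**
     * Hypothetical helper: returns the shared cache when the option is "true",
     * otherwise null. The default of "false" is an assumption.
     */
    static BlockCache cacheFor(Map<String, String> options) {
        boolean enabled =
                Boolean.parseBoolean(options.getOrDefault("scan.blockcache.enabled", "false"));
        return enabled ? BlockCacheManager.getOrCreate() : null;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("scan.blockcache.enabled", "true");

        // Two readers configured the same way share one cache instance.
        BlockCache first = cacheFor(options);
        BlockCache second = cacheFor(options);
        first.put("data-file-1@0", new byte[] {1, 2, 3});
        System.out.println("shared instance: " + (first == second));
        System.out.println("cached block length: " + second.get("data-file-1@0").length);
    }
}
```

Because the cache lives in a static holder rather than in a Table object, every reader in the same process that enables the option sees the same blocks, which is what lets repeated queries reuse data already fetched from remote storage.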

I just opened a PoC PR for a closer look:
https://github.com/apache/paimon/pull/3807

Thanks,
Aitozi

Jingsong Li <jingsongl...@gmail.com> wrote on Wed, Jul 24, 2024 at 16:23:

> Hi Aitozi,
>
> Can we clarify the following:
> 1. What is the configuration for enabling the cache?
> 2. What object is responsible for maintaining the cache? The Table class?
>    A static object? Unified management by computing engine objects?
> 3. Can the cache be applied to the manifest?
>
> Best,
> Jingsong
>
> On Wed, Jul 24, 2024 at 10:30 AM Aitozi <gjying1...@gmail.com> wrote:
> >
> > Hi Jingsong,
> > I have updated the wiki with the API section. Please review it again.
> >
> > Thanks,
> > Aitozi
> >
> > Jingsong Li <jingsongl...@gmail.com> wrote on Tue, Jul 23, 2024 at 18:20:
> >
> > > Thanks Aitozi for starting this discussion.
> > >
> > > +1 to have a block cache.
> > >
> > > I suggest you add where we need to modify and what the core API is.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
> > > >
> > > > Hi, wj
> > > > Thanks for your comments.
> > > > (1) In an OLAP system, the same query may be executed multiple times,
> > > > and different snapshots may share the same data file. Therefore,
> > > > caching can help reduce the need to fetch data from remote storage.
> > > > (2) Both CachedSeekableInputStream and BlockCache will be used; the
> > > > CachedSeekableInputStream will use BlockCache to find the target block.
> > > > (3) A BlockQueue holds the list of available blocks that can be used
> > > > to store data.
> > > >
> > > > Thanks,
> > > > Aitozi.
> > > >
> > > > wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
> > > >
> > > > > Thanks Aitozi for initiating this discussion.
> > > > > I have some questions:
> > > > >
> > > > > (1) Why is this cache needed in the analysis scenario? When scanning
> > > > > a snapshot, why would a data file be read multiple times?
> > > > > (2) CachedSeekableInputStream and BlockCache: which implementation
> > > > > do you prefer?
> > > > > (3) In BlockCache, why introduce a BlockQueue?
> > > > >
> > > > > Best,
> > > > > wangwj
> > > > >
> > > > > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> > > > > >
> > > > > > Hi, Fang Yong
> > > > > >
> > > > > > Thanks for your valuable comments. Here are some of my thoughts
> > > > > > on your questions.
> > > > > >
> > > > > > (1) The distributed cache and the local file cache actually work
> > > > > > in different locations, and their functions are orthogonal.
> > > > > > Therefore, I believe that these two can be used together, so this
> > > > > > proposal mainly focuses on the local cache.
> > > > > > (2) In our design, the scheduler uses a consistent hash strategy
> > > > > > to assign DataSplits to computing nodes, enabling cache colocation
> > > > > > scheduling.
> > > > > >
> > > > > > I have reposted the doc on the wiki page:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > > > > >
> > > > > > Thanks,
> > > > > > Aitozi.
> > > > > >
> > > > > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> > > > > >
> > > > > > > Thanks Aitozi for initiating this discussion. For the data
> > > > > > > cache, I have some questions:
> > > > > > >
> > > > > > > 1. In the design document, the focus is mainly on the block
> > > > > > > cache. A complete cache system is usually divided into a
> > > > > > > distributed cache, local file cache, block cache, and key-value
> > > > > > > cache. Compared with a block cache, would it be more effective
> > > > > > > to introduce a distributed cache such as Alluxio?
> > > > > > >
> > > > > > > 2. For the computing engine: what interfaces should Paimon's
> > > > > > > cache provide so that the computing engine can be aware of which
> > > > > > > computing nodes cache which data, and so that the scheduling
> > > > > > > layer can deploy computing tasks to the appropriate computing
> > > > > > > nodes?
> > > > > > >
> > > > > > > Best,
> > > > > > > FangYong
> > > > > > >
> > > > > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi devs:
> > > > > > > > I want to initiate a discussion on supporting a data cache
> > > > > > > > in the Paimon reader, aiming to accelerate the performance of
> > > > > > > > scan operators in analytical scenarios. The detailed design
> > > > > > > > document is linked below [1]. Looking forward to your feedback.
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Aitozi.