Thanks Aitozi for starting this discussion. +1 to have a block cache.
I suggest adding which parts need to be modified and what the core API is.

Best,
Jingsong

On Tue, Jul 16, 2024 at 5:59 PM Aitozi <gjying1...@gmail.com> wrote:
>
> Hi, wj
>
> Thanks for your comments.
>
> (1) In an OLAP system, the same query may be executed multiple times, and
> different snapshots may share the same data file.
> Therefore, caching can help reduce the need to fetch data from remote
> storage.
> (2) Both CachedSeekableInputStream and BlockCache will be used;
> CachedSeekableInputStream uses BlockCache to find the target block.
> (3) A BlockQueue holds the list of available blocks that can be used to
> store data.
>
> Thanks,
> Aitozi.
>
> wj wang <hongli....@gmail.com> wrote on Tue, Jul 16, 2024 at 17:42:
>
> > Thanks Aitozi for initiating this discussion.
> > I have some questions:
> >
> > (1) Why is this cache needed in the analysis scenario? When scanning a
> > snapshot, why would a data file be read multiple times?
> > (2) Between CachedSeekableInputStream and BlockCache, which
> > implementation do you prefer?
> > (3) In BlockCache, why introduce a BlockQueue?
> >
> > Best,
> > wangwj
> >
> > On Tue, Jul 16, 2024 at 3:07 PM Aitozi <gjying1...@gmail.com> wrote:
> > >
> > > Hi, Fang Yong
> > >
> > > Thanks for your valuable comments. Here are some of my thoughts on
> > > your questions.
> > >
> > > (1) The distributed cache and the local file cache actually work in
> > > different locations, and their functions are orthogonal.
> > > Therefore, I believe that the two can be used together. This proposal
> > > mainly focuses on the local cache.
> > > (2) In our design, the scheduler uses a consistent hash strategy to
> > > assign DataSplits to computing nodes,
> > > enabling cache colocation scheduling.
> > >
> > > I have reposted the doc on the wiki page:
> > >
> > > https://cwiki.apache.org/confluence/display/PAIMON/PIP-24+Introduce+data+cache+in+paimon+reader
> > >
> > > Thanks,
> > > Aitozi.
> > >
> > > Yong Fang <zjur...@gmail.com> wrote on Tue, Jul 16, 2024 at 14:37:
> > >
> > > > Thanks Aitozi for initiating this discussion. For the data cache, I
> > > > have some questions:
> > > >
> > > > 1. In the design document, the focus is mainly on block cache. A
> > > > complete cache system is usually divided into distributed cache,
> > > > local file cache, block cache, and key-value cache. Compared with
> > > > block cache, would it be more effective to introduce a distributed
> > > > cache such as Alluxio?
> > > >
> > > > 2. For the computing engine: What interfaces should Paimon's cache
> > > > provide so that the computing engine can be aware of which computing
> > > > nodes cache which data, and facilitate the deployment of computing
> > > > tasks to the appropriate computing nodes at the scheduling layer?
> > > >
> > > > Best,
> > > > FangYong
> > > >
> > > > On Tue, Jul 16, 2024 at 10:45 AM Aitozi <gjying1...@gmail.com> wrote:
> > > >
> > > > > Hi devs:
> > > > > I want to initiate a discussion on the ability to support data
> > > > > cache in the Paimon reader, aiming to accelerate the performance
> > > > > of scan operators in analytical scenarios. The detailed design
> > > > > document is as follows [1].
> > > > > Looking forward to your feedback.
> > > > >
> > > > > [1]:
> > > > > https://docs.google.com/document/d/1-zzDpxcubukMR-21n66OPv2ViKEFeEJ_Mivc-wW4gLM/edit?usp=sharing
> > > > >
> > > > > Thanks
> > > > > Aitozi.
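
For readers of this thread: the consistent hash assignment of DataSplits to computing nodes mentioned above could look roughly like the sketch below. This is only an illustration under assumptions, not the design from PIP-24 or any Paimon API. The class name SplitRing, the node names, the use of a data file name as the split key, and the choice of MD5 with virtual nodes are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent-hash ring; NOT part of Paimon's API. */
public class SplitRing {
    // Ring position -> owning node. Each node appears at several positions.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public SplitRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the node on the ring at virtualNodes positions. */
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Removes only the positions owned by this node. */
    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    /** Returns the first node at or after the key's position, wrapping around. */
    public String nodeFor(String splitKey) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(splitKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest, interpreted as a signed 64-bit position.
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a ring like this, the same split key always maps to the same node while membership is unchanged, and removing a node only moves the keys that node owned. That stability is what would let a node's local cache stay warm across repeated queries.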