Thanks for updating the FLIP, Qingsheng. A few more comments:

1. I am still not sure what the use case for cacheMissingKey() is.
More specifically, when would users want getCache() to return a
non-empty value while cacheMissingKey() returns false?

2. The builder pattern. Usually the builder pattern is used when there are
a lot of constructor variations. For example, if a class has three
variables and all of them are optional, there could potentially be many
combinations of the variables. But I don't see such a case in this FLIP.
What is the reason we have builders for all the classes?
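As a concrete illustration (a hypothetical CacheConfig, not a class from this FLIP), the case that usually justifies a builder looks like this: several independently optional fields, each with a default:

```java
// Purely illustrative, not from the FLIP: a builder pays off when a class has
// several independently optional fields, because the alternative is an
// overloaded constructor or factory for every combination of them.
final class CacheConfig {
    private final long maxRows;            // optional, defaults to 10_000
    private final long ttlMillis;          // optional, defaults to 60_000
    private final boolean cacheMissingKey; // optional, defaults to true

    private CacheConfig(Builder b) {
        this.maxRows = b.maxRows;
        this.ttlMillis = b.ttlMillis;
        this.cacheMissingKey = b.cacheMissingKey;
    }

    static Builder builder() {
        return new Builder();
    }

    long maxRows() { return maxRows; }
    long ttlMillis() { return ttlMillis; }
    boolean cacheMissingKey() { return cacheMissingKey; }

    static final class Builder {
        private long maxRows = 10_000;
        private long ttlMillis = 60_000;
        private boolean cacheMissingKey = true;

        Builder maxRows(long v) { this.maxRows = v; return this; }
        Builder ttlMillis(long v) { this.ttlMillis = v; return this; }
        Builder cacheMissingKey(boolean v) { this.cacheMissingKey = v; return this; }

        CacheConfig build() { return new CacheConfig(this); }
    }
}
```

By contrast, a class with only one or two required fields and no optional ones is served just as well by a constructor or a single static of(...) factory, which is exactly the question being raised here.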

3. Should the caching strategy be excluded from the top level provider API?
Technically speaking, the Flink framework should only have two interfaces
to deal with:
    A) LookupFunction
    B) AsyncLookupFunction
Orthogonally, we *believe* there are two different strategies people can
use for caching. Note that the Flink framework does not care what the
caching strategy is here.
    a) partial caching
    b) full caching

Putting them together, we end up with 3 combinations that we think are
valid:
     Aa) PartialCachingLookupFunctionProvider
     Ba) PartialCachingAsyncLookupFunctionProvider
     Ab) FullCachingLookupFunctionProvider

However, the caching strategy could actually be quite flexible, e.g. an
initial full cache load followed by some partial updates. Also, I am not
100% sure whether full caching will always use ScanTableSource. Including
the caching strategy in the top-level provider API would make it harder to
extend.

One possible solution is to just have *LookupFunctionProvider* and
*AsyncLookupFunctionProvider* as the top-level API, both with a
getCacheStrategy() method returning an optional CacheStrategy. The
CacheStrategy class would have the following methods:
1. void open(Context): the context exposes some of the resources that may
be useful for the caching strategy, e.g. an ExecutorService that is
synchronized with the data processing, or a cache refresh trigger which
blocks data processing and refreshes the cache.
2. void initializeCache(): a blocking method that allows users to
pre-populate the cache before processing any data if they wish.
3. void maybeCache(RowData key, Collection<RowData> value): a blocking or
non-blocking method.
4. void refreshCache(): a blocking / non-blocking method that is invoked by
the Flink framework when the cache refresh trigger is pulled.

In the above design, partial caching and full caching would be
implementations of the CacheStrategy. And it is OK for users to implement
their own CacheStrategy if they want to.
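To make the proposal above concrete, here is a rough sketch of such a CacheStrategy. The method names follow the list above, but the exact signatures, the contents of Context, and the RowData stand-in are my assumptions rather than a finalized API:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;

// Stand-in for org.apache.flink.table.data.RowData so the sketch compiles
// without Flink on the classpath.
interface RowData {}

interface CacheStrategy {

    /** Resources the framework would expose; reduced to the two mentioned above. */
    interface Context {
        /** An executor synchronized with the data processing. */
        ScheduledExecutorService executorService();

        /** Blocks data processing and refreshes the cache when run. */
        Runnable cacheRefreshTrigger();
    }

    /** 1. Called once before any data is processed. */
    default void open(Context context) {}

    /** 2. Blocking; lets the strategy pre-populate the cache if it wishes. */
    default void initializeCache() {}

    /** 3. Blocking or non-blocking; offers a looked-up entry to the cache. */
    void maybeCache(RowData key, Collection<RowData> value);

    /** 4. Invoked by the framework when the cache refresh trigger is pulled. */
    default void refreshCache() {}
}

// Partial caching as one possible implementation of the strategy.
class MapBackedPartialStrategy implements CacheStrategy {
    final Map<RowData, Collection<RowData>> cache = new HashMap<>();

    @Override
    public void maybeCache(RowData key, Collection<RowData> value) {
        cache.putIfAbsent(key, value);
    }

    @Override
    public void refreshCache() {
        cache.clear(); // a partial cache can simply drop entries on refresh
    }
}
```

Full caching would then be another implementation whose initializeCache() performs the initial bulk load and whose refreshCache() reloads everything, and users could plug in hybrids such as a full load followed by partial updates.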

Thanks,

Jiangjie (Becket) Qin


On Thu, Jun 2, 2022 at 12:14 PM Jark Wu <imj...@gmail.com> wrote:

> Thank Qingsheng for the detailed summary and updates,
>
> The changes look good to me in general. I just have one minor improvement
> comment.
> Could we add a static util method to the "FullCachingReloadTrigger"
> interface for quick usage?
>
> #periodicReloadAtFixedRate(Duration)
> #periodicReloadWithFixedDelay(Duration)
>
> I think we can also do this for LookupCache, because users may not know
> where the default implementations are or how to use them.
>
> Best,
> Jark
>
>
>
>
>
>
> On Wed, 1 Jun 2022 at 18:32, Qingsheng Ren <renqs...@gmail.com> wrote:
>
> > Hi Jingsong,
> >
> > Thanks for your comments!
> >
> > > AllCache definition is not flexible. For example, PartialCache can use
> > > any custom storage while the AllCache cannot; the AllCache could also
> > > store to memory or disk, which also needs a flexible strategy.
> >
> > We had an offline discussion with Jark and Leonard. Basically we think
> > exposing the interface of full cache storage to connector developers
> might
> > limit our future optimizations. The storage of full caching shouldn’t
> have
> > too many variations for different lookup tables so making it pluggable
> > might not help a lot. Also I think it is not quite easy for connector
> > developers to implement such an optimized storage. We can keep optimizing
> > this storage in the future and all full caching lookup tables would
> benefit
> > from this.
> >
> > > We are more inclined to deprecate the connector `async` option when
> > discussing FLIP-234. Can we remove this option from this FLIP?
> >
> > Thanks for the reminder! This option has been removed in the latest
> > version.
> >
> > Best regards,
> >
> > Qingsheng
> >
> >
> > > On Jun 1, 2022, at 15:28, Jingsong Li <jingsongl...@gmail.com> wrote:
> > >
> > > Thanks Alexander for your reply. We can discuss the new interface when
> it
> > > comes out.
> > >
> > > We are more inclined to deprecate the connector `async` option when
> > > discussing FLIP-234 [1]. We should use hint to let planner decide.
> > > Although the discussion has not yet produced a conclusion, can we
> remove
> > > this option from this FLIP? It doesn't seem to be related to this FLIP,
> > but
> > > more to FLIP-234, and we can form a conclusion over there.
> > >
> > > [1] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Wed, Jun 1, 2022 at 4:59 AM Jing Ge <j...@ververica.com> wrote:
> > >
> > >> Hi Jark,
> > >>
> > >> Thanks for clarifying it. It would be fine as long as we could provide
> > >> the
> > >> no-cache solution. I was just wondering if the client side cache could
> > >> really help when HBase is used, since the data to look up should be
> > huge.
> > >> Depending on how much data will be cached on the client side, the data
> > >> that should be LRU in e.g. LruBlockCache will not be LRU anymore. In the
> > worst
> > >> case scenario, once the cached data at client side is expired, the
> > request
> > >> will hit disk which will cause extra latency temporarily, if I am not
> > >> mistaken.
> > >>
> > >> Best regards,
> > >> Jing
> > >>
> > >> On Mon, May 30, 2022 at 9:59 AM Jark Wu <imj...@gmail.com> wrote:
> > >>
> > >>> Hi Jing Ge,
> > >>>
> > >>> What do you mean about the "impact on the block cache used by HBase"?
> > >>> In my understanding, the connector cache and HBase cache are totally
> > two
> > >>> things.
> > >>> The connector cache is a local/client cache, and the HBase cache is a
> > >>> server cache.
> > >>>
> > >>>> does it make sense to have a no-cache solution as one of the
> > >>> default solutions so that customers will have no effort for the
> > migration
> > >>> if they want to stick with Hbase cache
> > >>>
> > >>> The implementation migration should be transparent to users. Take the
> > >> HBase
> > >>> connector as
> > >>> an example,  it already supports lookup cache but is disabled by
> > default.
> > >>> After migration, the
> > >>> connector still disables cache by default (i.e. no-cache solution).
> No
> > >>> migration effort for users.
> > >>>
> > >>> HBase cache and connector cache are two different things. HBase cache
> > >> can't
> > >>> simply replace
> > >>> connector cache, because one of the most important usages for
> > >>> connector cache is reducing the I/O requests/responses and improving
> > >>> the throughput, which cannot be achieved by just using a server cache.
> > >>>
> > >>> Best,
> > >>> Jark
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, 27 May 2022 at 22:42, Jing Ge <j...@ververica.com> wrote:
> > >>>
> > >>>> Thanks all for the valuable discussion. The new feature looks very
> > >>>> interesting.
> > >>>>
> > >>>> According to the FLIP description: "*Currently we have JDBC, Hive
> and
> > >>> HBase
> > >>>> connector implemented lookup table source. All existing
> > implementations
> > >>>> will be migrated to the current design and the migration will be
> > >>>> transparent to end users*." I was only wondering if we should pay
> > >>> attention
> > >>>> to HBase and similar DBs. Since, commonly, the lookup data will be
> > huge
> > >>>> while using HBase, partial caching will be used in this case, if I
> am
> > >> not
> > >>>> mistaken, which might have an impact on the block cache used by
> HBase,
> > >>> e.g.
> > >>>> LruBlockCache.
> > >>>> Another question is that, since HBase provides a sophisticated cache
> > >>>> solution, does it make sense to have a no-cache solution as one of
> the
> > >>>> default solutions so that customers will have no effort for the
> > >> migration
> > >>>> if they want to stick with Hbase cache?
> > >>>>
> > >>>> Best regards,
> > >>>> Jing
> > >>>>
> > >>>> On Fri, May 27, 2022 at 11:19 AM Jingsong Li <
> jingsongl...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I think the problem now is below:
> > >>>>> 1. The AllCache and PartialCache interfaces are non-uniform: one
> > >>>>> needs to provide a LookupProvider, the other a CacheBuilder.
> > >>>>> 2. AllCache definition is not flexible. For example, PartialCache can
> > >>>>> use any custom storage while the AllCache cannot; the AllCache could
> > >>>>> also store to memory or disk, which also needs a flexible strategy.
> > >>>>> 3. AllCache cannot customize its ReloadStrategy; currently there is
> > >>>>> only ScheduledReloadStrategy.
> > >>>>>
> > >>>>> In order to solve the above problems, the following are my ideas.
> > >>>>>
> > >>>>> ## Top level cache interfaces:
> > >>>>>
> > >>>>> ```
> > >>>>>
> > >>>>> public interface CacheLookupProvider extends
> > >>>>> LookupTableSource.LookupRuntimeProvider {
> > >>>>>
> > >>>>>    CacheBuilder createCacheBuilder();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface CacheBuilder {
> > >>>>>    Cache create();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface Cache {
> > >>>>>
> > >>>>>    /**
> > >>>>>     * Returns the value associated with key in this cache, or null
> > >> if
> > >>>>> there is no cached value for
> > >>>>>     * key.
> > >>>>>     */
> > >>>>>    @Nullable
> > >>>>>    Collection<RowData> getIfPresent(RowData key);
> > >>>>>
> > >>>>>    /** Returns the number of key-value mappings in the cache. */
> > >>>>>    long size();
> > >>>>> }
> > >>>>>
> > >>>>> ```
> > >>>>>
> > >>>>> ## Partial cache
> > >>>>>
> > >>>>> ```
> > >>>>>
> > >>>>> public interface PartialCacheLookupFunction extends
> > >>> CacheLookupProvider {
> > >>>>>
> > >>>>>    @Override
> > >>>>>    PartialCacheBuilder createCacheBuilder();
> > >>>>>
> > >>>>>    /** Creates an {@link LookupFunction} instance. */
> > >>>>>    LookupFunction createLookupFunction();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface PartialCacheBuilder extends CacheBuilder {
> > >>>>>
> > >>>>>    PartialCache create();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface PartialCache extends Cache {
> > >>>>>
> > >>>>>    /**
> > >>>>>     * Associates the specified value rows with the specified key
> row
> > >>>>> in the cache. If the cache
> > >>>>>     * previously contained value associated with the key, the old
> > >>>>> value is replaced by the
> > >>>>>     * specified value.
> > >>>>>     *
> > >>>>>     * @return the previous value rows associated with key, or null
> > >> if
> > >>>>> there was no mapping for key.
> > >>>>>     * @param key - key row with which the specified value is to be
> > >>>>> associated
> > >>>>>     * @param value – value rows to be associated with the specified
> > >>> key
> > >>>>>     */
> > >>>>>    Collection<RowData> put(RowData key, Collection<RowData> value);
> > >>>>>
> > >>>>>    /** Discards any cached value for the specified key. */
> > >>>>>    void invalidate(RowData key);
> > >>>>> }
> > >>>>>
> > >>>>> ```
> > >>>>>
> > >>>>> ## All cache
> > >>>>> ```
> > >>>>>
> > >>>>> public interface AllCacheLookupProvider extends
> CacheLookupProvider {
> > >>>>>
> > >>>>>    void registerReloadStrategy(ScheduledExecutorService
> > >>>>> executorService, Reloader reloader);
> > >>>>>
> > >>>>>    ScanTableSource.ScanRuntimeProvider getScanRuntimeProvider();
> > >>>>>
> > >>>>>    @Override
> > >>>>>    AllCacheBuilder createCacheBuilder();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface AllCacheBuilder extends CacheBuilder {
> > >>>>>
> > >>>>>    AllCache create();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface AllCache extends Cache {
> > >>>>>
> > >>>>>    void putAll(Iterator<Map<RowData, RowData>> allEntries);
> > >>>>>
> > >>>>>    void clearAll();
> > >>>>> }
> > >>>>>
> > >>>>>
> > >>>>> public interface Reloader {
> > >>>>>
> > >>>>>    void reload();
> > >>>>> }
> > >>>>>
> > >>>>> ```
> > >>>>>
> > >>>>> Best,
> > >>>>> Jingsong
> > >>>>>
> > >>>>> On Fri, May 27, 2022 at 11:10 AM Jingsong Li <
> jingsongl...@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Thanks Qingsheng and all for your discussion.
> > >>>>>>
> > >>>>>> Very sorry to jump in so late.
> > >>>>>>
> > >>>>>> Maybe I missed something?
> > >>>>>> My first impression when I saw the cache interface was: why don't we
> > >>>>>> provide an interface similar to Guava cache [1]? On top of Guava
> > >>>>>> cache, Caffeine also adds extensions for asynchronous calls [2].
> > >>>>>> There is also bulk loading in Caffeine.
> > >>>>>>
> > >>>>>> I am also confused about why we go from LookupCacheFactory.Builder
> > >>>>>> to a Factory and only then create the Cache.
> > >>>>>>
> > >>>>>> [1] https://github.com/google/guava
> > >>>>>> [2] https://github.com/ben-manes/caffeine/wiki/Population
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Jingsong
> > >>>>>>
> > >>>>>> On Thu, May 26, 2022 at 11:17 PM Jark Wu <imj...@gmail.com>
> wrote:
> > >>>>>>
> > >>>>>>> After looking at the new introduced ReloadTime and Becket's
> > >> comment,
> > >>>>>>> I agree with Becket we should have a pluggable reloading
> strategy.
> > >>>>>>> We can provide some common implementations, e.g., periodic
> > >>> reloading,
> > >>>>> and
> > >>>>>>> daily reloading.
> > >>>>>>> But there will definitely be some connector- or business-specific
> > >>>>>>> reloading strategies, e.g.
> > >>>>>>> notify by a zookeeper watcher, reload once a new Hive partition
> is
> > >>>>>>> complete.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jark
> > >>>>>>>
> > >>>>>>> On Thu, 26 May 2022 at 11:52, Becket Qin <becket....@gmail.com>
> > >>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Qingsheng,
> > >>>>>>>>
> > >>>>>>>> Thanks for updating the FLIP. A few comments / questions below:
> > >>>>>>>>
> > >>>>>>>> 1. Is there a reason that we have both "XXXFactory" and
> > >>>> "XXXProvider".
> > >>>>>>>> What is the difference between them? If they are the same, can
> > >> we
> > >>>> just
> > >>>>>>> use
> > >>>>>>>> XXXFactory everywhere?
> > >>>>>>>>
> > >>>>>>>> 2. Regarding the FullCachingLookupProvider, should the reloading
> > >>>>>>>> policy also be pluggable? Periodic reloading can sometimes be
> > >>>>>>>> tricky in practice. For example, if a user sets 24 hours as the
> > >>>>>>>> cache refresh interval and some nightly batch job is delayed, the
> > >>>>>>>> cache may still serve stale data.
> > >>>>>>>>
> > >>>>>>>> 3. In DefaultLookupCacheFactory, it looks like InitialCapacity
> > >>>> should
> > >>>>> be
> > >>>>>>>> removed.
> > >>>>>>>>
> > >>>>>>>> 4. The purpose of LookupFunctionProvider#cacheMissingKey() seems a
> > >>>>>>>> little confusing to me. If Optional<LookupCacheFactory>
> > >>>>>>>> getCacheFactory() returns a non-empty factory, doesn't that already
> > >>>>>>>> indicate that the framework should cache the missing keys? Also,
> > >>>>>>>> why is this method returning an Optional<Boolean> instead of a
> > >>>>>>>> boolean?
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>>
> > >>>>>>>> Jiangjie (Becket) Qin
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, May 25, 2022 at 5:07 PM Qingsheng Ren <
> > >> renqs...@gmail.com
> > >>>>
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Lincoln and Jark,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the comments! If the community reaches a consensus
> > >>> that
> > >>>> we
> > >>>>>>> use
> > >>>>>>>>> SQL hint instead of table options to decide whether to use sync
> > >>> or
> > >>>>>>> async
> > >>>>>>>>> mode, it’s indeed not necessary to introduce the “lookup.async”
> > >>>>> option.
> > >>>>>>>>>
> > >>>>>>>>> I think it’s a good idea to let the decision about async be made
> > >>>>>>>>> at the query level, which could enable better optimization with
> > >>>>>>>>> more information gathered by the planner. Is there any FLIP
> > >>>>>>>>> describing the issue in FLINK-27625? I thought FLIP-234 only
> > >>>>>>>>> proposes adding a SQL hint for retry on missing, rather than
> > >>>>>>>>> having the entire async mode controlled by a hint.
> > >>>>>>>>>
> > >>>>>>>>> Best regards,
> > >>>>>>>>>
> > >>>>>>>>> Qingsheng
> > >>>>>>>>>
> > >>>>>>>>>> On May 25, 2022, at 15:13, Lincoln Lee <
> > >> lincoln.8...@gmail.com
> > >>>>
> > >>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Jark,
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for your reply!
> > >>>>>>>>>>
> > >>>>>>>>>> Currently 'lookup.async' lives only in the HBase connector. I
> > >>>>>>>>>> have no idea whether or when to remove it (we can discuss it in
> > >>>>>>>>>> another issue for the HBase connector after FLINK-27625 is done);
> > >>>>>>>>>> let's just not add it as a common option now.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Jark Wu <imj...@gmail.com> 于2022年5月24日周二 20:14写道:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Lincoln,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I have taken a look at FLIP-234, and I agree with you that
> > >> the
> > >>>>>>>>> connectors
> > >>>>>>>>>>> can
> > >>>>>>>>>>> provide both async and sync runtime providers simultaneously
> > >>>>> instead
> > >>>>>>>>> of one
> > >>>>>>>>>>> of them.
> > >>>>>>>>>>> At that point, "lookup.async" looks redundant. If this
> > >> option
> > >>> is
> > >>>>>>>>> planned to
> > >>>>>>>>>>> be removed
> > >>>>>>>>>>> in the long term, I think it makes sense not to introduce it
> > >>> in
> > >>>>> this
> > >>>>>>>>> FLIP.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, 24 May 2022 at 11:08, Lincoln Lee <
> > >>>> lincoln.8...@gmail.com
> > >>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Sorry for jumping into the discussion so late. It's a good
> > >>> idea
> > >>>>>>> that
> > >>>>>>>>> we
> > >>>>>>>>>>> can
> > >>>>>>>>>>>> have a common table option. I have one minor comment on
> > >>>>>>>>>>>> 'lookup.async', namely not to make it a common option:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The table layer abstracts both sync and async lookup
> > >>>>>>>>>>>> capabilities, and connector implementers can choose one or
> > >>>>>>>>>>>> both. In the case of implementing only one capability (the
> > >>>>>>>>>>>> status of most existing built-in connectors), 'lookup.async'
> > >>>>>>>>>>>> will not be used. And when a connector has both capabilities,
> > >>>>>>>>>>>> I think this choice is more suitable for being made at the
> > >>>>>>>>>>>> query level: for example, the table planner can choose the
> > >>>>>>>>>>>> physical implementation of async or sync lookup based on its
> > >>>>>>>>>>>> cost model, or users can give a query hint based on their own
> > >>>>>>>>>>>> better understanding. If there is another common table option
> > >>>>>>>>>>>> 'lookup.async', it may confuse users in the long run.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So, I prefer to leave the 'lookup.async' option in private
> > >>>> place
> > >>>>>>> (for
> > >>>>>>>>> the
> > >>>>>>>>>>>> current hbase connector) and not turn it into a common
> > >>> option.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> WDYT?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Lincoln Lee
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Qingsheng Ren <renqs...@gmail.com> 于2022年5月23日周一 14:54写道:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the review! We recently updated the FLIP and
> > >> you
> > >>>> can
> > >>>>>>> find
> > >>>>>>>>>>>> those
> > >>>>>>>>>>>>> changes from my latest email. Since some terminology has
> > >>>>>>>>>>>>> changed, I’ll use the new concepts when replying to your
> > >>>>>>>>>>>>> comments.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. Builder vs ‘of’
> > >>>>>>>>>>>>> I’m OK to use builder pattern if we have additional
> > >> optional
> > >>>>>>>>> parameters
> > >>>>>>>>>>>>> for full caching mode (“rescan” previously). The
> > >>>>>>> schedule-with-delay
> > >>>>>>>>>>> idea
> > >>>>>>>>>>>>> looks reasonable to me, but I think we need to redesign
> > >> the
> > >>>>>>> builder
> > >>>>>>>>> API
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>> full caching to make it more descriptive for developers.
> > >>> Would
> > >>>>> you
> > >>>>>>>>> mind
> > >>>>>>>>>>>>> sharing your ideas about the API? For accessing the FLIP
> > >>>>> workspace
> > >>>>>>>>> you
> > >>>>>>>>>>>> can
> > >>>>>>>>>>>>> just provide your account ID and ping any PMC member
> > >>> including
> > >>>>>>> Jark.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>> We have some discussions these days and propose to
> > >>> introduce 8
> > >>>>>>> common
> > >>>>>>>>>>>>> table options about caching. It has been updated on the
> > >>> FLIP.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>> I think we are on the same page :-)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> For your additional concerns:
> > >>>>>>>>>>>>> 1) The table option has been updated.
> > >>>>>>>>>>>>> 2) We got “lookup.cache” back for configuring whether to
> > >> use
> > >>>>>>> partial
> > >>>>>>>>> or
> > >>>>>>>>>>>>> full caching mode.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On May 19, 2022, at 17:25, Александр Смирнов <
> > >>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also I have a few additions:
> > >>>>>>>>>>>>>> 1) maybe rename 'lookup.cache.maximum-size' to
> > >>>>>>>>>>>>>> 'lookup.cache.max-rows'? I think it will be more clear
> > >> that
> > >>>> we
> > >>>>>>> talk
> > >>>>>>>>>>>>>> not about bytes, but about the number of rows. Plus it
> > >> fits
> > >>>>> more,
> > >>>>>>>>>>>>>> considering my optimization with filters.
> > >>>>>>>>>>>>>> 2) How will users enable rescanning? Are we going to
> > >>> separate
> > >>>>>>>>> caching
> > >>>>>>>>>>>>>> and rescanning from the options point of view? Like
> > >>> initially
> > >>>>> we
> > >>>>>>> had
> > >>>>>>>>>>>>>> one option 'lookup.cache' with values LRU / ALL. I think
> > >>> now
> > >>>> we
> > >>>>>>> can
> > >>>>>>>>>>>>>> make a boolean option 'lookup.rescan'. RescanInterval can
> > >>> be
> > >>>>>>>>>>>>>> 'lookup.rescan.interval', etc.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> чт, 19 мая 2022 г. в 14:50, Александр Смирнов <
> > >>>>>>> smirale...@gmail.com
> > >>>>>>>>>>>> :
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hi Qingsheng and Jark,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 1. Builders vs 'of'
> > >>>>>>>>>>>>>>> I understand that builders are used when we have
> > >> multiple
> > >>>>>>>>>>> parameters.
> > >>>>>>>>>>>>>>> I suggested them because we could add parameters later.
> > >> To
> > >>>>>>> prevent
> > >>>>>>>>>>>>>>> Builder for ScanRuntimeProvider from looking redundant I
> > >>> can
> > >>>>>>>>> suggest
> > >>>>>>>>>>>>>>> one more config now - "rescanStartTime".
> > >>>>>>>>>>>>>>> It's a time in UTC (LocalTime class) when the first
> > >> reload
> > >>>> of
> > >>>>>>> cache
> > >>>>>>>>>>>>>>> starts. This parameter can be thought of as
> > >> 'initialDelay'
> > >>>>> (diff
> > >>>>>>>>>>>>>>> between current time and rescanStartTime) in method
> > >>>>>>>>>>>>>>> ScheduleExecutorService#scheduleWithFixedDelay [1] . It
> > >>> can
> > >>>> be
> > >>>>>>> very
> > >>>>>>>>>>>>>>> useful when the dimension table is updated by some other
> > >>>>>>> scheduled
> > >>>>>>>>>>> job
> > >>>>>>>>>>>>>>> at a certain time. Or when the user simply wants a
> > >> second
> > >>>> scan
> > >>>>>>>>>>> (first
> > >>>>>>>>>>>>>>> cache reload) be delayed. This option can be used even
> > >>>> without
> > >>>>>>>>>>>>>>> 'rescanInterval' - in this case 'rescanInterval' will be
> > >>> one
> > >>>>>>> day.
> > >>>>>>>>>>>>>>> If you are fine with this option, I would be very glad
> > >> if
> > >>>> you
> > >>>>>>> would
> > >>>>>>>>>>>>>>> give me access to edit FLIP page, so I could add it
> > >> myself
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 2. Common table options
> > >>>>>>>>>>>>>>> I also think that FactoryUtil would be overloaded by all
> > >>>> cache
> > >>>>>>>>>>>>>>> options. But maybe unify all suggested options, not only
> > >>> for
> > >>>>>>>>> default
> > >>>>>>>>>>>>>>> cache? I.e. class 'LookupOptions', that unifies default
> > >>>> cache
> > >>>>>>>>>>> options,
> > >>>>>>>>>>>>>>> rescan options, 'async', 'maxRetries'. WDYT?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 3. Retries
> > >>>>>>>>>>>>>>> I'm fine with suggestion close to
> > >>> RetryUtils#tryTimes(times,
> > >>>>>>> call)
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ScheduledExecutorService.html#scheduleWithFixedDelay-java.lang.Runnable-long-long-java.util.concurrent.TimeUnit-
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> ср, 18 мая 2022 г. в 16:04, Qingsheng Ren <
> > >>>> renqs...@gmail.com
> > >>>>>> :
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Jark and Alexander,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for your comments! I’m also OK to introduce
> > >> common
> > >>>>> table
> > >>>>>>>>>>>>> options. I prefer to introduce a new
> > >>> DefaultLookupCacheOptions
> > >>>>>>> class
> > >>>>>>>>>>> for
> > >>>>>>>>>>>>> holding these option definitions because putting all
> > >> options
> > >>>>> into
> > >>>>>>>>>>>>> FactoryUtil would make it a bit ”crowded” and not well
> > >>>>>>> categorized.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> FLIP has been updated according to suggestions above:
> > >>>>>>>>>>>>>>>> 1. Use static “of” method for constructing
> > >>>>>>> RescanRuntimeProvider
> > >>>>>>>>>>>>> considering both arguments are required.
> > >>>>>>>>>>>>>>>> 2. Introduce new table options matching
> > >>>>>>> DefaultLookupCacheFactory
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Wed, May 18, 2022 at 2:57 PM Jark Wu <
> > >>> imj...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 1) retry logic
> > >>>>>>>>>>>>>>>>> I think we can extract some common retry logic into
> > >>>>> utilities,
> > >>>>>>>>>>> e.g.
> > >>>>>>>>>>>>> RetryUtils#tryTimes(times, call).
> > >>>>>>>>>>>>>>>>> This seems independent of this FLIP and can be reused
> > >> by
> > >>>>>>>>>>> DataStream
> > >>>>>>>>>>>>> users.
> > >>>>>>>>>>>>>>>>> Maybe we can open an issue to discuss this and where
> > >> to
> > >>>> put
> > >>>>>>> it.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> 2) cache ConfigOptions
> > >>>>>>>>>>>>>>>>> I'm fine with defining cache config options in the
> > >>>>> framework.
> > >>>>>>>>>>>>>>>>> A candidate place to put is FactoryUtil which also
> > >>>> includes
> > >>>>>>>>>>>>> "sink.parallelism", "format" options.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Wed, 18 May 2022 at 13:52, Александр Смирнов <
> > >>>>>>>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hi Qingsheng,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thank you for considering my comments.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> there might be custom logic before making retry,
> > >> such
> > >>> as
> > >>>>>>>>>>>>> re-establish the connection
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Yes, I understand that. I meant that such logic can
> > >> be
> > >>>>>>> placed in
> > >>>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>> separate function, that can be implemented by
> > >>> connectors.
> > >>>>>>> Just
> > >>>>>>>>>>>> moving
> > >>>>>>>>>>>>>>>>>> the retry logic would make connector's LookupFunction
> > >>>> more
> > >>>>>>>>>>> concise
> > >>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> avoid duplicate code. However, it's a minor change.
> > >> The
> > >>>>>>> decision
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>> up
> > >>>>>>>>>>>>>>>>>> to you.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> > >>>>>>> developers
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> define their own options as we do now per connector.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> What is the reason for that? One of the main goals of
> > >>>> this
> > >>>>>>> FLIP
> > >>>>>>>>>>> was
> > >>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>> unify the configs, wasn't it? I understand that
> > >> current
> > >>>>> cache
> > >>>>>>>>>>>> design
> > >>>>>>>>>>>>>>>>>> doesn't depend on ConfigOptions, like was before. But
> > >>>> still
> > >>>>>>> we
> > >>>>>>>>>>> can
> > >>>>>>>>>>>>> put
> > >>>>>>>>>>>>>>>>>> these options into the framework, so connectors can
> > >>> reuse
> > >>>>>>> them
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> avoid code duplication, and, what is more
> > >> significant,
> > >>>>> avoid
> > >>>>>>>>>>>> possible
> > >>>>>>>>>>>>>>>>>> different options naming. This moment can be pointed
> > >>> out
> > >>>> in
> > >>>>>>>>>>>>>>>>>> documentation for connector developers.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>> Alexander
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> вт, 17 мая 2022 г. в 17:11, Qingsheng Ren <
> > >>>>>>> renqs...@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Alexander,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks for the review and glad to see we are on the
> > >>> same
> > >>>>>>> page!
> > >>>>>>>>> I
> > >>>>>>>>>>>>> think you forgot to cc the dev mailing list so I’m also
> > >>>> quoting
> > >>>>>>> your
> > >>>>>>>>>>>> reply
> > >>>>>>>>>>>>> under this email.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> We can add 'maxRetryTimes' option into this class
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> In my opinion the retry logic should be implemented
> > >> in
> > >>>>>>> lookup()
> > >>>>>>>>>>>>> instead of in LookupFunction#eval(). Retrying is only
> > >>>> meaningful
> > >>>>>>>>> under
> > >>>>>>>>>>>> some
> > >>>>>>>>>>>>> specific retriable failures, and there might be custom
> > >> logic
> > >>>>>>> before
> > >>>>>>>>>>>> making
> > >>>>>>>>>>>>> retry, such as re-establish the connection
> > >>>>>>> (JdbcRowDataLookupFunction
> > >>>>>>>>>>> is
> > >>>>>>>>>>>> an
> > >>>>>>>>>>>>> example), so it's more handy to leave it to the connector.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I don't see DDL options, that were in previous
> > >>> version
> > >>>> of
> > >>>>>>>>> FLIP.
> > >>>>>>>>>>>> Do
> > >>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> We decide not to provide common DDL options and let
> > >>>>>>> developers
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> define their own options as we do now per connector.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> The rest of the comments sound great and I'll update
> > >>>>>>>>>>>>>>>>>>> the FLIP. Hope we can finalize our proposal soon!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On May 17, 2022, at 13:46, Александр Смирнов <
> > >>>>>>>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Hi Qingsheng and devs!
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I like the overall design of the updated FLIP; however,
> > >>>>>>>>>>>>>>>>>>>> I have several suggestions and questions.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 1) Introducing LookupFunction as a subclass of
> > >>>>>>>>>>>>>>>>>>>> TableFunction is a good idea. We can add a
> > >>>>>>>>>>>>>>>>>>>> 'maxRetryTimes' option into this class; the 'eval'
> > >>>>>>>>>>>>>>>>>>>> method of the new LookupFunction is great for this
> > >>>>>>>>>>>>>>>>>>>> purpose. The same goes for the 'async' case.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 2) There might be other configs in the future, such as
> > >>>>>>>>>>>>>>>>>>>> 'cacheMissingKey' in LookupFunctionProvider or
> > >>>>>>>>>>>>>>>>>>>> 'rescanInterval' in ScanRuntimeProvider. Maybe use the
> > >>>>>>>>>>>>>>>>>>>> Builder pattern in LookupFunctionProvider and
> > >>>>>>>>>>>>>>>>>>>> RescanRuntimeProvider for more flexibility (one
> > >>>>>>>>>>>>>>>>>>>> 'build' method instead of many 'of' methods in the
> > >>>>>>>>>>>>>>>>>>>> future)?
> > >>>>>>>>>>>>>>>>>>>>
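[Editor's note] The builder idea in point 2 could look roughly like this. The class and option names (LookupFunctionProvider, cacheMissingKey, rescanInterval) are taken from the discussion, but the shape is illustrative, not the final API; the point is that one build() stays stable while optional settings accumulate over time:

```java
import java.util.Optional;

// Sketch of a builder-based provider with optional settings.
final class LookupFunctionProvider {
    private final boolean cacheMissingKey;
    private final Optional<Long> rescanIntervalMs;

    private LookupFunctionProvider(Builder b) {
        this.cacheMissingKey = b.cacheMissingKey;
        this.rescanIntervalMs = Optional.ofNullable(b.rescanIntervalMs);
    }

    static Builder newBuilder() {
        return new Builder();
    }

    boolean cacheMissingKey() {
        return cacheMissingKey;
    }

    Optional<Long> rescanIntervalMs() {
        return rescanIntervalMs;
    }

    static final class Builder {
        private boolean cacheMissingKey = true; // an assumed default
        private Long rescanIntervalMs;          // optional; may stay unset

        Builder cacheMissingKey(boolean v) {
            this.cacheMissingKey = v;
            return this;
        }

        Builder rescanIntervalMs(long ms) {
            this.rescanIntervalMs = ms;
            return this;
        }

        // One build() instead of a growing family of of(...) overloads:
        // adding a new optional setting later does not break callers.
        LookupFunctionProvider build() {
            return new LookupFunctionProvider(this);
        }
    }
}
```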
> > >>>>>>>>>>>>>>>>>>>> 3) What are the plans for existing
> > >>>> TableFunctionProvider
> > >>>>>>> and
> > >>>>>>>>>>>>>>>>>>>> AsyncTableFunctionProvider? I think they should be
> > >>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 4) Am I right that the current design does not assume
> > >>>>>>>>>>>>>>>>>>>> usage of a user-provided LookupCache in re-scanning? In
> > >>>>>>>>>>>>>>>>>>>> that case, it is not very clear why we need methods
> > >>>>>>>>>>>>>>>>>>>> such as 'invalidate' or 'putAll' in LookupCache.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> 5) I don't see DDL options, that were in previous
> > >>>> version
> > >>>>>>> of
> > >>>>>>>>>>>> FLIP.
> > >>>>>>>>>>>>> Do
> > >>>>>>>>>>>>>>>>>>>> you have any special plans for them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> If you don't mind, I would be glad to make small
> > >>>>>>>>>>>>>>>>>>>> adjustments to the FLIP document too. I think it's
> > >>>>>>>>>>>>>>>>>>>> worth mentioning what exactly is planned for future
> > >>>>>>>>>>>>>>>>>>>> optimizations.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 20:27, Qingsheng Ren <
> > >>>>>>>>>>>>>>>>>>>> renqs...@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi Alexander and devs,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thank you very much for the in-depth discussion!
> > >> As
> > >>>> Jark
> > >>>>>>>>>>>>> mentioned we were inspired by Alexander's idea and made a
> > >>>>>>> refactor on
> > >>>>>>>>>>> our
> > >>>>>>>>>>>>> design. FLIP-221 [1] has been updated to reflect our
> > >> design
> > >>>> now
> > >>>>>>> and
> > >>>>>>>>> we
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> happy to hear more suggestions from you!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Compared to the previous design:
> > >>>>>>>>>>>>>>>>>>>>> 1. The lookup cache serves at table runtime level
> > >>> and
> > >>>> is
> > >>>>>>>>>>>>> integrated as a component of LookupJoinRunner as discussed
> > >>>>>>>>> previously.
> > >>>>>>>>>>>>>>>>>>>>> 2. Interfaces are renamed and re-designed to
> > >> reflect
> > >>>> the
> > >>>>>>> new
> > >>>>>>>>>>>>> design.
> > >>>>>>>>>>>>>>>>>>>>> 3. We separate the all-caching case individually
> > >> and
> > >>>>>>>>>>> introduce a
> > >>>>>>>>>>>>> new RescanRuntimeProvider to reuse the ability of
> > >> scanning.
> > >>> We
> > >>>>> are
> > >>>>>>>>>>>> planning
> > >>>>>>>>>>>>> to support SourceFunction / InputFormat for now
> > >> considering
> > >>>> the
> > >>>>>>>>>>>> complexity
> > >>>>>>>>>>>>> of FLIP-27 Source API.
> > >>>>>>>>>>>>>>>>>>>>> 4. A new interface LookupFunction is introduced to
> > >>>> make
> > >>>>>>> the
> > >>>>>>>>>>>>> semantic of lookup more straightforward for developers.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> For replying to Alexander:
> > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat
> > >>> is
> > >>>>>>>>>>> deprecated
> > >>>>>>>>>>>>> or not. Am I right that it will be so in the future, but
> > >>>>> currently
> > >>>>>>>>> it's
> > >>>>>>>>>>>> not?
> > >>>>>>>>>>>>>>>>>>>>> Yes you are right. InputFormat is not deprecated
> > >> for
> > >>>>> now.
> > >>>>>>> I
> > >>>>>>>>>>>> think
> > >>>>>>>>>>>>> it will be deprecated in the future but we don't have a
> > >>> clear
> > >>>>> plan
> > >>>>>>>>> for
> > >>>>>>>>>>>> that.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks again for the discussion on this FLIP and
> > >>>> looking
> > >>>>>>>>>>> forward
> > >>>>>>>>>>>>> to cooperating with you after we finalize the design and
> > >>>>>>> interfaces!
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Qingsheng
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Fri, May 13, 2022 at 12:12 AM Александр
> > >> Смирнов <
> > >>>>>>>>>>>>> smirale...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Hi Jark, Qingsheng and Leonard!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Glad to see that we came to a consensus on almost
> > >>> all
> > >>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> However I'm a little confused whether InputFormat is
> > >>>>>>>>>>>>>>>>>>>>>> deprecated or not. Am I right that it will be so in
> > >>>>>>>>>>>>>>>>>>>>>> the future, but currently it's not? Actually I also
> > >>>>>>>>>>>>>>>>>>>>>> think that for the first version it's OK to use
> > >>>>>>>>>>>>>>>>>>>>>> InputFormat in the ALL cache implementation, because
> > >>>>>>>>>>>>>>>>>>>>>> supporting the rescan ability seems like a very
> > >>>>>>>>>>>>>>>>>>>>>> distant prospect. But for this decision we need a
> > >>>>>>>>>>>>>>>>>>>>>> consensus among all discussion participants.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In general, I don't have anything to argue with in
> > >>>>>>>>>>>>>>>>>>>>>> your statements. All of them correspond to my ideas.
> > >>>>>>>>>>>>>>>>>>>>>> Looking ahead, it would be nice to work on this FLIP
> > >>>>>>>>>>>>>>>>>>>>>> cooperatively. I've already done a lot of work on
> > >>>>>>>>>>>>>>>>>>>>>> lookup join caching with an implementation very close
> > >>>>>>>>>>>>>>>>>>>>>> to the one we are discussing, and want to share the
> > >>>>>>>>>>>>>>>>>>>>>> results of this work. Anyway, looking forward to the
> > >>>>>>>>>>>>>>>>>>>>>> FLIP update!
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Thu, May 12, 2022 at 17:38, Jark Wu <
> > >>>>>>>>>>>>>>>>>>>>>> imj...@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Hi Alex,
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Thanks for summarizing your points.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> In the past week, Qingsheng, Leonard, and I have
> > >>>>>>> discussed
> > >>>>>>>>>>> it
> > >>>>>>>>>>>>> several times
> > >>>>>>>>>>>>>>>>>>>>>>> and we have totally refactored the design.
> > >>>>>>>>>>>>>>>>>>>>>>> I'm glad to say we have reached a consensus on
> > >>> many
> > >>>> of
> > >>>>>>> your
> > >>>>>>>>>>>>> points!
> > >>>>>>>>>>>>>>>>>>>>>>> Qingsheng is still working on updating the
> > >> design
> > >>>> docs
> > >>>>>>> and
> > >>>>>>>>>>>>> maybe can be
> > >>>>>>>>>>>>>>>>>>>>>>> available in the next few days.
> > >>>>>>>>>>>>>>>>>>>>>>> I will share some conclusions from our
> > >>> discussions:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 1) we have refactored the design towards to
> > >> "cache
> > >>>> in
> > >>>>>>>>>>>>> framework" way.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 2) a "LookupCache" interface for users to customize
> > >>>>>>>>>>>>>>>>>>>>>>> and a default implementation with a builder for ease
> > >>>>>>>>>>>>>>>>>>>>>>> of use. This makes it possible to have both
> > >>>>>>>>>>>>>>>>>>>>>>> flexibility and conciseness.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 3) Filter pushdown is important for ALL and LRU
> > >>>> lookup
> > >>>>>>>>>>> cache,
> > >>>>>>>>>>>>> esp reducing
> > >>>>>>>>>>>>>>>>>>>>>>> IO.
> > >>>>>>>>>>>>>>>>>>>>>>> Filter pushdown should be the final state and
> > >> the
> > >>>>>>> unified
> > >>>>>>>>>>> way
> > >>>>>>>>>>>>> to both
> > >>>>>>>>>>>>>>>>>>>>>>> support pruning ALL cache and LRU cache,
> > >>>>>>>>>>>>>>>>>>>>>>> so I think we should make effort in this
> > >>> direction.
> > >>>> If
> > >>>>>>> we
> > >>>>>>>>>>> need
> > >>>>>>>>>>>>> to support
> > >>>>>>>>>>>>>>>>>>>>>>> filter pushdown for ALL cache anyway, why not
> > >> use
> > >>>>>>>>>>>>>>>>>>>>>>> it for LRU cache as well? Either way, as we
> > >> decide
> > >>>> to
> > >>>>>>>>>>>> implement
> > >>>>>>>>>>>>> the cache
> > >>>>>>>>>>>>>>>>>>>>>>> in the framework, we have the chance to support
> > >>>>>>>>>>>>>>>>>>>>>>> filter on cache anytime. This is an optimization
> > >>> and
> > >>>>> it
> > >>>>>>>>>>>> doesn't
> > >>>>>>>>>>>>> affect the
> > >>>>>>>>>>>>>>>>>>>>>>> public API. I think we can create a JIRA issue
> > >> to
> > >>>>>>>>>>>>>>>>>>>>>>> discuss it when the FLIP is accepted.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> 4) The idea to support ALL cache is similar to
> > >>> your
> > >>>>>>>>>>> proposal.
> > >>>>>>>>>>>>>>>>>>>>>>> In the first version, we will only support
> > >>>>> InputFormat,
> > >>>>>>>>>>>>> SourceFunction for
> > >>>>>>>>>>>>>>>>>>>>>>> cache all (invoke InputFormat in join operator).
> > >>>>>>>>>>>>>>>>>>>>>>> For FLIP-27 source, we need to join a true
> > >> source
> > >>>>>>> operator
> > >>>>>>>>>>>>> instead of
> > >>>>>>>>>>>>>>>>>>>>>>> calling it embedded in the join operator.
> > >>>>>>>>>>>>>>>>>>>>>>> However, this needs another FLIP to support the
> > >>>>> re-scan
> > >>>>>>>>>>>> ability
> > >>>>>>>>>>>>> for FLIP-27
> > >>>>>>>>>>>>>>>>>>>>>>> Source, and this can be a large work.
> > >>>>>>>>>>>>>>>>>>>>>>> In order to not block this issue, we can put the
> > >>>>> effort
> > >>>>>>> of
> > >>>>>>>>>>>>> FLIP-27 source
> > >>>>>>>>>>>>>>>>>>>>>>> integration into future work and integrate
> > >>>>>>>>>>>>>>>>>>>>>>> InputFormat&SourceFunction for now.
> > >>>>>>>>>>>>>>>>>>>>>>>
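[Editor's note] The "invoke InputFormat in join operator" idea for the ALL cache can be sketched as a periodic full reload. The scan is abstracted here behind a plain Supplier of {key, value} rows; FullCache and ttlMs are illustrative names, not the FLIP design, which would drive the reload via InputFormat/SourceFunction:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: on the first lookup, and whenever the TTL has elapsed, the whole
// dimension table is re-scanned and a fresh key -> rows index replaces the
// old one (i.e. the cached state is cleared and rebuilt).
class FullCache {
    private final Supplier<List<String[]>> scanAll; // row = {key, value}
    private final long ttlMs;
    private boolean loaded = false;
    private long lastLoadMs;
    private Map<String, List<String[]>> index = Map.of();

    FullCache(Supplier<List<String[]>> scanAll, long ttlMs) {
        this.scanAll = scanAll;
        this.ttlMs = ttlMs;
    }

    List<String[]> lookup(String key, long nowMs) {
        if (!loaded || nowMs - lastLoadMs >= ttlMs) {
            // Rebuild the whole index from a fresh scan of the table.
            Map<String, List<String[]>> fresh = new HashMap<>();
            for (String[] row : scanAll.get()) {
                fresh.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
            }
            index = fresh;
            loaded = true;
            lastLoadMs = nowMs;
        }
        return index.getOrDefault(key, List.of());
    }
}
```

Between reloads, lookups are served from the stale snapshot; rows added to the backing table only become visible once the TTL elapses and the next lookup triggers a re-scan.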
> > >>>>>>>>>>>>>>>>>>>>>>> I think it's fine to use
> > >>> InputFormat&SourceFunction,
> > >>>>> as
> > >>>>>>>>> they
> > >>>>>>>>>>>>> are not
> > >>>>>>>>>>>>>>>>>>>>>>> deprecated, otherwise, we have to introduce
> > >>> another
> > >>>>>>>>> function
> > >>>>>>>>>>>>>>>>>>>>>>> similar to them which is meaningless. We need to
> > >>>> plan
> > >>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>> source
> > >>>>>>>>>>>>>>>>>>>>>>> integration ASAP before InputFormat &
> > >>> SourceFunction
> > >>>>> are
> > >>>>>>>>>>>>> deprecated.
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>>>>>> Jark
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 15:46, Александр Смирнов
> > >> <
> > >>>>>>>>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Hi Martijn!
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Got it. Therefore, the realization with
> > >>> InputFormat
> > >>>>> is
> > >>>>>>> not
> > >>>>>>>>>>>>> considered.
> > >>>>>>>>>>>>>>>>>>>>>>>> Thanks for clearing that up!
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 12, 2022 at 14:23, Martijn Visser <
> > >>>>>>>>>>>>> mart...@ververica.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> With regards to:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> But if there are plans to refactor all
> > >>> connectors
> > >>>>> to
> > >>>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Yes, FLIP-27 is the target for all connectors.
> > >>> The
> > >>>>> old
> > >>>>>>>>>>>>> interfaces will be
> > >>>>>>>>>>>>>>>>>>>>>>>>> deprecated and connectors will either be
> > >>>> refactored
> > >>>>> to
> > >>>>>>>>> use
> > >>>>>>>>>>>>> the new ones
> > >>>>>>>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>>>>>>> dropped.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> The caching should work for connectors that
> > >> are
> > >>>>> using
> > >>>>>>>>>>>> FLIP-27
> > >>>>>>>>>>>>> interfaces,
> > >>>>>>>>>>>>>>>>>>>>>>>>> we should not introduce new features for old
> > >>>>>>> interfaces.
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> Martijn
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 12 May 2022 at 06:19, Александр
> > >> Смирнов
> > >>> <
> > >>>>>>>>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jark!
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Sorry for the late response. I would like to
> > >>> make
> > >>>>>>> some
> > >>>>>>>>>>>>> comments and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> clarify my points.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 1) I agree with your first statement. I think
> > >>> we
> > >>>>> can
> > >>>>>>>>>>>> achieve
> > >>>>>>>>>>>>> both
> > >>>>>>>>>>>>>>>>>>>>>>>>>> advantages this way: put the Cache interface
> > >> in
> > >>>>>>>>>>>>> flink-table-common,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> but have implementations of it in
> > >>>>>>> flink-table-runtime.
> > >>>>>>>>>>>>> Therefore if a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connector developer wants to use existing
> > >> cache
> > >>>>>>>>>>> strategies
> > >>>>>>>>>>>>> and their
> > >>>>>>>>>>>>>>>>>>>>>>>>>> implementations, he can just pass
> > >> lookupConfig
> > >>> to
> > >>>>> the
> > >>>>>>>>>>>>> planner, but if
> > >>>>>>>>>>>>>>>>>>>>>>>>>> he wants to have his own cache implementation in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> his TableFunction, it will be possible for him to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> use the existing interface for this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> purpose (we can explicitly point this out in
> > >>> the
> > >>>>>>>>>>>>> documentation). In
> > >>>>>>>>>>>>>>>>>>>>>>>>>> this way all configs and metrics will be
> > >>> unified.
> > >>>>>>> WDYT?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > >>> cache,
> > >>>> we
> > >>>>>>> will
> > >>>>>>>>>>>>> have 90% of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> lookup requests that can never be cached
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 2) Let me clarify the logic of the filters
> > >>>>>>>>>>>>>>>>>>>>>>>>>> optimization in the case of the LRU cache. It
> > >>>>>>>>>>>>>>>>>>>>>>>>>> looks like Cache<RowData, Collection<RowData>>.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Here we always store the response of the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> dimension table in the cache, even after applying
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the calc function. I.e. if there are no rows left
> > >>>>>>>>>>>>>>>>>>>>>>>>>> after applying filters to the result of the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 'eval' method of TableFunction, we store an empty
> > >>>>>>>>>>>>>>>>>>>>>>>>>> list under the lookup keys. The cache line will
> > >>>>>>>>>>>>>>>>>>>>>>>>>> therefore still be filled, but will require much
> > >>>>>>>>>>>>>>>>>>>>>>>>>> less memory (in bytes). I.e. we don't completely
> > >>>>>>>>>>>>>>>>>>>>>>>>>> filter out keys whose result was pruned, but we
> > >>>>>>>>>>>>>>>>>>>>>>>>>> significantly reduce the memory required to store
> > >>>>>>>>>>>>>>>>>>>>>>>>>> that result. If the user knows about this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> behavior, he can increase the 'max-rows' option
> > >>>>>>>>>>>>>>>>>>>>>>>>>> before the start of the job. But actually I came
> > >>>>>>>>>>>>>>>>>>>>>>>>>> up with the idea that we can do this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> automatically by using the 'maximumWeight' and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 'weigher' methods of the Guava cache [1]. The
> > >>>>>>>>>>>>>>>>>>>>>>>>>> weight can be the size of the collection of rows
> > >>>>>>>>>>>>>>>>>>>>>>>>>> (the cache value). Therefore the cache can
> > >>>>>>>>>>>>>>>>>>>>>>>>>> automatically fit many more records than before.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
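[Editor's note] The maximumWeight/weigher idea can be illustrated with a small self-contained stand-in. Guava's CacheBuilder provides weighted eviction out of the box; this sketch only shows the accounting, with weight = number of cached rows, so keys whose results were pruned down to an empty list cost almost nothing:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy weighted LRU cache: eviction is driven by the total number of cached
// rows across all keys, not by the number of keys.
class WeightedLruCache {
    private final long maxWeight; // cap on total cached rows
    private long currentWeight = 0;
    private final LinkedHashMap<String, List<String>> map =
            new LinkedHashMap<>(16, 0.75f, true); // access order = LRU

    WeightedLruCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    void put(String key, List<String> rows) {
        List<String> old = map.put(key, rows);
        currentWeight += rows.size() - (old == null ? 0 : old.size());
        // Evict least-recently-used entries until back under the cap.
        Iterator<Map.Entry<String, List<String>>> it = map.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            currentWeight -= it.next().getValue().size();
            it.remove();
        }
    }

    List<String> get(String key) {
        return map.get(key);
    }

    int size() {
        return map.size();
    }
}
```

An empty result (a filtered-out key) has weight 0, so such entries survive eviction essentially for free, which is exactly why weighting by row count lets the cache hold many more keys than a plain max-entries limit.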
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Flink SQL has provided a standard way to do
> > >>>>> filters
> > >>>>>>> and
> > >>>>>>>>>>>>> projects
> > >>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>>>>>>>>>> SupportsProjectionPushDown.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > >>>>> interfaces,
> > >>>>>>>>>>> don't
> > >>>>>>>>>>>>> mean it's
> > >>>>>>>>>>>>>>>>>>>>>>>> hard
> > >>>>>>>>>>>>>>>>>>>>>>>>>> to implement.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> It's debatable how difficult it will be to
> > >>>>> implement
> > >>>>>>>>>>> filter
> > >>>>>>>>>>>>> pushdown.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> But I think the fact that currently there is
> > >> no
> > >>>>>>> database
> > >>>>>>>>>>>>> connector
> > >>>>>>>>>>>>>>>>>>>>>>>>>> with filter pushdown at least means that this
> > >>>>> feature
> > >>>>>>>>>>> won't
> > >>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>>>>>>>>>> supported soon in connectors. Moreover, if we
> > >>>>>>>>>>>>>>>>>>>>>>>>>> talk about other connectors (not in the Flink
> > >>>>>>>>>>>>>>>>>>>>>>>>>> repo), their databases might not support all
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink filters (or not support filters at all). I
> > >>>>>>>>>>>>>>>>>>>>>>>>>> think users are interested in having the cache
> > >>>>>>>>>>>>>>>>>>>>>>>>>> filters optimization independently of support for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> other features and of solving more complex (or
> > >>>>>>>>>>>>>>>>>>>>>>>>>> even unsolvable) problems.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 3) I agree with your third statement. Actually in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> our internal version I also tried to unify the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> logic of scanning and reloading data from
> > >>>>>>>>>>>>>>>>>>>>>>>>>> connectors. But unfortunately, I didn't find a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> way to unify the logic of all
> > >>>>>>>>>>>>>>>>>>>>>>>>>> ScanRuntimeProviders (InputFormat,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> SourceFunction, Source, ...) and reuse it in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> reloading the ALL cache. As a result I settled on
> > >>>>>>>>>>>>>>>>>>>>>>>>>> using InputFormat, because it was used for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> scanning in all lookup connectors. (I didn't know
> > >>>>>>>>>>>>>>>>>>>>>>>>>> that there are plans to deprecate InputFormat in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> favor of the FLIP-27 Source.) IMO usage of the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27 source in ALL caching is not a good idea,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> because this source was designed to work in a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> distributed environment (SplitEnumerator on the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> JobManager and SourceReaders on the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> TaskManagers), not in one operator (the lookup
> > >>>>>>>>>>>>>>>>>>>>>>>>>> join operator in our case). There is not even a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> direct way to pass splits from the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> SplitEnumerator to a SourceReader (this logic
> > >>>>>>>>>>>>>>>>>>>>>>>>>> works through SplitEnumeratorContext, which
> > >>>>>>>>>>>>>>>>>>>>>>>>>> requires OperatorCoordinator.SubtaskGateway to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> send AddSplitEvents). Usage of InputFormat for
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the ALL cache seems much clearer and easier. But
> > >>>>>>>>>>>>>>>>>>>>>>>>>> if there are plans to refactor all connectors to
> > >>>>>>>>>>>>>>>>>>>>>>>>>> FLIP-27, I have the following idea: maybe we can
> > >>>>>>>>>>>>>>>>>>>>>>>>>> drop the lookup join ALL cache in favor of a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> simple join with multiple scans of the batch
> > >>>>>>>>>>>>>>>>>>>>>>>>>> source? The point is that the only difference
> > >>>>>>>>>>>>>>>>>>>>>>>>>> between the lookup join ALL cache and a simple
> > >>>>>>>>>>>>>>>>>>>>>>>>>> join with a batch source is that in the first
> > >>>>>>>>>>>>>>>>>>>>>>>>>> case scanning is performed multiple times, in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> between which the state (cache) is cleared
> > >>>>>>>>>>>>>>>>>>>>>>>>>> (correct me if I'm wrong). So what if we extend
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the functionality of the simple join to support
> > >>>>>>>>>>>>>>>>>>>>>>>>>> state reloading + extend the functionality of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> scanning the batch source multiple times (this
> > >>>>>>>>>>>>>>>>>>>>>>>>>> should be easy with the new FLIP-27 source, which
> > >>>>>>>>>>>>>>>>>>>>>>>>>> unifies streaming/batch reading - we would only
> > >>>>>>>>>>>>>>>>>>>>>>>>>> need to change the SplitEnumerator so that it
> > >>>>>>>>>>>>>>>>>>>>>>>>>> passes splits again after some TTL)? WDYT? I must
> > >>>>>>>>>>>>>>>>>>>>>>>>>> say that this looks like a long-term goal and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> would make the scope of this FLIP even larger
> > >>>>>>>>>>>>>>>>>>>>>>>>>> than you said. Maybe we can limit ourselves to a
> > >>>>>>>>>>>>>>>>>>>>>>>>>> simpler solution now (InputFormats).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> So to sum up, my points are these:
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 1) There is a way to make both concise and
> > >>>>>>>>>>>>>>>>>>>>>>>>>> flexible interfaces for caching in lookup join.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 2) The cache filters optimization is important in
> > >>>>>>>>>>>>>>>>>>>>>>>>>> both the LRU and ALL caches.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 3) It is unclear when filter pushdown will be
> > >>>>>>>>>>>>>>>>>>>>>>>>>> supported in Flink connectors, some connectors
> > >>>>>>>>>>>>>>>>>>>>>>>>>> might not have the opportunity to support filter
> > >>>>>>>>>>>>>>>>>>>>>>>>>> pushdown + as far as I know, filter pushdown
> > >>>>>>>>>>>>>>>>>>>>>>>>>> currently works only for scanning (not lookup).
> > >>>>>>>>>>>>>>>>>>>>>>>>>> So the cache filters + projections optimization
> > >>>>>>>>>>>>>>>>>>>>>>>>>> should be independent from other features.
> > >>>>>>>>>>>>>>>>>>>>>>>>>> 4) The ALL cache implementation is a complex
> > >>>>>>>>>>>>>>>>>>>>>>>>>> topic that involves multiple aspects of how Flink
> > >>>>>>>>>>>>>>>>>>>>>>>>>> is developing. Dropping InputFormat in favor of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the FLIP-27 Source would make the ALL cache
> > >>>>>>>>>>>>>>>>>>>>>>>>>> implementation really complex and unclear, so
> > >>>>>>>>>>>>>>>>>>>>>>>>>> maybe instead we can extend the functionality of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> the simple join, or keep InputFormat for the
> > >>>>>>>>>>>>>>>>>>>>>>>>>> lookup join ALL cache?
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
> > >>>>>>>>>>>>>>>>>>>>>>>>>> Smirnov Alexander
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>>>>>>>> https://guava.dev/releases/18.0/api/docs/com/google/common/cache/CacheBuilder.html#weigher(com.google.common.cache.Weigher)
> > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 5, 2022 at 20:34, Jark Wu <
> > >>>>>>>>>>>>>>>>>>>>>>>>>> imj...@gmail.com>:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> It's great to see the active discussion! I
> > >>> want
> > >>>> to
> > >>>>>>>>> share
> > >>>>>>>>>>>> my
> > >>>>>>>>>>>>> ideas:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 1) implement the cache in framework vs.
> > >>>> connectors
> > >>>>>>> base
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> I don't have a strong opinion on this. Both
> > >>> ways
> > >>>>>>> should
> > >>>>>>>>>>>>> work (e.g.,
> > >>>>>>>>>>>>>>>>>>>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> pruning, compatibility).
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> The framework way can provide more concise
> > >>>>>>> interfaces.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> The connector base way can define more
> > >>> flexible
> > >>>>>>> cache
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> strategies/implementations.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> We are still investigating a way to see if
> > >> we
> > >>>> can
> > >>>>>>> have
> > >>>>>>>>>>>> both
> > >>>>>>>>>>>>>>>>>>>>>>>> advantages.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> We should reach a consensus that the way
> > >>> should
> > >>>>> be a
> > >>>>>>>>>>> final
> > >>>>>>>>>>>>> state,
> > >>>>>>>>>>>>>>>>>>>>>>>> and we
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> are on the path to it.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> 2) filters and projections pushdown:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> I agree with Alex that the filter pushdown
> > >>> into
> > >>>>>>> cache
> > >>>>>>>>>>> can
> > >>>>>>>>>>>>> benefit a
> > >>>>>>>>>>>>>>>>>>>>>>>> lot
> > >>>>>>>>>>>>>>>>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> ALL cache.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> However, this is not true for LRU cache.
> > >>>>> Connectors
> > >>>>>>> use
> > >>>>>>>>>>>>> cache to
> > >>>>>>>>>>>>>>>>>>>>>>>> reduce
> > >>>>>>>>>>>>>>>>>>>>>>>>>> IO
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> requests to databases for better throughput.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> If a filter can prune 90% of data in the
> > >>> cache,
> > >>>> we
> > >>>>>>> will
> > >>>>>>>>>>>>> have 90% of
> > >>>>>>>>>>>>>>>>>>>>>>>>>> lookup
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> requests that can never be cached
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> and hit directly to the databases. That
> > >> means
> > >>>> the
> > >>>>>>> cache
> > >>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>>>>> meaningless in
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> this case.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> IMO, Flink SQL has provided a standard way
> > >> to
> > >>> do
> > >>>>>>>>> filters
> > >>>>>>>>>>>>> and projects
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> pushdown, i.e., SupportsFilterPushDown and
> > >>>>>>>>>>>>>>>>>>>>>>>> SupportsProjectionPushDown.
> > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jdbc/hive/HBase haven't implemented the
> > >>>>> interfaces,
> > >>>>>>>>>>> don't
> > >>>>>>>>>>>>> mean it's
> > >>>>>>>>>>>>>>>>>>>>>>>> hard
to implement. They should implement the pushdown interfaces to reduce IO and the cache size. The final state should be that the scan source and the lookup source share the exact same pushdown implementation. I don't see why we need to duplicate the pushdown logic in caches, which would complicate the lookup join design.

3) ALL cache abstraction
The ALL cache might be the most challenging part of this FLIP. We have never provided a public reload-lookup interface. Currently we put the reload logic in the "eval" method of TableFunction, which is hard for some sources (e.g., Hive). Ideally, connector implementations should share the logic of reload and scan, i.e. ScanTableSource with InputFormat/SourceFunction/FLIP-27 Source. However, InputFormat/SourceFunction are deprecated, and the FLIP-27 source is deeply coupled with SourceOperator. If we want to invoke the FLIP-27 source in LookupJoin, this may make the scope of this FLIP much larger. We are still investigating how to abstract the ALL cache logic and reuse the existing source interfaces.

Best,
Jark

On Thu, 5 May 2022 at 20:22, Roman Boyko <ro.v.bo...@gmail.com> wrote:

It's a much more complicated activity and lies outside the scope of this improvement, because such pushdowns would have to be done for all ScanTableSource implementations (not only for the lookup ones).

On Thu, 5 May 2022 at 19:02, Martijn Visser <martijnvis...@apache.org> wrote:

Hi everyone,

One question regarding "And Alexander correctly mentioned that filter pushdown still is not implemented for jdbc/hive/hbase" -> Would an alternative solution be to actually implement these filter pushdowns? I can imagine that there are many more benefits to doing that, outside of lookup caching and metrics.

Best regards,

Martijn Visser
https://twitter.com/MartijnVisser82
https://github.com/MartijnVisser

On Thu, 5 May 2022 at 13:58, Roman Boyko <ro.v.bo...@gmail.com> wrote:

Hi everyone!

Thanks for driving such a valuable improvement!

I do think that a single cache implementation would be a nice opportunity for users. And it will break the "FOR SYSTEM_TIME AS OF proc_time" semantics anyway, no matter how it is implemented.

Putting myself in the user's shoes, I can say that:
1) I would prefer to have the opportunity to cut down the cache size by simply filtering out unnecessary data, and the handiest way to do that is to apply the filter inside the LookupRunners. It would be a bit harder to pass it through the LookupJoin node to the TableFunction. And Alexander correctly mentioned that filter pushdown still is not implemented for jdbc/hive/hbase.
2) The ability to set different caching parameters for different tables is quite important. So I would prefer to set them through DDL rather than have the same TTL, strategy and other options for all lookup tables.
3) Providing the cache in the framework deprives us of extensibility (users won't be able to implement their own cache). But most probably this can be solved by creating more cache strategies and a wider set of configurations.

All these points are much closer to the scheme proposed by Alexander. Qingsheng Ren, please correct me if I'm wrong and all these facilities can be easily implemented in your architecture?

Best regards,
Roman Boyko
e.: ro.v.bo...@gmail.com

On Wed, 4 May 2022 at 21:01, Martijn Visser <martijnvis...@apache.org> wrote:

Hi everyone,

I don't have much to chip in, but just wanted to express that I really appreciate the in-depth discussion on this topic and I hope that others will join the conversation.

Best regards,

Martijn

On Tue, 3 May 2022 at 10:15, Александр Смирнов <smirale...@gmail.com> wrote:

Hi Qingsheng, Leonard and Jark,

Thanks for your detailed feedback! However, I have questions about some of your statements (maybe I didn't get something?).

> Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time"

I agree that the semantics of "FOR SYSTEM_TIME AS OF proc_time" are not fully preserved with caching, but as you said, users accept that consciously to achieve better performance (no one proposed to enable caching by default, etc.). Or by users do you mean other developers of connectors? In that case developers explicitly specify whether their connector supports caching or not (in the list of supported options); no one makes them do that if they don't want to. So what exactly is the difference between implementing caching in flink-table-runtime and in flink-table-common from this point of view? How does it affect breaking or not breaking the semantics of "FOR SYSTEM_TIME AS OF proc_time"?

> confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be cautious

If we talk about the main difference in semantics between DDL options and config options ("table.exec.xxx"), isn't it about the scope of the options and their importance for the user's business logic, rather than the specific location of the corresponding logic in the framework? I mean that in my design, for example, putting an option with the lookup cache strategy into the configurations would be the wrong decision, because it directly affects the user's business logic (not just performance optimization) and touches just several functions of ONE table (there can be multiple tables with different caches). Does it really matter to the user (or anyone else) where the logic affected by the applied option is located? Also, I can recall the DDL option 'sink.parallelism', which in some way "controls the behavior of the framework", and I don't see any problem there.

> introduce a new interface for this all-caching scenario and the design would become more complex

This is a subject for a separate discussion, but actually in our internal version we solved this problem quite easily: we reused the InputFormat class (so there is no need for a new API). The point is that currently all lookup connectors use InputFormat for scanning the data in batch mode: HBase, JDBC and even Hive, which uses the class PartitionReader, actually just a wrapper around InputFormat. The advantage of this solution is the ability to reload cache data in parallel (the number of threads depends on the number of InputSplits, but has an upper limit). As a result, the cache reload time is significantly reduced (as well as the time the input stream is blocked). I know that we usually try to avoid concurrency in Flink code, but maybe this one can be an exception. BTW, I don't say it's an ideal solution; maybe there are better ones.
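To make the parallel-reload idea above concrete, here is a minimal plain-Java sketch of reloading an ALL cache with one task per input split and a bounded thread pool. This is not the actual Flink InputFormat API; `Row`, `loadSplit()` and the class name are hypothetical stand-ins for the connector's real types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each "split" is loaded concurrently into a shared
// concurrent map; the thread count is capped at maxThreads.
public class ParallelCacheReload {
    public record Row(String key, String value) {}

    // Stand-in for reading one InputSplit from the external system.
    static List<Row> loadSplit(int splitId) {
        return List.of(new Row("k" + splitId, "v" + splitId));
    }

    public static Map<String, Row> reload(int numSplits, int maxThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(numSplits, maxThreads));
        try {
            Map<String, Row> cache = new ConcurrentHashMap<>();
            List<Future<?>> tasks = new ArrayList<>();
            for (int i = 0; i < numSplits; i++) {
                final int splitId = i;
                tasks.add(pool.submit(() -> {
                    for (Row row : loadSplit(splitId)) {
                        cache.put(row.key(), row);
                    }
                }));
            }
            for (Future<?> t : tasks) {
                t.get(); // surface any per-split failure
            }
            return cache;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(reload(4, 2).size()); // 4
    }
}
```

In a real connector the loop body would iterate an InputFormat over its InputSplits; the point of the sketch is only that reload time shrinks roughly with the number of concurrently processed splits, up to the thread cap.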

> Providing the cache in the framework might introduce compatibility issues

That is possible only if the developer of a connector doesn't properly refactor the code and uses the new cache options incorrectly (i.e. explicitly provides the same options in two different code places). For correct behavior, all they need to do is redirect the existing options to the framework's LookupConfig (and maybe add aliases for options if the naming differed); everything will be transparent for users. If the developer doesn't refactor at all, nothing changes for the connector because of backward compatibility. Also, if a developer wants to use their own cache logic, they can simply refuse to pass some of the configs to the framework and instead provide their own implementation with the already existing configs and metrics (but I actually think that's a rare case).

> filters and projections should be pushed all the way down to the table function, like what we do in the scan source

That is a great goal. But the truth is that the ONLY connector that supports filter pushdown is FileSystemTableSource (no database connector supports it currently). Also, for some databases it's simply impossible to push down filters as complex as the ones we have in Flink.

> only applying these optimizations to the cache seems not quite useful

Filters can cut off an arbitrarily large amount of data from the dimension table. As a simple example, suppose the dimension table 'users' has a column 'age' with values from 20 to 40, and the input stream 'clicks' is roughly uniformly distributed by user age. With the filter 'age > 30' there will be half as much data in the cache, which means the user can increase 'lookup.cache.max-rows' by almost 2 times and gain a huge performance boost. Moreover, this optimization really starts to shine with the 'ALL' cache, where tables that can't fit in memory without filters and projections can fit with them. This opens up additional possibilities for users, and it doesn't sound like 'not quite useful'.
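The arithmetic in the 'users' example can be checked with a small self-contained sketch; `User`, `buildCache` and the field names are illustrative, not part of the FLIP. Applying a pushed-down filter before rows enter the cache means rejected rows never occupy memory at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: building an ALL cache with and without a
// pushed-down filter, for a 'users' table with ages uniform in 20..40.
public class FilteredCacheSize {
    public record User(int id, int age) {}

    public static Map<Integer, User> buildCache(List<User> scanned,
                                                Predicate<User> pushedFilter) {
        return scanned.stream()
                .filter(pushedFilter) // filtered rows never reach the cache
                .collect(Collectors.toMap(User::id, u -> u));
    }

    public static void main(String[] args) {
        List<User> users = new ArrayList<>();
        for (int id = 0; id < 210; id++) {
            users.add(new User(id, 20 + id % 21)); // ages uniform in 20..40
        }
        Map<Integer, User> unfiltered = buildCache(users, u -> true);
        Map<Integer, User> filtered = buildCache(users, u -> u.age() > 30);
        System.out.println(unfiltered.size()); // 210
        System.out.println(filtered.size());   // 100
    }
}
```

With 21 distinct ages and the filter keeping the 10 ages above 30, the cache holds roughly half the rows, which is exactly the headroom the 'lookup.cache.max-rows' argument above relies on.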
It would be great to hear other voices regarding this topic! Because we have quite a lot of controversial points, and I think with the help of others it will be easier for us to come to a consensus.

Best regards,
Smirnov Alexander

On Fri, 29 Apr 2022 at 22:33, Qingsheng Ren <renqs...@gmail.com> wrote:

Hi Alexander and Arvid,

Thanks for the discussion and sorry for my late response! We had an internal discussion together with Jark and Leonard, and I'd like to summarize our ideas. Instead of implementing the cache logic in the table runtime layer or wrapping it around the user-provided table function, we prefer to introduce some new APIs extending TableFunction, with these concerns:

1. Caching actually breaks the semantic of "FOR SYSTEM_TIME AS OF proc_time", because the cache couldn't truly reflect the content of the lookup table at the moment of querying. If users choose to enable caching on the lookup table, they implicitly indicate that this breakage is acceptable in exchange for the performance. So we prefer not to provide caching at the table runtime level.

2. If we make the cache implementation in the framework (whether in a runner or in a wrapper around TableFunction), we have to confront a situation that allows table options in DDL to control the behavior of the framework, which has never happened previously and should be treated cautiously. Under the current design, the behavior of the framework should only be specified by configurations ("table.exec.xxx"), and it's hard to apply these general configs to a specific table.

3. We have use cases where the lookup source loads and periodically refreshes all records in memory to achieve high lookup performance (like the Hive connector in the community, a pattern also widely used by our internal connectors). Wrapping the cache around the user's TableFunction works fine for LRU caches, but I think we would have to introduce a new interface for this all-caching scenario, and the design would become more complex.

4. Providing the cache in the framework might introduce compatibility issues to existing lookup sources: there might exist two caches with totally different strategies if the user configures the table incorrectly (one in the framework and another implemented by the lookup source).

As for the optimization mentioned by Alexander, I think filters and projections should be pushed all the way down to the table function, like what we do in the scan source, instead of into the runner with the cache. The goal of using a cache is to reduce network I/O and the pressure on the external system, and applying these optimizations only to the cache seems not quite useful.

I made some updates to the FLIP [1] to reflect our ideas. We prefer to keep the cache implementation as a part of TableFunction, and we could provide some helper classes (CachingTableFunction, AllCachingTableFunction, CachingAsyncTableFunction) to developers and regulate the metrics of the cache. Also, I made a POC [2] for your reference.
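To illustrate what a CachingTableFunction-style helper could look like, here is a simplified, framework-free sketch of an LRU cache in front of a user-supplied lookup, with hit/miss counters standing in for the regulated metrics. `CachingLookup` and its methods are hypothetical stand-ins, not the FLIP-221 API; a plain `Function` replaces the connector's eval().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache wrapping a lookup, bounded by maxRows.
public class CachingLookup<K, V> {
    private final Function<K, V> delegate;
    private final int maxRows;
    private final LinkedHashMap<K, V> cache;
    private long hits, misses;

    public CachingLookup(Function<K, V> delegate, int maxRows) {
        this.delegate = delegate;
        this.maxRows = maxRows;
        // access-order LinkedHashMap evicts the least recently used entry
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > CachingLookup.this.maxRows;
            }
        };
    }

    public V lookup(K key) {
        V value = cache.get(key);
        if (value != null) {
            hits++;                  // served from cache, no external I/O
            return value;
        }
        misses++;
        value = delegate.apply(key); // fall through to the external system
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

The all-caching variant would differ mainly in population: instead of filling lazily on misses, it would bulk-load all rows on a refresh schedule, which is why a separate interface is being discussed above.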

Looking forward to your ideas!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
[2] https://github.com/PatrickRen/flink/tree/FLIP-221

Best regards,

Qingsheng

> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 26, 2022 at 4:45 PM
> > >>> Александр
> > >>>>>>> Смирнов
> > >>>>>>>>>>> <
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> smirale...@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> Thanks for the response, Arvid!
>>
>> I have a few comments on your message.
>>
>>> but could also live with an easier solution as the first step:
>>
>> I think that these 2 ways are mutually exclusive (the one originally proposed by Qingsheng and mine), because conceptually they follow the same goal, but the implementation details are different. If we go one way, moving to the other way in the future will mean deleting existing code and once again changing the API for connectors. So I think we should reach a consensus with the community about that and then work together on this FLIP, i.e. divide the work into tasks for different parts of the FLIP (for example, LRU cache unification / introducing the proposed set of metrics / further work…). WDYT, Qingsheng?
>>
>>> as the source will only receive the requests after filter
>>
>> Actually, if filters are applied to fields of the lookup table, we must first do the requests, and only after that can we filter the responses, because lookup connectors don't have filter pushdown. So if filtering is done before caching, there will be far fewer rows in the cache.
>>
>>> @Alexander unfortunately, your architecture is not shared. I don't know the
>>> solution to share images to be honest.
>>
>> Sorry for that, I'm a bit new to such kinds of conversations :)
>> I have no write access to the confluence, so I made a Jira issue, where I described the proposed changes in more detail: https://issues.apache.org/jira/browse/FLINK-27411.
>>
>> Will be happy to get more feedback!
>>
>> Best,
>> Smirnov Alexander
>>
>> On Mon, Apr 25, 2022 at 19:49, Arvid Heise <ar...@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>> Hi Qingsheng,
>>>
>>> Thanks for driving this; the inconsistency was not satisfying for me.
>>>
>>> I second Alexander's idea though, but could also live with an easier solution as the first step: instead of making caching an implementation detail of TableFunction X, rather devise a caching layer around X. So the proposal would be a CachingTableFunction that delegates to X in case of misses and else manages the cache. Lifting it into the operator model as proposed would be even better but is probably unnecessary in the first step for a lookup source (as the source will only receive the requests after filter; applying projection may be more interesting to save memory).
>>>
>>> Another advantage is that all the changes of this FLIP would be limited to options, with no need for new public interfaces. Everything else remains an implementation detail of the Table runtime. That means we can easily incorporate the optimization potential that Alexander pointed out later.
>>>
>>> @Alexander unfortunately, your architecture is not shared. I don't know the solution to share images to be honest.
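The caching layer Arvid describes above can be sketched as a wrapper that serves repeated keys from a local map and delegates misses to the underlying lookup. This is only an illustration of the delegation pattern; the class and method names are hypothetical, not actual Flink API, and a real implementation would wrap a TableFunction and use a bounded, evicting cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a caching wrapper around an existing lookup: hits are served
// from a local map, misses are delegated and the result is remembered.
public class CachingLookup<K, V> {
    private final Function<K, V> delegate; // stands in for the wrapped lookup function
    private final Map<K, V> cache = new HashMap<>();
    private long hits;
    private long misses;

    public CachingLookup(Function<K, V> delegate) {
        this.delegate = delegate;
    }

    public V lookup(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            hits++; // served locally, no external I/O
            return cached;
        }
        misses++; // delegate to the actual lookup and remember the result
        V value = delegate.apply(key);
        cache.put(key, value);
        return value;
    }

    public long hits() { return hits; }
    public long misses() { return misses; }
}
```

The hit/miss counters hint at where the FLIP's standard cache metrics would attach; in a real design they would be reported through the operator's metric group rather than plain fields.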
>>> On Fri, Apr 22, 2022 at 5:04 PM Александр Смирнов <smirale...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>> Hi Qingsheng! My name is Alexander, I'm not a committer yet, but I'd really like to become one. And this FLIP really interested me. Actually I have worked on a similar feature in my company's Flink fork, and we would like to share our thoughts on this and make the code open source.
>>>>
>>>> I think there is a better alternative than introducing an abstract class for TableFunction (CachingTableFunction). As you know, TableFunction exists in the flink-table-common module, which provides only an API for working with tables – it's very convenient for importing in connectors. In turn, CachingTableFunction contains logic for runtime execution, so this class and everything connected with it should be located in another module, probably in flink-table-runtime. But this would require connectors to depend on another module that contains a lot of runtime logic, which doesn't sound good.
>>>>
>>>> I suggest adding a new method 'getLookupConfig' to LookupTableSource or LookupRuntimeProvider to allow connectors to only pass configurations to the planner, so they won't depend on the runtime realization. Based on these configs the planner will construct a lookup join operator with the corresponding runtime logic (ProcessFunctions in module flink-table-runtime). The architecture looks like in the pinned image (the LookupConfig class there is actually your CacheConfig).
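The 'getLookupConfig' idea above amounts to handing the planner a plain configuration object and letting the planner, not the connector, choose the runtime operator. A minimal sketch, with class, field, and runner names as illustrative assumptions rather than the FLIP's actual API:

```java
import java.time.Duration;

// Hypothetical config object a connector could return from getLookupConfig().
// The connector states only *what* caching it wants; the planner decides
// *which* runtime operator to build from it.
public class LookupConfig {
    public final boolean cacheEnabled;
    public final long maxCachedRows;
    public final Duration expireAfterWrite;

    public LookupConfig(boolean cacheEnabled, long maxCachedRows, Duration expireAfterWrite) {
        this.cacheEnabled = cacheEnabled;
        this.maxCachedRows = maxCachedRows;
        this.expireAfterWrite = expireAfterWrite;
    }

    // What a planner rule could derive from the config when translating the
    // lookup join: a caching runner when caching is on, a plain one otherwise.
    public String runnerClassName() {
        return cacheEnabled ? "LookupJoinCachingRunner" : "LookupJoinRunner";
    }
}
```

This keeps connectors free of any dependency on flink-table-runtime: they only ever see the config class, which could live in flink-table-common.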
>>>> The classes in flink-table-planner that will be responsible for this are CommonPhysicalLookupJoin and its inheritors.
>>>> The current classes for lookup join in flink-table-runtime are LookupJoinRunner, AsyncLookupJoinRunner, LookupJoinRunnerWithCalc and AsyncLookupJoinRunnerWithCalc.
>>>>
>>>> I suggest adding classes LookupJoinCachingRunner, LookupJoinCachingRunnerWithCalc, etc.
>>>>
>>>> And here comes another, more powerful advantage of such a solution. If we have the caching logic on a lower level, we can apply some optimizations to it. LookupJoinRunnerWithCalc was named like this because it uses the 'calc' function, which actually mostly consists of filters and projections.
>>>>
>>>> For example, when joining table A with lookup table B with the condition 'JOIN … ON A.id = B.id AND A.age = B.age + 10 WHERE B.salary > 1000', the 'calc' function will contain the filters A.age = B.age + 10 and B.salary > 1000.
>>>>
>>>> If we apply this function before storing records in the cache, the size of the cache will be significantly reduced: filters = avoid storing useless records in the cache, projections = reduce the records' size. So the initial max number of records in the cache can be increased by the user.
>>>>
>>>> What do you think about it?
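The calc-before-cache optimization described above can be sketched as follows, with rows modeled as plain maps purely for illustration (the class and method names are hypothetical, not Flink's actual runtime classes):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: run the join's 'calc' (filter + projection) on a looked-up row
// BEFORE it enters the cache. Filtered-out rows are never stored, and kept
// rows shrink to only the projected fields.
public class CalcBeforeCache {
    static Map<String, Object> applyCalc(
            Map<String, Object> row,
            Predicate<Map<String, Object>> filter,
            List<String> projectedFields) {
        if (!filter.test(row)) {
            return null; // e.g. a row with B.salary <= 1000 is never cached
        }
        Map<String, Object> projected = new HashMap<>();
        for (String field : projectedFields) {
            projected.put(field, row.get(field)); // keep only needed fields
        }
        return projected;
    }
}
```

With the salary example above, a row failing B.salary > 1000 returns null and never occupies a cache slot, while surviving rows carry only the projected columns, so the same memory budget admits more entries.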
>>>> On 2022/04/19 02:47:11 Qingsheng Ren wrote:
>>>>> Hi devs,
>>>>>
>>>>> Yuan and I would like to start a discussion about FLIP-221 [1], which introduces an abstraction of the lookup table cache and its standard metrics.
>>>>>
>>>>> Currently each lookup table source has to implement its own cache to store lookup results, and there isn't a standard set of metrics for users and developers to tune their jobs with lookup joins, which is a quite common use case in Flink Table / SQL.
>>>>>
>>>>> Therefore we propose some new APIs including cache, metrics, wrapper classes of TableFunction and new table options. Please take a look at the FLIP page [1] to get more details. Any suggestions and comments would be appreciated!
>>>>>
>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-221+Abstraction+for+lookup+source+cache+and+metric
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Qingsheng
>
> --
> Best Regards,
>
> Qingsheng Ren
>
> Real-time Computing Team
> Alibaba Cloud
>
> Email: renqs...@gmail.com
>> --
>> Best regards,
>> Roman Boyko
>> e.: ro.v.bo...@gmail.com