I just created https://github.com/apache/incubator-paimon/pull/2101 for file scanning.
Best,
Jingsong

On Mon, Oct 9, 2023 at 9:42 AM Jingsong Li <[email protected]> wrote:
>
> Hi xiangyu,
>
> Very glad to hear from you.
> You are very welcome to participate in the development, and I believe
> we have many techniques to share.
>
> Snapshot File Scanner should be something like
> `MultiTablesStreamingCompactorSourceFunction`. A StreamTableScan can
> continuously read files from the table.
>
> Best,
> Jingsong
>
> On Sun, Oct 8, 2023 at 9:20 PM xiangyu feng <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > Thanks for bringing up this discussion. This is exactly what we want
> > for Paimon, and we have met real user cases internally at ByteDance.
> >
> > Let me introduce our situation first. We have a search business
> > partner that needs to periodically join large tables with small
> > tables. The large table is around 100TB and unshuffled; the small
> > table is around 100GB. Our users don't want to shuffle and sort the
> > big table in the first place, since that is very resource- and
> > time-consuming. Meanwhile, the small table is also too large to be
> > broadcast.
> >
> > To solve this problem, we have launched a long-running Flink job as a
> > lookup service. In this job, each subtask initializes a LevelDB
> > instance locally with partitioned small-table files, registers its
> > meta information to ZK for service discovery, and provides a lookup
> > gRPC service. A lookup client is also offered for users to call this
> > RPC service. Then we use a separate map-only job to scan the large
> > table and perform the lookup join through the client. In this way,
> > our users can finish the join operation in hours.
> >
> > This architecture has worked well for our users for years, and
> > recently they have been trying to upgrade it to improve overall
> > performance and usability. With the QueryService provided by Paimon,
> > I believe we can solve this problem in a more general way.
> >
> > So overall I'm a big +1 for this new feature. My colleagues and I are
> > also more than willing to participate in the development and help
> > evolve this feature in production.
> >
> > For the design doc, I'm curious about how the Snapshot File Scanner
> > will be designed and implemented. It would be great if we could get
> > more information about this.
> >
> > Regards,
> > Xiangyu
> >
> > Jingsong Li <[email protected]> wrote on Sun, Oct 8, 2023 at 18:35:
> >>
> >> Hi all,
> >>
> >> I want to bring up a discussion about Paimon QueryService [1].
> >>
> >> Paimon primary key tables already provide an LSM file structure; it
> >> is a pity that Paimon cannot provide a queryable service for
> >> lookups.
> >>
> >> A distributed service can download Paimon files locally and provide
> >> a lookup service. It does not affect the write process or the read
> >> process; it is a separate server. It can be used for:
> >>
> >> 1. Flink Lookup Join, reusable by multiple Flink jobs.
> >> 2. Online Service Lookup, which requires high stability. (It may not
> >> be so stable in the first version.)
> >>
> >> See more in the PIP [1].
> >>
> >> This PIP is a high-level design for Paimon QueryService and does not
> >> include all details.
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/PAIMON/PIP-10%3A+Introduce+Paimon+QueryService
> >>
> >> Best,
> >> Jingsong
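
For readers following the thread, the lookup-service pattern xiangyu describes (partitioned local stores plus a map-only scan of the large table) can be sketched roughly as below. This is only an illustration under stated assumptions, not the ByteDance job and not Paimon's QueryService: in-memory maps stand in for the per-subtask LevelDB instances, a fixed shard list stands in for ZK service discovery, a direct method call stands in for the gRPC lookup, and hash partitioning by key is assumed. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a partitioned lookup join; all names are hypothetical. */
public class LookupJoinSketch {

    /** One lookup "subtask" holding a single partition of the small table. */
    static final class LookupShard {
        // Assumption: a HashMap stands in for the local LevelDB store.
        private final Map<String, String> store = new HashMap<>();

        void load(String key, String value) {
            store.put(key, value);
        }

        String get(String key) {
            return store.get(key);
        }
    }

    /** Client that routes each key to the shard that owns it. */
    static final class LookupClient {
        // Assumption: a fixed list stands in for shards discovered through ZK.
        private final List<LookupShard> shards;

        LookupClient(List<LookupShard> shards) {
            this.shards = shards;
        }

        /** Assumed routing rule: non-negative hash of the key modulo shard count. */
        static int shardFor(String key, int numShards) {
            return Math.floorMod(key.hashCode(), numShards);
        }

        /** In the real service this would be a remote call; here it is local. */
        String lookup(String key) {
            return shards.get(shardFor(key, shards.size())).get(key);
        }
    }

    /** Distributes the small table across shards using the client's routing rule. */
    static LookupClient buildService(Map<String, String> smallTable, int numShards) {
        List<LookupShard> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new LookupShard());
        }
        for (Map.Entry<String, String> e : smallTable.entrySet()) {
            int shard = LookupClient.shardFor(e.getKey(), numShards);
            shards.get(shard).load(e.getKey(), e.getValue());
        }
        return new LookupClient(shards);
    }

    /** Map-only inner join: scans large-table rows in order, no shuffle or sort. */
    static List<String> joinLargeTable(List<String[]> largeRows, LookupClient client) {
        List<String> joined = new ArrayList<>();
        for (String[] row : largeRows) {
            String dim = client.lookup(row[0]); // row[0] is the join key
            if (dim != null) {
                joined.add(row[0] + "," + row[1] + "," + dim);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("k1", "dimA");
        small.put("k2", "dimB");
        LookupClient client = buildService(small, 4);

        List<String[]> large = new ArrayList<>();
        large.add(new String[] {"k2", "row1"});
        large.add(new String[] {"k9", "row2"}); // no match in the small table
        large.add(new String[] {"k1", "row3"});

        System.out.println(joinLargeTable(large, client));
    }
}
```

The point of the sketch is that the large table is only scanned once in its existing order, while the small table is split so that no single process has to hold all of it; the client and the loader must agree on the same routing rule for lookups to hit the right shard.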
