Recently, after discussions within the team, we have some new ideas for everyone to consider:

Idea A: Create an external table in Hive that reads Paimon tag data
directly (data that has not been tagged yet is not readable this way),
while Paimon itself keeps reading all data in real time. The same data
is stored in one location but can be accessed in two different ways,
which might require reorganizing the underlying file directory
structure. The design would add options to trigger merging when tagging
and to specify a directory when tagging:

1. If a directory is specified, the original Hive table does not need
   to be changed.
2. If no directory is specified, an external Hive table is created and
   pointed to the `/data` directory.

If you want to read historical partitions, you can create a view on the
Hive side.

Idea B: I took a look at the table upgrade logic in Iceberg's
`MigrateTableSparkAction`. I am considering implementing a Hive-to-Paimon
table upgrade based on the same approach. It would support basic table
upgrades, such as from a Hive table to an append-only table.

Feel free to provide feedback or ask questions about these proposed
ideas.

On Fri, Aug 25, 2023 at 15:32, 陈卓宇 <[email protected]> wrote:

> I believe that we are onto an exciting prospect with this idea. Here are
> the specific needs that our company could foresee, given the theme:
>
> 1. **Transition to Paimon Tables from Hive ODS Tables**: Our current
> system boasts a significant number of Hive ODS tables, with partitions set
> daily. Each of these partitions encapsulates comprehensive business data
> sourced directly from MySQL. We are contemplating an in-place transition to
> Paimon tables. The rationale behind this move is twofold: First, it would
> obviate the need to modify the SQL code amidst the existing plethora of
> Hive batch processing logic. Secondly, this transition promises the
> advantage of real-time data access, shrinking the delay to mere minutes and
> also adding the benefit of stream reading capabilities.
>
> 2. **Integration with Historical Hive Partitions**: The Hive system has
> been an integral part of our data structure, with over a thousand
> partitions to its credit. Ideally, a view table that can meld the
> functionalities of a Paimon table and the vastness of historical Hive
> partitions would be a valuable addition. In such a scenario, users
> interacting with this view table would be directed to the Paimon tag when a
> tag is present, and to the historical Hive partitions in its absence.
>
> 3. **Tag-Based Processing with 'dt'**: We employ a tagging system rooted
> in the 'dt' parameter. Keeping this in mind, processing using these tags
> should ideally support a range of operations, such as "between and",
> comparative functions like greater than or less than, and even group by
> operations centered around these tags. To illustrate, the system should be
> adept at handling queries akin to:
> ```SQL
> SELECT dt, COUNT(*) FROM table WHERE dt BETWEEN a AND b GROUP BY dt
> ```
>
> Best,
> ZhuoyuChen
>
> On Fri, Aug 25, 2023 at 13:58, Jingsong Li <[email protected]> wrote:
>
>> Hi, devs.
>>
>> Paimon now supports tags, which provide a snapshot view for time
>> travel. Tags can play a role similar to partitions and replace Hive
>> full partitioned tables and incremental partitioned tables.
>>
>> But this requires users to change their SQL to use time travel, and
>> time travel is not convenient to use in Hive SQL right now.
>>
>> So I plan to add a new feature, a view table: we can create a view
>> table that maps a non-partitioned table to a partitioned table whose
>> partition field is the tag. This feature would make a Paimon table
>> 100% compatible with the old Hive table.
>>
>> What do you think?
>>
>> Any requirements?
>>
>> Best,
>> Jingsong
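
A rough sketch of the view semantics discussed in this thread (read the Paimon tag for a `dt` when one exists, otherwise fall back to the historical Hive partition). This is only an illustration under loud assumptions: SQLite stands in for Hive, and the names `hive_history`, `paimon_tags`, `ods_view`, and `count_by_dt` are hypothetical. It is not how Paimon or Hive would implement the feature.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the historical Hive partitions (partition column: dt).
CREATE TABLE hive_history (id INTEGER, dt TEXT);
-- Hypothetical stand-in for Paimon tag snapshots, with the tag name exposed as dt.
CREATE TABLE paimon_tags (id INTEGER, dt TEXT);

INSERT INTO hive_history VALUES
    (1, '2023-08-22'), (2, '2023-08-22'),
    (1, '2023-08-23'),
    (9, '2023-08-24');  -- stale copy; a tag also exists for this dt
INSERT INTO paimon_tags VALUES
    (1, '2023-08-24'), (2, '2023-08-24'), (3, '2023-08-24'),
    (1, '2023-08-25');

-- Use the tag when one exists for a dt; otherwise fall back to the Hive partition.
CREATE VIEW ods_view AS
SELECT id, dt FROM paimon_tags
UNION ALL
SELECT h.id, h.dt FROM hive_history AS h
WHERE NOT EXISTS (SELECT 1 FROM paimon_tags AS p WHERE p.dt = h.dt);
""")


def count_by_dt(start, end):
    """Run the range + GROUP BY query from the thread against the view."""
    return conn.execute(
        "SELECT dt, COUNT(*) FROM ods_view "
        "WHERE dt BETWEEN ? AND ? GROUP BY dt ORDER BY dt",
        (start, end),
    ).fetchall()
```

With this sample data, `count_by_dt("2023-08-22", "2023-08-24")` counts the first two days from the Hive stand-in and the third from the tag stand-in, so existing batch SQL that filters and groups on `dt` would not need to change.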
