Aitozi opened a new issue, #7876:
URL: https://github.com/apache/paimon/issues/7876

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   ## Motivation
   
   Many tables store semi-structured attributes as `MAP<STRING, STRING>`, for 
example HTTP headers,
   request parameters, or extensible entity properties. This model is flexible, 
but it can be
   inefficient when a small set of keys is present in most rows and is 
frequently used by queries:
   
   - common keys such as `content-type`, `user-agent`, `locale`, or `campaign` 
are repeatedly stored in
     the map branch;
   - filters and projections frequently access a small set of these keys;
   - a standard Parquet map stores keys and values in the repeated `key_value` 
branch, so readers must
     scan the map payload even when a query only needs one common key.
   
   This proposal adds a Paimon-aware Parquet encoding for selected map columns. 
The logical table
   schema remains `MAP<STRING, STRING>`, while the physical Parquet file stores 
configured hot keys as
   independent sidecar columns.
   
   ## Goals
   
   - Keep the Paimon logical type unchanged. Users still read and write a 
normal map column.
   - Avoid introducing a custom Parquet logical type.
   - Reduce storage by removing promoted hot keys from the residual map branch, and
     improve compression by storing their values as independent columns.
   - Enable column pruning and future predicate pushdown for expressions such as
     `headers['content-type']`.
   - Preserve compatibility for files written without map shredding.
   
   ## Non-Goals
   
   - This proposal does not define a new Parquet standard logical type.
   - The first implementation only supports top-level `MAP<STRING, STRING>` 
columns.
   - Dynamic key discovery is left for a follow-up. The initial version uses 
static keys configured in
     table options.
   
   ## Physical Layout
   
   For a logical column:
   
   ```text
   headers MAP<STRING, STRING>
   ```
   
   Paimon may write the following Parquet physical layout:
   
   ```text
   headers                         // standard Parquet MAP, containing only residual keys
   __paimon_map_shred_headers_0    // value for configured hot key 0
   __paimon_map_shred_headers_1    // value for configured hot key 1
   ...
   ```
   
   For example, if the configured keys are `content-type,user-agent`, a row:
   
   ```text
   headers = {
     "content-type": "application/json",
     "user-agent": "mobile-app",
     "x-request-id": "req-001"
   }
   ```
   
   is written as:
   
   ```text
   headers = {
     "x-request-id": "req-001"
   }
   __paimon_map_shred_headers_0 = "application/json"
   __paimon_map_shred_headers_1 = "mobile-app"
   ```
   
   When Paimon reads the file, it reconstructs the logical map by merging the 
residual map with the
   sidecar values.
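
The write and read paths above can be sketched in a few lines of Python. This is only an illustration of the layout, not the proposed implementation: the helper names `sidecar_name`, `shred`, and `reconstruct` are hypothetical, rows are modeled as plain dictionaries, and null hot-key values are left in the residual map as described under Semantics.

```python
from typing import Dict, List, Optional, Tuple

# Sidecar naming follows the layout shown above: __paimon_map_shred_<column>_<index>.
SIDECAR_PREFIX = "__paimon_map_shred_"

LogicalMap = Dict[str, Optional[str]]


def sidecar_name(column: str, index: int) -> str:
    """Hypothetical helper: physical column name for the hot key at `index`."""
    return f"{SIDECAR_PREFIX}{column}_{index}"


def shred(
    column: str, row_map: LogicalMap, hot_keys: List[str]
) -> Tuple[LogicalMap, Dict[str, Optional[str]]]:
    """Split one logical map value into (residual map, sidecar values)."""
    residual = dict(row_map)
    sidecars: Dict[str, Optional[str]] = {}
    for index, key in enumerate(hot_keys):
        value = row_map.get(key)
        if value is not None:
            # Non-null hot-key value: promote it and drop it from the residual map.
            sidecars[sidecar_name(column, index)] = value
            del residual[key]
        else:
            # Absent key or explicit null: the sidecar is null and the
            # residual map is left exactly as it was.
            sidecars[sidecar_name(column, index)] = None
    return residual, sidecars


def reconstruct(
    column: str,
    residual: LogicalMap,
    sidecars: Dict[str, Optional[str]],
    hot_keys: List[str],
) -> LogicalMap:
    """Merge the residual map with non-null sidecar values."""
    merged = dict(residual)
    for index, key in enumerate(hot_keys):
        value = sidecars.get(sidecar_name(column, index))
        if value is not None:
            merged[key] = value
    return merged
```

Calling `shred("headers", row, ["content-type", "user-agent"])` on the example row yields the residual map and the two sidecar values shown above, and `reconstruct` returns the original entries. The sketch compares maps by content only; it makes no claim about entry order after reconstruction.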
   
   ## Metadata
   
   The encoding is a Paimon physical optimization. Paimon records the configured keys in Parquet file
   metadata, under keys of the form:
   
   ```text
   paimon.map.shredding.<column>.keys
   ```
   
   Table options define the writer and reader contract:
   
   ```text
   parquet.map.shredding.columns = headers
   parquet.map.shredding.headers.keys = content-type,user-agent,locale
   ```
   
   If a file does not contain sidecar columns, Paimon reads the map as a normal 
Parquet map.
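
A minimal sketch of how the options and the file metadata could be interpreted is shown below. It assumes that both the column list and the key lists are comma-separated strings, and that the file metadata value uses the same format as the table option; the proposal does not spell out either detail. The function names are hypothetical.

```python
from typing import Dict, List


def _split_csv(raw: str) -> List[str]:
    """Split a comma-separated option value, dropping empty items and padding."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def parse_table_options(options: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each shredded column to its ordered hot keys (writer side)."""
    result: Dict[str, List[str]] = {}
    for column in _split_csv(options.get("parquet.map.shredding.columns", "")):
        keys = _split_csv(options.get(f"parquet.map.shredding.{column}.keys", ""))
        if keys:
            result[column] = keys
    return result


def hot_keys_for_file(file_metadata: Dict[str, str], column: str) -> List[str]:
    """Hot keys recorded in one file's metadata (reader side).

    An empty list means the file has no sidecar columns for `column`,
    so the reader treats it as a normal Parquet map.
    """
    raw = file_metadata.get(f"paimon.map.shredding.{column}.keys")
    return [] if raw is None else _split_csv(raw)
```

Reading the keys from each file rather than from the current table options keeps older files readable after the option changes. One limitation of this sketch is that a comma-separated list cannot represent a key that itself contains a comma.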
   
   ## Semantics
   
   - The residual map is a valid standard Parquet map.
   - Promoted non-null hot-key values are removed from the residual map and 
written to sidecar columns.
   - Null hot-key values remain in the residual map. This preserves the 
distinction between an explicit
     null value and an absent key.
   - External Parquet readers that are not Paimon-aware see the residual map and the sidecar
     columns as separate, unrelated columns. They cannot reconstruct the original logical map
     without Paimon metadata and semantics.
   
   ## Query Benefits
   
   The encoding makes it possible to rewrite:
   
   ```sql
   headers['content-type']
   ```
   
   to the corresponding sidecar column when the key is configured as hot. This 
enables:
   
   - reading only the sidecar column for frequently accessed key projections;
   - skipping the repeated map branch for frequently accessed key filters;
   - using Parquet column statistics and page indexes for these predicates in a 
follow-up.
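
The rewrite can be sketched as a small resolution step, shown here under stated assumptions: `resolve_map_access` is a hypothetical function, not an existing Paimon API, and `hot_keys_by_column` stands for the hot keys recorded for the file being read.

```python
from typing import Dict, List, Tuple


def resolve_map_access(
    column: str, key: str, hot_keys_by_column: Dict[str, List[str]]
) -> Tuple[str, str]:
    """Decide how to evaluate `column[key]` against one file.

    Returns ("sidecar", <physical column>) when the key is hot in this file,
    otherwise ("map_lookup", <map column>) to fall back to the map branch.
    """
    hot_keys = hot_keys_by_column.get(column, [])
    if key in hot_keys:
        index = hot_keys.index(key)
        return "sidecar", f"__paimon_map_shred_{column}_{index}"
    return "map_lookup", column
```

For a hot key the sidecar alone is enough to answer the lookup, assuming a lookup of a missing key yields null: the sidecar is null exactly when the key is absent or explicitly null, and both cases produce a null result. Files written without map shredding have no recorded hot keys, so every access falls back to the map branch.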
   
   ## Storage Benefits
   
   The encoding avoids repeating common key strings in every map entry and lets 
promoted values be
   compressed as normal scalar columns. This is especially useful when the same 
keys appear in a large
   fraction of rows.
   
   ## Compatibility
   
   This is not a general Parquet MAP representation. It is a Paimon-aware 
residual encoding. The
   advantage over a custom Parquet logical type is that the file still uses 
standard Parquet columns and
   does not require changes in parquet-format or parquet-mr.
   
   The trade-off is that old Paimon versions and generic Parquet readers cannot 
reconstruct the complete
   logical map. This is acceptable only when the table explicitly enables the 
feature.
   
   ## Future Work
   
   - Support nested map columns.
   - Infer hot keys from buffered rows, similar to Variant shredding schema 
inference.
   - Push down hot-key projection and predicates from Spark/Flink planners to 
the Parquet reader.
   
   ### Solution
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

