This is an automated email from the ASF dual-hosted git repository.

ipolyzos pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git


The following commit(s) were added to refs/heads/main by this push:
     new eb75d5570 [docs] Document COMPACTED table format (#2264)
eb75d5570 is described below

commit eb75d557096f7af2a1bfca1e285153394d99ea8d
Author: Priya Manjare <[email protected]>
AuthorDate: Tue Jan 6 01:10:56 2026 +0530

    [docs] Document COMPACTED table format (#2264)
    
    * [docs] Document COMPACTED table format
    
    * [docs] Add detailed explanation for COMPACTED format
    
    * [docs] Add Data Encodings documentation
    
    * [docs] Remove standalone COMPACTED doc in favor of Data Encodings page
    
    * docs: updated Data Encodings page
    
    * docs: improve Data Encodings overview
    
    * small improvements
    
    ---------
    
    Co-authored-by: ipolyzos <[email protected]>
---
 website/docs/engine-flink/options.md      |   2 +-
 website/docs/table-design/data-formats.md | 137 ++++++++++++++++++++++++++++++
 2 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/website/docs/engine-flink/options.md 
b/website/docs/engine-flink/options.md
index 3c06c3ccb..ad389b1f5 100644
--- a/website/docs/engine-flink/options.md
+++ b/website/docs/engine-flink/options.md
@@ -73,7 +73,7 @@ See more details about [ALTER TABLE ... 
SET](engine-flink/ddl.md#set-properties)
 | table.auto-partition.num-retention      | Integer  | 7                       
            | The number of history partitions to retain for auto created 
partitions in each check for auto partition. For example, if the current check 
time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then 
the history partitions 20241108, 20241109, 20241110 will be retained. The 
partitions earlier than 20241108 will be deleted. The default value is 7, which 
means that 7 partitions will  [...]
 | table.auto-partition.time-zone          | String   | the system time zone    
            | The time zone for auto partitions, which is by default the same 
as the system time zone.                                                        
                                                                                
                                                                                
                                                                                
                [...]
 | table.replication.factor                | Integer  | (None)                  
            | The replication factor for the log of the new table. When it's 
not set, Fluss will use the cluster's default replication factor configured by 
default.replication.factor. It should be a positive number and not larger than 
the number of tablet servers in the Fluss cluster. A value larger than the 
number of tablet servers in Fluss cluster will result in an error when the new 
table is created.        [...]
-| table.log.format                        | Enum     | ARROW                   
            | The format of the log records in log store. The default value is 
`ARROW`. The supported formats are `ARROW` and `INDEXED`.                       
                                                                                
                                                                                
                                                                                
               [...]
+| table.log.format                        | Enum     | ARROW                   
            | The format of the log records in log store. The default value is 
`ARROW`. The supported formats are `ARROW`, `INDEXED` and `COMPACTED`.          
                                                                                
                                                                                
                                                                                
               [...]
 | table.log.arrow.compression.type        | Enum     | ZSTD                    
            | The compression type of the log records if the log format is set 
to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The 
default value is `ZSTD`.                                                        
                                                                                
                                                                                
                [...]
 | table.log.arrow.compression.zstd.level  | Integer  | 3                       
            | The compression level of the log records if the log format is set 
to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to 
22. The default value is 3.                                                     
                                                                                
                                                                                
                [...]
 | table.kv.format                         | Enum     | COMPACTED               
            | The format of the kv records in kv store. The default value is 
`COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`.               
                                                                                
                                                                                
                                                                                
                 [...]
diff --git a/website/docs/table-design/data-formats.md 
b/website/docs/table-design/data-formats.md
new file mode 100644
index 000000000..d509312a2
--- /dev/null
+++ b/website/docs/table-design/data-formats.md
@@ -0,0 +1,137 @@
+---
+title: Storage Formats
+---
+
+In Fluss, a storage format primarily defines **how data is stored and 
accessed**. Each format is designed to balance storage efficiency, read 
performance, and query capabilities.
+
+This page describes the available formats in Fluss and provides guidance on 
selecting the appropriate format based on workload characteristics.
+
+### How to Think About Formats in Fluss
+
+At a high level, a format determines:
+- How data is laid out on disk (columnar vs row-oriented)
+- How efficiently data can be scanned, filtered, or projected
+- Whether the workload is optimized for streaming scans or key-based access
+
+Formats in Fluss determine:
+
+- CPU vs IO trade-offs
+- Scan-heavy vs lookup-heavy workloads
+- Analytical vs operational access patterns
+
+---
+
+## Log Format and KV Format
+
+In Fluss, storage formats can be used in two different ways, depending on how 
the data is accessed.
+
+- **Log format** is designed for reading data in order, as it is written.
+  It is commonly used for streaming workloads, append-only tables, and 
changelog-style data.
+
+- **KV format** is designed for accessing data by key.
+  It is used for workloads where queries look up or update values using a key 
and only the most recent value for each key is needed.
+
+ARROW can be used as log format, while COMPACTED supports both log and KV 
formats.
+
+## ARROW Format (Default)
+
+### Overview
+
+ARROW is the **default log format** in Fluss. It stores data in a columnar 
layout, organizing information by columns rather than rows. This layout is well 
suited for analytical and streaming workloads.
+
+### Key Features
+
+- **Column pruning**: Reads only the columns required by a query
+- **Predicate pushdown**: Applies filters efficiently at the storage layer
+- **Arrow ecosystem integration**: Compatible with Arrow-based processing 
frameworks
+
+### When to Use ARROW
+
+ARROW is recommended for:
+- Analytical queries that access a subset of columns
+- Streaming workloads with selective column reads
+- General-purpose tables with varying query patterns
+- Workloads that benefit from predicate pushdown
+
+### ARROW Trade-offs
+
+ARROW is less efficient for workloads that:
+- Always read all columns
+- Mostly access individual rows by key
+
+---
+
+## COMPACTED Format
+
+### Overview
+
+COMPACTED uses a **row-oriented format** that focuses on reducing storage size 
and CPU usage. It is optimized for workloads where queries typically access 
entire rows rather than individual columns.
+
+### Key Features
+
+- **Reduced storage overhead**: Variable-length encoding minimizes disk usage
+- **Lower CPU overhead**: Efficient when all columns are accessed together
+- **Row-oriented access**: Optimized for full-row reads
+- **Key-value support**: Can be configured for key-based access patterns
+
+### When to Use COMPACTED
+
+COMPACTED is recommended for:
+- Tables where queries usually select all columns
+- Large vector or embedding tables
+- Pre-aggregated results or materialized views
+- Denormalized or joined tables
+- Workloads that prioritize storage efficiency over selective column access
+
+---
+
+## Configuration
+
+To enable the COMPACTED format for log data, set the `table.log.format` option:
+
+```sql
+CREATE TABLE my_table (
+  id BIGINT,
+  data STRING,
+  PRIMARY KEY (id) NOT ENFORCED
+) WITH (
+  'table.log.format' = 'COMPACTED'
+);
+```
+
+### COMPACTED with WAL Changelog Image
+
+For key-based workloads that only require the **latest value per key**, the 
COMPACTED format can be used for both log and kv data, combined with the WAL 
changelog image mode.
+
+```sql
+CREATE TABLE kv_table (
+  key STRING,
+  value STRING,
+  PRIMARY KEY (key) NOT ENFORCED
+) WITH (
+  'table.log.format' = 'COMPACTED',
+  'table.kv.format' = 'COMPACTED',
+  'table.changelog.image' = 'WAL'
+);
+```
+### COMPACTED Trade-offs
+
+COMPACTED is not recommended when:
+- Queries need to read only a few columns from a table
+- Filters are applied to reduce the amount of data read
+- Analytical workloads require flexible access to individual columns
+- Historical changes or full changelog data must be preserved
+
+## ARROW vs COMPACTED
+
+| Feature                | ARROW                               | COMPACTED     
                     |
+|------------------------|-------------------------------------|------------------------------------|
+| Physical layout        | Columnar                            | Row-oriented  
                     |
+| Typical access pattern | Scans with projection & filters     | Full-row 
reads or key lookups      |
+| Column pruning         | ✅ Yes                             | ❌ No            
                  |
+| Predicate pushdown     | ✅ Yes                              | ❌ No           
                  |
+| Storage efficiency     | Good                                | Excellent     
                     |
+| CPU efficiency         | Better for selective reads          | Better for 
full-row reads          |
+| Log format             | ✅ Yes                              | ✅ Yes          
                  |
+| KV format              | ❌ No                               | ✅ Yes          
                  |
+| Best suited for        | Analytics workloads                 | State tables 
/ materialized data   |
\ No newline at end of file

Reply via email to