luoyuxia commented on code in PR #1686: URL: https://github.com/apache/fluss/pull/1686#discussion_r2342827044
##########
website/docs/quickstart/flink.md:
##########
@@ -346,10 +346,13 @@ The following SQL query should return an empty result.
SELECT * FROM fluss_customer WHERE `cust_key` = 1;
```
-## Integrate with Paimon
+## Integrate with Data Lake
Review Comment:
Also, do not change this doc. Mixing in other lake formats makes it hard to follow.
We can have a separate doc for Iceberg.
##########
website/docs/install-deploy/overview.md:
##########
@@ -117,7 +117,8 @@ We have listed them in the table below the figure.
</td>
<td>
<li>[Paimon](maintenance/tiered-storage/lakehouse-storage.md)</li>
Review Comment:
Update all of these to
```
<li>[Paimon](streaming-lakehouse/integrate-data-lakes/paimon.md)</li>
<li>[Iceberg](streaming-lakehouse/integrate-data-lakes/iceberg.md)</li>
<li>[Lance](streaming-lakehouse/integrate-data-lakes/lance.md)</li>
```
##########
website/docs/maintenance/tiered-storage/lakehouse-storage.md:
##########
@@ -45,8 +45,10 @@ datalake.paimon.uri: thrift://<hive-metastore-host-name>:<port>
datalake.paimon.warehouse: hdfs:///path/to/warehouse
```
#### Add other jars required by datalake
-While Fluss includes the core Paimon library, additional jars may still need to be manually added to `${FLUSS_HOME}/plugins/paimon/` according to your needs.
-For example, for OSS filesystem support, you need to put `paimon-oss-<paimon_version>.jar` into directory `${FLUSS_HOME}/plugins/paimon/`.
+While Fluss includes the core libraries for supported data lake formats, additional jars may still need to be manually added according to your needs.
+For Paimon: Put additional jars into `${FLUSS_HOME}/plugins/paimon/`, e.g., for OSS filesystem support, put `paimon-oss-<paimon_version>.jar`
Review Comment:
Do not change this doc. I don't want to mix other formats into it, since that would make the guidance hard to follow.
##########
website/docs/maintenance/tiered-storage/lakehouse-storage.md:
##########
@@ -58,10 +60,17 @@ Then, you must start the datalake tiering service to tier Fluss's data to the la
- Put [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`, you should choose a connector version matching your Flink version. If you're using Flink 1.20, please use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS(Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md), you should download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
-- Put [fluss-lake-paimon jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`
+- For Paimon integration:
Review Comment:
Ditto. Do not change this, as it makes the doc hard to follow.
##########
website/docs/maintenance/configuration.md:
##########
@@ -163,9 +163,9 @@ during the Fluss cluster working.
## Lakehouse
-| Option | Type | Default | Description |
-|-----------------|------|---------|-------------|
-| datalake.format | Enum | (None) | The datalake format used by of Fluss to be as lakehouse storage, such as Paimon, Iceberg, Hudi. Now, only support Paimon. |
+| Option | Type | Default | Description |
+|-----------------|------|---------|-------------|
+| datalake.format | Enum | (None) | The datalake format used by of Fluss to be as lakehouse storage. Currently, supported formats are Paimon, Iceberg, and Lance. In the future, more kinds of data lake format will be supported, such as DeltaLake or Hudi. |
Review Comment:
Also update this description in `ConfigOptions`.
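For reference, a minimal sketch of what the matching `ConfigOptions` change could look like, assuming Fluss uses a Flink-style `ConfigOption` builder; the `key`/`enumType`/`noDefaultValue`/`withDescription` methods, the `DATALAKE_FORMAT` field, and the `DataLakeFormat` enum are assumptions here, not taken from the actual source:
```java
// Hypothetical fragment of the ConfigOptions class; the builder methods follow
// the Flink-style pattern and may not match the actual Fluss API exactly.
public static final ConfigOption<DataLakeFormat> DATALAKE_FORMAT =
        key("datalake.format")
                .enumType(DataLakeFormat.class)
                .noDefaultValue()
                .withDescription(
                        "The datalake format used by Fluss as lakehouse storage. "
                                + "Currently, the supported formats are Paimon, Iceberg, "
                                + "and Lance. In the future, more data lake formats will "
                                + "be supported, such as DeltaLake or Hudi.");
```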
##########
website/docs/engine-flink/options.md:
##########
@@ -60,29 +60,29 @@ ALTER TABLE log_table SET ('table.log.ttl' = '7d');
## Storage Options
-| Option | Type | Default | Description |
-|-----------------------------------------|----------|-------------------------------------|-------------|
-| bucket.num | int | The bucket number of Fluss cluster. | The number of buckets of a Fluss table. |
-| bucket.key | String | (None) | Specific the distribution policy of the Fluss table. Data will be distributed to each bucket according to the hash value of bucket-key (It must be a subset of the primary keys excluding partition keys of the primary key table). If you specify multiple fields, delimiter is `,`. If the table has a primary key and a bucket key is not specified, the bucket key will be used as primary key(excluding the partition key). If the table has no primary key and the bucket key is not specified, the data will be distributed to each bucket randomly. |
-| table.log.ttl | Duration | 7 days | The time to live for log segments. The configuration controls the maximum time we will retain a log before we will delete old segments to free up space. If set to -1, the log will not be deleted. |
-| table.auto-partition.enabled | Boolean | false | Whether enable auto partition for the table. Disable by default. When auto partition is enabled, the partitions of the table will be created automatically. |
-| table.auto-partition.key | String | (None) | This configuration defines the time-based partition key to be used for auto-partitioning when a table is partitioned with multiple keys. Auto-partitioning utilizes a time-based partition key to handle partitions automatically, including creating new ones and removing outdated ones, by comparing the time value of the partition with the current system time. In the case of a table using multiple partition keys (such as a composite partitioning strategy), this feature determines which key should serve as the primary time dimension for making auto-partitioning decisions. And If the table has only one partition key, this config is not necessary. Otherwise, it must be specified. |
-| table.auto-partition.time-unit | ENUM | DAY | The time granularity for auto created partitions. The default value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If the value is `DAY`, the partition format for auto created is yyyyMMdd. If the value is `MONTH`, the partition format for auto created is yyyyMM. If the value is `QUARTER`, the partition format for auto created is yyyyQ. If the value is `YEAR`, the partition format for auto created is yyyy. |
-| table.auto-partition.num-precreate | Integer | 2 | The number of partitions to pre-create for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11 and the value is configured as 3, then partitions 20241111, 20241112, 20241113 will be pre-created. If any one partition exists, it'll skip creating the partition. The default value is 2, which means 2 partitions will be pre-created. If the `table.auto-partition.time-unit` is `DAY`(default), one precreated partition is for today and another one is for tomorrow. For a partition table with multiple partition keys, pre-create is unsupported and will be set to 0 automatically when creating table if it is not explicitly specified. |
-| table.auto-partition.num-retention | Integer | 7 | The number of history partitions to retain for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then the history partitions 20241108, 20241109, 20241110 will be retained. The partitions earlier than 20241108 will be deleted. The default value is 7, which means that 7 partitions will be retained. |
-| table.auto-partition.time-zone | String | the system time zone | The time zone for auto partitions, which is by default the same as the system time zone. |
-| table.replication.factor | Integer | (None) | The replication factor for the log of the new table. When it's not set, Fluss will use the cluster's default replication factor configured by default.replication.factor. It should be a positive number and not larger than the number of tablet servers in the Fluss cluster. A value larger than the number of tablet servers in Fluss cluster will result in an error when the new table is created. |
-| table.log.format | Enum | ARROW | The format of the log records in log store. The default value is `ARROW`. The supported formats are `ARROW` and `INDEXED`. |
-| table.log.arrow.compression.type | Enum | ZSTD | The compression type of the log records if the log format is set to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The default value is `ZSTD`. |
-| table.log.arrow.compression.zstd.level | Integer | 3 | The compression level of the log records if the log format is set to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to 22. The default value is 3. |
-| table.kv.format | Enum | COMPACTED | The format of the kv records in kv store. The default value is `COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`. |
-| table.log.tiered.local-segments | Integer | 2 | The number of log segments to retain in local for each table when log tiered storage is enabled. It must be greater that 0. The default is 2. |
-| table.datalake.enabled | Boolean | false | Whether enable lakehouse storage for the table. Disabled by default. When this option is set to ture and the datalake tiering service is up, the table will be tiered and compacted into datalake format stored on lakehouse storage. |
-| table.datalake.format | Enum | (None) | The data lake format of the table specifies the tiered Lakehouse storage format, such as Paimon, Iceberg, DeltaLake, or Hudi. Currently, only `paimon` is supported. Once the `table.datalake.format` property is configured, Fluss adopts the key encoding and bucketing strategy used by the corresponding data lake format. This ensures consistency in key encoding and bucketing, enabling seamless **Union Read** functionality across Fluss and Lakehouse. The `table.datalake.format` can be pre-defined before enabling `table.datalake.enabled`. This allows the data lake feature to be dynamically enabled on the table without requiring table recreation. If `table.datalake.format` is not explicitly set during table creation, the table will default to the format specified by the `datalake.format` configuration in the Fluss cluster |
-| table.datalake.freshness | Duration | 3min | It defines the maximum amount of time that the datalake table's content should lag behind updates to the Fluss table. Based on this target freshness, the Fluss service automatically moves data from the Fluss table and updates to the datalake table, so that the data in the datalake table is kept up to date within this target. If the data does not need to be as fresh, you can specify a longer target freshness time to reduce costs. |
-| table.datalake.auto-compaction | Boolean | false | If true, compaction will be triggered automatically when tiering service writes to the datalake. It is disabled by default. |
-| table.merge-engine | Enum | (None) | Defines the merge engine for the primary key table. By default, primary key table uses the [default merge engine(last_row)](table-design/table-types/pk-table/merge-engines/default.md). It also supports two merge engines are `first_row` and `versioned`. The [first_row merge engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep the first row of the same primary key. The [versioned merge engine](table-design/table-types/pk-table/merge-engines/versioned.md) will keep the row with the largest version of the same primary key. |
-| table.merge-engine.versioned.ver-column | String | (None) | The column name of the version column for the `versioned` merge engine. If the merge engine is set to `versioned`, the version column must be set. |
+| Option | Type | Default | Description |
+|-----------------------------------------|----------|-------------------------------------|-------------|
+| bucket.num | int | The bucket number of Fluss cluster. | The number of buckets of a Fluss table. |
+| bucket.key | String | (None) | Specific the distribution policy of the Fluss table. Data will be distributed to each bucket according to the hash value of bucket-key (It must be a subset of the primary keys excluding partition keys of the primary key table). If you specify multiple fields, delimiter is `,`. If the table has a primary key and a bucket key is not specified, the bucket key will be used as primary key(excluding the partition key). If the table has no primary key and the bucket key is not specified, the data will be distributed to each bucket randomly. |
+| table.log.ttl | Duration | 7 days | The time to live for log segments. The configuration controls the maximum time we will retain a log before we will delete old segments to free up space. If set to -1, the log will not be deleted. |
+| table.auto-partition.enabled | Boolean | false | Whether enable auto partition for the table. Disable by default. When auto partition is enabled, the partitions of the table will be created automatically. |
+| table.auto-partition.key | String | (None) | This configuration defines the time-based partition key to be used for auto-partitioning when a table is partitioned with multiple keys. Auto-partitioning utilizes a time-based partition key to handle partitions automatically, including creating new ones and removing outdated ones, by comparing the time value of the partition with the current system time. In the case of a table using multiple partition keys (such as a composite partitioning strategy), this feature determines which key should serve as the primary time dimension for making auto-partitioning decisions. And If the table has only one partition key, this config is not necessary. Otherwise, it must be specified. |
+| table.auto-partition.time-unit | ENUM | DAY | The time granularity for auto created partitions. The default value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If the value is `DAY`, the partition format for auto created is yyyyMMdd. If the value is `MONTH`, the partition format for auto created is yyyyMM. If the value is `QUARTER`, the partition format for auto created is yyyyQ. If the value is `YEAR`, the partition format for auto created is yyyy. |
+| table.auto-partition.num-precreate | Integer | 2 | The number of partitions to pre-create for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11 and the value is configured as 3, then partitions 20241111, 20241112, 20241113 will be pre-created. If any one partition exists, it'll skip creating the partition. The default value is 2, which means 2 partitions will be pre-created. If the `table.auto-partition.time-unit` is `DAY`(default), one precreated partition is for today and another one is for tomorrow. For a partition table with multiple partition keys, pre-create is unsupported and will be set to 0 automatically when creating table if it is not explicitly specified. |
+| table.auto-partition.num-retention | Integer | 7 | The number of history partitions to retain for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then the history partitions 20241108, 20241109, 20241110 will be retained. The partitions earlier than 20241108 will be deleted. The default value is 7, which means that 7 partitions will be retained. |
+| table.auto-partition.time-zone | String | the system time zone | The time zone for auto partitions, which is by default the same as the system time zone. |
+| table.replication.factor | Integer | (None) | The replication factor for the log of the new table. When it's not set, Fluss will use the cluster's default replication factor configured by default.replication.factor. It should be a positive number and not larger than the number of tablet servers in the Fluss cluster. A value larger than the number of tablet servers in Fluss cluster will result in an error when the new table is created. |
+| table.log.format | Enum | ARROW | The format of the log records in log store. The default value is `ARROW`. The supported formats are `ARROW` and `INDEXED`. |
+| table.log.arrow.compression.type | Enum | ZSTD | The compression type of the log records if the log format is set to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The default value is `ZSTD`. |
+| table.log.arrow.compression.zstd.level | Integer | 3 | The compression level of the log records if the log format is set to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to 22. The default value is 3. |
+| table.kv.format | Enum | COMPACTED | The format of the kv records in kv store. The default value is `COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`. |
+| table.log.tiered.local-segments | Integer | 2 | The number of log segments to retain in local for each table when log tiered storage is enabled. It must be greater that 0. The default is 2. |
+| table.datalake.enabled | Boolean | false | Whether enable lakehouse storage for the table. Disabled by default. When this option is set to ture and the datalake tiering service is up, the table will be tiered and compacted into datalake format stored on lakehouse storage. |
+| table.datalake.format | Enum | (None) | The data lake format of the table specifies the tiered Lakehouse storage format. Currently, supported formats are `paimon`, `iceberg`, and `lance`. In the future, more kinds of data lake format will be supported, such as DeltaLake or Hudi. Once the `table.datalake.format` property is configured, Fluss adopts the key encoding and bucketing strategy used by the corresponding data lake format. This ensures consistency in key encoding and bucketing, enabling seamless **Union Read** functionality across Fluss and Lakehouse. The `table.datalake.format` can be pre-defined before enabling `table.datalake.enabled`. This allows the data lake feature to be dynamically enabled on the table without requiring table recreation. If `table.datalake.format` is not explicitly set during table creation, the table will default to the format specified by the `datalake.format` configuration in the Fluss cluster. |
Review Comment:
Also update this description in `ConfigOptions`.
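Ditto here: a hedged sketch of the corresponding `table.datalake.format` description update in `ConfigOptions`, with the same caveat that the field name, enum type, and builder methods are illustrative assumptions rather than the actual Fluss source:
```java
// Hypothetical fragment only; names are illustrative, not the actual Fluss source.
public static final ConfigOption<DataLakeFormat> TABLE_DATALAKE_FORMAT =
        key("table.datalake.format")
                .enumType(DataLakeFormat.class)
                .noDefaultValue()
                .withDescription(
                        "The data lake format of the table, which specifies the tiered "
                                + "lakehouse storage format. Currently, the supported formats "
                                + "are paimon, iceberg, and lance. If not set explicitly at "
                                + "table creation, it defaults to the cluster-level "
                                + "'datalake.format' configuration.");
```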
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
