luoyuxia commented on code in PR #1686: URL: https://github.com/apache/fluss/pull/1686#discussion_r2342827044
##########
website/docs/quickstart/flink.md:
##########
@@ -346,10 +346,13 @@ The following SQL query should return an empty result.
SELECT * FROM fluss_customer WHERE `cust_key` = 1;
```
-## Integrate with Paimon
+## Integrate with Data Lake
Review Comment:
Also, do not change this doc. Mixing in other lake formats makes it hard to follow.
We can have a separate doc for Iceberg.
##########
website/docs/install-deploy/overview.md:
##########
@@ -117,7 +117,8 @@ We have listed them in the table below the figure.
</td>
<td>
<li>[Paimon](maintenance/tiered-storage/lakehouse-storage.md)</li>
Review Comment:
Update all of these to
```
<li>[Paimon](streaming-lakehouse/integrate-data-lakes/paimon.md)</li>
<li>[Iceberg](streaming-lakehouse/integrate-data-lakes/iceberg.md)</li>
<li>[Lance](streaming-lakehouse/integrate-data-lakes/lance.md)</li>
```
##########
website/docs/maintenance/tiered-storage/lakehouse-storage.md:
##########
@@ -45,8 +45,10 @@ datalake.paimon.uri: thrift://<hive-metastore-host-name>:<port>
datalake.paimon.warehouse: hdfs:///path/to/warehouse
```
#### Add other jars required by datalake
-While Fluss includes the core Paimon library, additional jars may still need to be manually added to `${FLUSS_HOME}/plugins/paimon/` according to your needs.
-For example, for OSS filesystem support, you need to put `paimon-oss-<paimon_version>.jar` into directory `${FLUSS_HOME}/plugins/paimon/`.
+While Fluss includes the core libraries for supported data lake formats, additional jars may still need to be manually added according to your needs.
+For Paimon: Put additional jars into `${FLUSS_HOME}/plugins/paimon/`, e.g., for OSS filesystem support, put `paimon-oss-<paimon_version>.jar`
Review Comment:
Do not change this doc. I don't want to mix other formats into it, since that would make the guidance hard to follow.
##########
website/docs/maintenance/tiered-storage/lakehouse-storage.md:
##########
@@ -58,10 +60,17 @@ Then, you must start the datalake tiering service to tier Fluss's data to the la
- Put [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`, you should choose a connector version matching your Flink version. If you're using Flink 1.20, please use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS(Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md), you should download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
-- Put [fluss-lake-paimon jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar) into `${FLINK_HOME}/lib`
+- For Paimon integration:
Review Comment:
Ditto. Do not change this, as it makes the doc hard to follow.
##########
website/docs/maintenance/configuration.md:
##########
@@ -163,9 +163,9 @@ during the Fluss cluster working.
## Lakehouse
-| Option | Type | Default | Description |
-|-----------------|------|---------|-------------|
-| datalake.format | Enum | (None) | The datalake format used by of Fluss to be as lakehouse storage, such as Paimon, Iceberg, Hudi. Now, only support Paimon. |
+| Option | Type | Default | Description |
+|-----------------|------|---------|-------------|
+| datalake.format | Enum | (None) | The datalake format used by of Fluss to be as lakehouse storage. Currently, supported formats are Paimon, Iceberg, and Lance. In the future, more kinds of data lake format will be supported, such as DeltaLake or Hudi. |
Review Comment:
Also update this description in `ConfigOptions`.
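For reference, a minimal sketch of what the matching `ConfigOptions` change could look like, assuming Fluss uses a Flink-style `ConfigOption` builder; the `key`/`enumType`/`noDefaultValue`/`withDescription` methods, the `DATALAKE_FORMAT` field, and the `DataLakeFormat` enum are assumptions here, not taken from the actual source:
```java
// Hypothetical fragment of the ConfigOptions class; the builder methods follow
// the Flink-style pattern and may not match the actual Fluss API exactly.
public static final ConfigOption<DataLakeFormat> DATALAKE_FORMAT =
        key("datalake.format")
                .enumType(DataLakeFormat.class)
                .noDefaultValue()
                .withDescription(
                        "The datalake format used by Fluss as lakehouse storage. "
                                + "Currently, the supported formats are Paimon, Iceberg, "
                                + "and Lance. In the future, more data lake formats will "
                                + "be supported, such as DeltaLake or Hudi.");
```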
##########
website/docs/engine-flink/options.md:
##########
@@ -60,29 +60,29 @@ ALTER TABLE log_table SET ('table.log.ttl' = '7d');
## Storage Options
-| Option | Type | Default | Description |
-|-----------------------------------------|----------|-------------------------------------|-------------|
-| bucket.num | int | The bucket number of Fluss cluster. | The number of buckets of a Fluss table. |
-| bucket.key | String | (None) | Specific the distribution policy of the Fluss table. Data will be distributed to each bucket according to the hash value of bucket-key (It must be a subset of the primary keys excluding partition keys of the primary key table). If you specify multiple fields, delimiter is `,`. If the table has a primary key and a bucket key is not specified, the bucket key will be used as primary key(excluding the partition key). If the table has no primary key and the bucket key is not specified, the data will be distributed to each bucket randomly. |
-| table.log.ttl | Duration | 7 days | The time to live for log segments. The configuration controls the maximum time we will retain a log before we will delete old segments to free up space. If set to -1, the log will not be deleted. |
-| table.auto-partition.enabled | Boolean | false | Whether enable auto partition for the table. Disable by default. When auto partition is enabled, the partitions of the table will be created automatically. |
-| table.auto-partition.key | String | (None) | This configuration defines the time-based partition key to be used for auto-partitioning when a table is partitioned with multiple keys. Auto-partitioning utilizes a time-based partition key to handle partitions automatically, including creating new ones and removing outdated ones, by comparing the time value of the partition with the current system time. In the case of a table using multiple partition keys (such as a composite partitioning strategy), this feature determines which key should serve as the primary time dimension for making auto-partitioning decisions. And If the table has only one partition key, this config is not necessary. Otherwise, it must be specified. |
-| table.auto-partition.time-unit | ENUM | DAY | The time granularity for auto created partitions. The default value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If the value is `DAY`, the partition format for auto created is yyyyMMdd. If the value is `MONTH`, the partition format for auto created is yyyyMM. If the value is `QUARTER`, the partition format for auto created is yyyyQ. If the value is `YEAR`, the partition format for auto created is yyyy. |
-| table.auto-partition.num-precreate | Integer | 2 | The number of partitions to pre-create for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11 and the value is configured as 3, then partitions 20241111, 20241112, 20241113 will be pre-created. If any one partition exists, it'll skip creating the partition. The default value is 2, which means 2 partitions will be pre-created. If the `table.auto-partition.time-unit` is `DAY`(default), one precreated partition is for today and another one is for tomorrow. For a partition table with multiple partition keys, pre-create is unsupported and will be set to 0 automatically when creating table if it is not explicitly specified. |
-| table.auto-partition.num-retention | Integer | 7 | The number of history partitions to retain for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then the history partitions 20241108, 20241109, 20241110 will be retained. The partitions earlier than 20241108 will be deleted. The default value is 7, which means that 7 partitions will be retained. |
-| table.auto-partition.time-zone | String | the system time zone | The time zone for auto partitions, which is by default the same as the system time zone. |
-| table.replication.factor | Integer | (None) | The replication factor for the log of the new table. When it's not set, Fluss will use the cluster's default replication factor configured by default.replication.factor. It should be a positive number and not larger than the number of tablet servers in the Fluss cluster. A value larger than the number of tablet servers in Fluss cluster will result in an error when the new table is created. |
-| table.log.format | Enum | ARROW | The format of the log records in log store. The default value is `ARROW`. The supported formats are `ARROW` and `INDEXED`. |
-| table.log.arrow.compression.type | Enum | ZSTD | The compression type of the log records if the log format is set to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The default value is `ZSTD`. |
-| table.log.arrow.compression.zstd.level | Integer | 3 | The compression level of the log records if the log format is set to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to 22. The default value is 3. |
-| table.kv.format | Enum | COMPACTED | The format of the kv records in kv store. The default value is `COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`. |
-| table.log.tiered.local-segments | Integer | 2 | The number of log segments to retain in local for each table when log tiered storage is enabled. It must be greater that 0. The default is 2. |
-| table.datalake.enabled | Boolean | false | Whether enable lakehouse storage for the table. Disabled by default. When this option is set to ture and the datalake tiering service is up, the table will be tiered and compacted into datalake format stored on lakehouse storage. |
-| table.datalake.format | Enum | (None) | The data lake format of the table specifies the tiered Lakehouse storage format, such as Paimon, Iceberg, DeltaLake, or Hudi. Currently, only `paimon` is supported. Once the `table.datalake.format` property is configured, Fluss adopts the key encoding and bucketing strategy used by the corresponding data lake format. This ensures consistency in key encoding and bucketing, enabling seamless **Union Read** functionality across Fluss and Lakehouse. The `table.datalake.format` can be pre-defined before enabling `table.datalake.enabled`. This allows the data lake feature to be dynamically enabled on the table without requiring table recreation. If `table.datalake.format` is not explicitly set during table creation, the table will default to the format specified by the `datalake.format` configuration in the Fluss cluster |
-| table.datalake.freshness | Duration | 3min | It defines the maximum amount of time that the datalake table's content should lag behind updates to the Fluss table. Based on this target freshness, the Fluss service automatically moves data from the Fluss table and updates to the datalake table, so that the data in the datalake table is kept up to date within this target. If the data does not need to be as fresh, you can specify a longer target freshness time to reduce costs. |
-| table.datalake.auto-compaction | Boolean | false | If true, compaction will be triggered automatically when tiering service writes to the datalake. It is disabled by default. |
-| table.merge-engine | Enum | (None) | Defines the merge engine for the primary key table. By default, primary key table uses the [default merge engine(last_row)](table-design/table-types/pk-table/merge-engines/default.md). It also supports two merge engines are `first_row` and `versioned`. The [first_row merge engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep the first row of the same primary key. The [versioned merge engine](table-design/table-types/pk-table/merge-engines/versioned.md) will keep the row with the largest version of the same primary key. |
-| table.merge-engine.versioned.ver-column | String | (None) | The column name of the version column for the `versioned` merge engine. If the merge engine is set to `versioned`, the version column must be set. |
+| Option | Type | Default | Description |
+|-----------------------------------------|----------|-------------------------------------|-------------|
+| bucket.num | int | The bucket number of Fluss cluster. | The number of buckets of a Fluss table. |
+| bucket.key | String | (None) | Specific the distribution policy of the Fluss table. Data will be distributed to each bucket according to the hash value of bucket-key (It must be a subset of the primary keys excluding partition keys of the primary key table). If you specify multiple fields, delimiter is `,`. If the table has a primary key and a bucket key is not specified, the bucket key will be used as primary key(excluding the partition key). If the table has no primary key and the bucket key is not specified, the data will be distributed to each bucket randomly. |
+| table.log.ttl | Duration | 7 days | The time to live for log segments. The configuration controls the maximum time we will retain a log before we will delete old segments to free up space. If set to -1, the log will not be deleted. |
+| table.auto-partition.enabled | Boolean | false | Whether enable auto partition for the table. Disable by default. When auto partition is enabled, the partitions of the table will be created automatically. |
+| table.auto-partition.key | String | (None) | This configuration defines the time-based partition key to be used for auto-partitioning when a table is partitioned with multiple keys. Auto-partitioning utilizes a time-based partition key to handle partitions automatically, including creating new ones and removing outdated ones, by comparing the time value of the partition with the current system time. In the case of a table using multiple partition keys (such as a composite partitioning strategy), this feature determines which key should serve as the primary time dimension for making auto-partitioning decisions. And If the table has only one partition key, this config is not necessary. Otherwise, it must be specified. |
+| table.auto-partition.time-unit | ENUM | DAY | The time granularity for auto created partitions. The default value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If the value is `DAY`, the partition format for auto created is yyyyMMdd. If the value is `MONTH`, the partition format for auto created is yyyyMM. If the value is `QUARTER`, the partition format for auto created is yyyyQ. If the value is `YEAR`, the partition format for auto created is yyyy. |
+| table.auto-partition.num-precreate | Integer | 2 | The number of partitions to pre-create for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11 and the value is configured as 3, then partitions 20241111, 20241112, 20241113 will be pre-created. If any one partition exists, it'll skip creating the partition. The default value is 2, which means 2 partitions will be pre-created. If the `table.auto-partition.time-unit` is `DAY`(default), one precreated partition is for today and another one is for tomorrow. For a partition table with multiple partition keys, pre-create is unsupported and will be set to 0 automatically when creating table if it is not explicitly specified. |
+| table.auto-partition.num-retention | Integer | 7 | The number of history partitions to retain for auto created partitions in each check for auto partition. For example, if the current check time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then the history partitions 20241108, 20241109, 20241110 will be retained. The partitions earlier than 20241108 will be deleted. The default value is 7, which means that 7 partitions will be retained. |
+| table.auto-partition.time-zone | String | the system time zone | The time zone for auto partitions, which is by default the same as the system time zone. |
+| table.replication.factor | Integer | (None) | The replication factor for the log of the new table. When it's not set, Fluss will use the cluster's default replication factor configured by default.replication.factor. It should be a positive number and not larger than the number of tablet servers in the Fluss cluster. A value larger than the number of tablet servers in Fluss cluster will result in an error when the new table is created. |
+| table.log.format | Enum | ARROW | The format of the log records in log store. The default value is `ARROW`. The supported formats are `ARROW` and `INDEXED`. |
+| table.log.arrow.compression.type | Enum | ZSTD | The compression type of the log records if the log format is set to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The default value is `ZSTD`. |
+| table.log.arrow.compression.zstd.level | Integer | 3 | The compression level of the log records if the log format is set to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to 22. The default value is 3. |
+| table.kv.format | Enum | COMPACTED | The format of the kv records in kv store. The default value is `COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`. |
+| table.log.tiered.local-segments | Integer | 2 | The number of log segments to retain in local for each table when log tiered storage is enabled. It must be greater that 0. The default is 2. |
+| table.datalake.enabled | Boolean | false | Whether enable lakehouse storage for the table. Disabled by default. When this option is set to ture and the datalake tiering service is up, the table will be tiered and compacted into datalake format stored on lakehouse storage. |
+| table.datalake.format | Enum | (None) | The data lake format of the table specifies the tiered Lakehouse storage format. Currently, supported formats are `paimon`, `iceberg`, and `lance`. In the future, more kinds of data lake format will be supported, such as DeltaLake or Hudi. Once the `table.datalake.format` property is configured, Fluss adopts the key encoding and bucketing strategy used by the corresponding data lake format. This ensures consistency in key encoding and bucketing, enabling seamless **Union Read** functionality across Fluss and Lakehouse. The `table.datalake.format` can be pre-defined before enabling `table.datalake.enabled`. This allows the data lake feature to be dynamically enabled on the table without requiring table recreation. If `table.datalake.format` is not explicitly set during table creation, the table will default to the format specified by the `datalake.format` configuration in the Fluss cluster. |
Review Comment:
Also update this description in `ConfigOptions`.
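Ditto here: a hedged sketch of the corresponding `table.datalake.format` description update in `ConfigOptions`, with the same caveat that the field name, enum type, and builder methods are illustrative assumptions rather than the actual Fluss source:
```java
// Hypothetical fragment only; names are illustrative, not the actual Fluss source.
public static final ConfigOption<DataLakeFormat> TABLE_DATALAKE_FORMAT =
        key("table.datalake.format")
                .enumType(DataLakeFormat.class)
                .noDefaultValue()
                .withDescription(
                        "The data lake format of the table, which specifies the tiered "
                                + "lakehouse storage format. Currently, the supported formats "
                                + "are paimon, iceberg, and lance. If not set explicitly at "
                                + "table creation, it defaults to the cluster-level "
                                + "'datalake.format' configuration.");
```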
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
