linliu-code commented on code in PR #18867:
URL: https://github.com/apache/hudi/pull/18867#discussion_r3317988314
##########
website/docs/hoodie_streaming_ingestion.md:
##########
@@ -503,6 +505,78 @@ Check out [Kafka source config](https://hudi.apache.org/docs/configurations#Kafk
 Hudi Streamer also supports ingesting from Apache Pulsar via `org.apache.hudi.utilities.sources.PulsarSource`. Check out [Pulsar source config](https://hudi.apache.org/docs/configurations#Pulsar-Source-Configs) for more details.
+#### Amazon Kinesis
+
+Use the `JsonKinesisSource` (`org.apache.hudi.utilities.sources.JsonKinesisSource`) to ingest JSON records from an AWS Kinesis Data Stream into a Hudi table. It reads from every shard in parallel, tracks per-shard progress in the Hudi Streamer checkpoint, automatically handles shard splits and merges, and de-aggregates records produced by the Kinesis Producer Library (KPL).
+
+##### Common configuration
+
+All keys use the prefix `hoodie.streamer.source.kinesis.`. The settings most users need:
+
+| Config key | Default | Description |
+|---|---|---|
+| `hoodie.streamer.source.kinesis.stream.name` | (required) | Kinesis Data Streams stream name. |
+| `hoodie.streamer.source.kinesis.region` | (required) | AWS region for the stream (e.g., `us-east-1`). |
+| `hoodie.streamer.source.kinesis.starting.position` | `LATEST` | Where to start when no checkpoint exists yet. `LATEST` starts at the tip of each shard; `EARLIEST` replays from `TRIM_HORIZON`. |
+| `hoodie.streamer.source.kinesis.max.events` | `5000000` | Maximum number of records read per batch across all shards. Tune to control batch size. |
+| `hoodie.streamer.source.kinesis.append.offsets` | `false` | When enabled, appends Kinesis metadata fields to each record: `_hoodie_kinesis_source_sequence_number`, `_hoodie_kinesis_source_shard_id`, `_hoodie_kinesis_source_partition_key`, `_hoodie_kinesis_source_timestamp`. |

Review Comment:
We have the config, but there is no logic to append these meta fields.
So either we state that this will be handled in a follow-up, or we don't document this config until it is implemented.
