pratyakshsharma commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769919929
##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
 * Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
 * Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+The CarbonData streamer tool incrementally captures change events from sources such as Kafka or DFS and merges them into a target CarbonData table. To capture changes from a primary database such as MySQL, the change events must first be moved to Kafka by an external solution such as Debezium or Maxwell. The tool currently requires the incoming data to be in Avro format and the incoming schema to evolve in a backward-compatible way.
+
+Below is a high-level architecture of the overall pipeline:
+
+(architecture diagram)
+
+#### Configs
+
+The streamer tool exposes the following configs for CDC use cases:
+
+| Parameter | Default Value | Description |
+|-----------|---------------|-------------|
+| carbon.streamer.target.database | (none) | Name of the database containing the target table into which the incoming data is merged. If not configured, the current database of the Spark session is used. |
+| carbon.streamer.target.table | (none) | The target CarbonData table into which the data is merged. The operation fails if this is not configured. |
+| carbon.streamer.source.type | kafka | The streamer tool currently supports two types of data sources: data can be ingested into the target CarbonData table from either Kafka or DFS. |
+| carbon.streamer.dfs.input.path | (none) | Absolute path on the given file system from which data is read for ingestion into the target CarbonData table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | The streamer tool supports two ways to supply the schema of the incoming data: Avro files (file-based schema provider) or a schema registry. This property defines the URL to connect to when the schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | Mandatory when Kafka is chosen as the data source. Defines the topics from which the streamer tool consumes the data. |
+| bootstrap.servers | (none) | Mandatory when Kafka is chosen as the data source. Defines the endpoints of the Kafka brokers. |
+| auto.offset.reset | earliest | The streamer tool maintains checkpoints to track which incoming messages have already been consumed. For the first ingestion from a Kafka source, this property defines the offset from which ingestion starts. The only valid values are `latest` and `earliest`. |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any Kafka message is ultimately a key-value pair in the form of serialized bytes. This property defines the deserializer used for the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | Defines the class used to deserialize the values present in the Kafka topic. |
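To make the table above concrete, here is a minimal Scala sketch of the property set a Kafka-sourced ingestion into a CarbonData target would use. It only assembles and prints the properties; the object name and all values (database, table, topic, broker list, registry URL) are placeholders, and how the properties are actually handed over to the streamer tool is not shown here; refer to the tool's launcher for the real invocation.

```scala
// Minimal sketch (not part of the CarbonData API): assembles the streamer tool
// properties from the table above for a Kafka-sourced ingestion and prints them.
// All values are placeholders; adjust them to the actual deployment.
object StreamerToolPropsSketch {

  val streamerProps: Map[String, String] = Map(
    "carbon.streamer.target.database"   -> "my_db",                 // database of the target table
    "carbon.streamer.target.table"      -> "my_target_table",       // target CarbonData table
    "carbon.streamer.source.type"       -> "kafka",                 // source type; defaults to "kafka"
    "carbon.streamer.input.kafka.topic" -> "cdc_events",            // topic(s) to consume from
    "bootstrap.servers"                 -> "localhost:9092",        // kafka broker endpoints
    "auto.offset.reset"                 -> "earliest",              // offset used on first ingestion
    "schema.registry.url"               -> "http://localhost:8081", // schema source for Avro records
    "key.deserializer"   -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer"
  )

  def main(args: Array[String]): Unit =
    streamerProps.foreach { case (key, value) => println(s"$key=$value") }
}
```

Properties left unset fall back to the defaults listed in the table, for example `carbon.streamer.source.type` defaults to `kafka` and `auto.offset.reset` to `earliest`.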
Review comment:
       So we do not want to keep it configurable as of now?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@carbondata.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org