pratyakshsharma commented on a change in pull request #4243:
URL: https://github.com/apache/carbondata/pull/4243#discussion_r769919929



##########
File path: docs/scd-and-cdc-guide.md
##########
@@ -131,4 +131,88 @@ clauses can have at most one UPDATE and one DELETE action, These clauses have th
 
 * Please refer example class [MergeTestCase](https://github.com/apache/carbondata/blob/master/integration/spark/src/test/scala/org/apache/carbondata/spark/testsuite/merge/MergeTestCase.scala) to understand and implement scd and cdc scenarios using APIs.
 * Please refer example class [DataMergeIntoExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataMergeIntoExample.scala) to understand and implement scd and cdc scenarios using sql.
-* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
\ No newline at end of file
+* Please refer example class [DataUPSERTExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/DataUPSERTExample.scala) to understand and implement cdc using UPSERT APIs.
+
+### Streamer Tool
+
+The CarbonData streamer tool is a powerful tool for incrementally capturing change events from varied sources like Kafka or DFS and merging them into a target CarbonData table. This means that if one wishes to capture changes from primary databases like MySQL, one needs to integrate with external solutions like Debezium or Maxwell to move the change events to Kafka. The tool currently requires the incoming data to be in Avro format and the incoming schema to evolve in a backward compatible way.
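For reference, the sketch below (not part of the original documentation) illustrates what a backward compatible Avro schema evolution looks like: a new field is added with a default value, so records written with the older schema can still be read with the newer one. The `Customer` record and its fields are made up for illustration.

```scala
import org.apache.avro.Schema

// Hypothetical example of a backward compatible Avro schema change:
// v2 adds an optional "email" field with a default, so data written
// with v1 can still be read with v2. Dropping or retyping a field
// would not be backward compatible.
object BackwardCompatibleSchemaSketch {
  def main(args: Array[String]): Unit = {
    val v1 = new Schema.Parser().parse(
      """{"type":"record","name":"Customer","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"name","type":"string"}
        |]}""".stripMargin)

    val v2 = new Schema.Parser().parse(
      """{"type":"record","name":"Customer","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"name","type":"string"},
        |  {"name":"email","type":["null","string"],"default":null}
        |]}""".stripMargin)

    println(s"v1 fields: ${v1.getFields}")
    println(s"v2 fields: ${v2.getFields}")
  }
}
```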
+
+Below is a high level architecture of the overall pipeline -
+
+![Carbondata streamer tool pipeline](../docs/images/carbondata-streamer-tool-pipeline.png?raw=true)
+
+#### Configs
+
+The streamer tool exposes the following configs for users to cater to their CDC use cases -
+
+| Parameter | Default Value | Description |
+|-----------|---------------|-------------|
+| carbon.streamer.target.database | (none) | The database name where the target table is present to merge the incoming data. If not given by user, system will take the current database in the spark session. |
+| carbon.streamer.target.table | (none) | The target carbondata table where the data has to be merged. If this is not configured by user, the operation will fail. |
+| carbon.streamer.source.type | kafka | Streamer tool currently supports two types of data sources. One can ingest data from either kafka or DFS into target carbondata table using streamer tool. |
+| carbon.streamer.dfs.input.path | (none) | An absolute path on a given file system from where data needs to be read to ingest into the target carbondata table. Mandatory if the ingestion source type is DFS. |
+| schema.registry.url | (none) | Streamer tool supports 2 different ways to supply schema of incoming data. Schemas can be supplied using avro files (file based schema provider) or using schema registry. This property defines the url to connect to in case schema registry is used as the schema source. |
+| carbon.streamer.input.kafka.topic | (none) | This is a mandatory property to be set in case kafka is chosen as the source of data. This property defines the topics from where streamer tool will consume the data. |
+| bootstrap.servers | (none) | This is another mandatory property in case kafka is chosen as the source of data. This defines the end points for kafka brokers. |
+| auto.offset.reset | earliest | Streamer tool maintains checkpoints to keep a track of the incoming messages which are already consumed. In case of first ingestion using kafka source, this property defines the offset from where ingestion will start. This property can take only 2 valid values - `latest` and `earliest`. |
+| key.deserializer | `org.apache.kafka.common.serialization.StringDeserializer` | Any message in kafka is ultimately a key value pair in the form of serialized bytes. This property defines the deserializer to deserialize the key of a message. |
+| value.deserializer | `io.confluent.kafka.serializers.KafkaAvroDeserializer` | This property defines the class which will be used for deserializing the values present in kafka topic. |
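As a rough illustration of how the properties in the table fit together, here is a minimal, hypothetical sketch that collects them on a `SparkConf`. The database, table, topic, broker and schema registry values are placeholders, and the actual streamer tool may expect the configuration to be supplied differently (for example via `spark-submit` options).

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch: gathering the streamer tool configs from the table
// above into a SparkConf. All concrete values below are placeholders.
object StreamerConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("carbondata-streamer")
      // Target table that incoming change events are merged into
      .set("carbon.streamer.target.database", "default")
      .set("carbon.streamer.target.table", "customers")
      // Kafka is chosen as the ingestion source, so topic and brokers are mandatory
      .set("carbon.streamer.source.type", "kafka")
      .set("carbon.streamer.input.kafka.topic", "customer_change_events")
      .set("bootstrap.servers", "localhost:9092")
      .set("auto.offset.reset", "earliest")
      // Incoming data is avro; schemas are served by a schema registry
      .set("schema.registry.url", "http://localhost:8081")
      .set("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer")
      .set("value.deserializer",
        "io.confluent.kafka.serializers.KafkaAvroDeserializer")
    // The populated conf would then be handed to the streamer tool's entry point.
    println(conf.toDebugString)
  }
}
```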

Review comment:
       So we do not want to keep it configurable as of now from the user's point of view?



