[ https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983307#comment-16983307 ]

Pratyaksh Sharma commented on HUDI-288:
---------------------------------------

[~xleesf] Here is what I have done - 

 

Basically, we want a single instance running for multiple tables in one go, 
but Hudi supports a single topic and a single target path with a 1-1 mapping. 
To overcome this, we wrote a wrapper that incorporates the following 
topic/table overrides ->

1. Have a separate topic for each table
2. Have a separate record key and partition path for each table
3. Have a separate schema-registry URL for each table
4. Have a separate hive_sync database and hive_sync table for every table
5. Support customised key generators for every table based on how the 
partition path field is formatted, and also specify that format as a config
6. Target base path has a one-to-one mapping with the topic. To achieve this, 
the target path is designed using the format -

<base_path>/<database to which the concerned table belongs>/<table> (this 
customisation is done in the wrapper after reading the args passed with the 
spark-submit command)

7. A similar format is adopted for the target_table_name as well, which is 
used when registering the metrics.
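
The target-path derivation in point 6 can be sketched as below. Note that the 
exact id format and all the names here (the "<database>.<table>" convention, 
TargetPathResolver, the underscore separator for the metrics name) are 
hypothetical illustrations, since the comment does not spell out the actual 
wrapper code -

```java
// Hypothetical sketch: derive the target base path and target table name
// from a TableConfig id, assuming the id is "<database>.<table>".
public class TargetPathResolver {
    private final String basePath;

    public TargetPathResolver(String basePath) {
        this.basePath = basePath;
    }

    // e.g. id "sales_db.orders" -> "<base_path>/sales_db/orders"
    public String targetBasePath(String id) {
        String[] parts = id.split("\\.", 2);
        return basePath + "/" + parts[0] + "/" + parts[1];
    }

    // target_table_name in a similar database/table style, used when
    // registering metrics (the underscore separator is an assumption)
    public String targetTableName(String id) {
        return id.replace('.', '_');
    }
}
```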

A custom POJO, TableConfig, is used to maintain all table/topic-specific 
properties, and the corresponding JSON objects are written as a list in a 
separate file. This file needs to be passed to the spark-submit command using 
the --files option. The wrapper reads this file and iterates over all the 
TableConfig objects, creating one HoodieDeltaStreamer instance for each such 
object and hence doing the ingestion for every topic in a loop. The wrapper is 
scheduled via Oozie. Here is a sample TableConfig object -

{
 "id": "<string in standard format to extract database and table name for the concerned table>",
 "record_key_field": "",
 "partition_key_field": "",
 "kafka_topic": "",
 "partition_input_format": "yyyy-MM-dd HH:mm:ss.S"
}
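
The wrapper's main loop can be sketched as follows. The field names mirror the 
sample TableConfig above, but the class names and the runOneIngestion method 
are stand-ins for constructing and running a HoodieDeltaStreamer per topic, 
not the actual Hudi API -

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the wrapper: one ingestion run per TableConfig.
public class MultiTableWrapper {
    static class TableConfig {
        final String id;
        final String recordKeyField;
        final String partitionKeyField;
        final String kafkaTopic;
        final String partitionInputFormat;

        TableConfig(String id, String recordKeyField, String partitionKeyField,
                    String kafkaTopic, String partitionInputFormat) {
            this.id = id;
            this.recordKeyField = recordKeyField;
            this.partitionKeyField = partitionKeyField;
            this.kafkaTopic = kafkaTopic;
            this.partitionInputFormat = partitionInputFormat;
        }
    }

    final List<String> ingestedTopics = new ArrayList<>();

    // In the real wrapper, the configs are parsed from the JSON file shipped
    // via --files; here they are passed in directly for illustration.
    public void runAll(List<TableConfig> configs) {
        for (TableConfig cfg : configs) {
            // one HoodieDeltaStreamer instance per TableConfig, run once
            runOneIngestion(cfg);
        }
    }

    void runOneIngestion(TableConfig cfg) {
        // placeholder for the actual delta-streamer run for this topic
        ingestedTopics.add(cfg.kafkaTopic);
    }
}
```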

We can keep adding more topic-specific overrides as per our needs and use 
case. This design assumes ingestion runs once for any table, not in continuous 
mode. We can discuss further how to support continuous mode for all the tables 
using this wrapper. 

Please let me know if this makes sense. 

> Add support for ingesting multiple kafka streams in a single DeltaStreamer 
> deployment
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-288
>                 URL: https://issues.apache.org/jira/browse/HUDI-288
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Vinoth Chandar
>            Assignee: leesf
>            Priority: Major
>
> https://lists.apache.org/thread.html/3a69934657c48b1c0d85cba223d69cb18e18cd8aaa4817c9fd72cef6@<dev.hudi.apache.org>
>  has all the context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
