[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-12-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986680#comment-16986680
 ] 

leesf commented on HUDI-288:


> Thank you for letting me drive this work. I was wondering whether we should 
> add documentation for this tool as well; that would help a lot of users 
> quickly adopt Hudi for data ingestion. 

It would be nice to have some docs. I have assigned the issue to you.

> Add support for ingesting multiple kafka streams in a single DeltaStreamer 
> deployment
> -
>
> Key: HUDI-288
> URL: https://issues.apache.org/jira/browse/HUDI-288
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> https://lists.apache.org/thread.html/3a69934657c48b1c0d85cba223d69cb18e18cd8aaa4817c9fd72cef6@
>  has all the context



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-12-02 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986664#comment-16986664
 ] 

Pratyaksh Sharma commented on HUDI-288:
---

[~xleesf] Thank you for letting me drive this work. I was wondering whether we 
should add documentation for this tool as well; that would help a lot of users 
quickly adopt Hudi for data ingestion. 

[~vinoth] I am not good at suggesting names. Looking forward to your 
suggestions for this tool. :)



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-12-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986539#comment-16986539
 ] 

Vinoth Chandar commented on HUDI-288:
-

Great! 



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-12-02 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986433#comment-16986433
 ] 

leesf commented on HUDI-288:


Hi [~vinoth]. Since [~Pratyaksh] has completed most of the code, I would like 
to assign it to him, and I will help to review the code.



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-12-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986264#comment-16986264
 ] 

Vinoth Chandar commented on HUDI-288:
-

> So keeping  in the target path looks a bit skeptical to me 
> because a topic name might not necessarily include table name

I agree. I was only suggesting a sane default; we should let the user override 
it as needed, using a TableConfig-like mechanism. If there is no override, 
table_name = topic_name seems an acceptable default to me. At Uber at least, it 
was very useful for auto-creating Hudi datasets based on newly added kafka 
topics, for example.

> then it will keep on running for the first table itself and will never pick 
> up the next table

Yes, you need a thread per DeltaSync instance. Supporting continuous mode 
would be good for k8s deployments, where cluster setup and teardown are costly 
affairs. Continuous mode also solves the problem of managing compactions for 
MOR. For COW, running without continuous mode could be sufficient. We can phase 
this in slowly as well.  
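
To make the thread-per-DeltaSync idea concrete, here is a minimal sketch, written in Python rather than Hudi's Java. The names `run_per_table_threads`, `sync_once`, and `rounds` are hypothetical and not part of Hudi. `sync_once` stands in for one ingestion round of a single table, and the bounded `rounds` loop stands in for continuous mode so that the sketch terminates.

```python
import threading
from typing import Callable, Iterable


def run_per_table_threads(
    tables: Iterable[str],
    sync_once: Callable[[str], None],
    rounds: int,
) -> None:
    """Run `sync_once(table)` `rounds` times per table, one thread per table."""

    def loop(table: str) -> None:
        # Each table gets its own loop, so one table never blocks another.
        for _ in range(rounds):
            sync_once(table)

    threads = [
        threading.Thread(target=loop, args=(table,), name=f"sync-{table}")
        for table in tables
    ]
    for thread in threads:
        thread.start()
    # Wait for every per-table loop to finish before returning.
    for thread in threads:
        thread.join()
```

Because each table has its own thread, a loop that keeps running for the first table does not prevent the next table from being picked up.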

 

So, who's going to drive this? :)  We should also give this tool a cool name :D 



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-28 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984368#comment-16984368
 ] 

leesf commented on HUDI-288:


Thanks for sharing. It looks very comprehensive. 
I have some thoughts. Regarding point 6, the target path was designed as  
_//_; as 
discussed above with Vinoth, would it be reasonable to use _ 
`/`_ instead?  Regarding point 7, could we get rid of 
oozie, since introducing it into hudi might not be very reasonable?  Are there 
any other considerations behind not supporting continuous mode currently? Also, 
does the wrapper seem able to replace the current DeltaStreamer? 




[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-27 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983307#comment-16983307
 ] 

Pratyaksh Sharma commented on HUDI-288:
---

[~xleesf] Here is what I have done - 

 

Basically, we want a single instance to run ingestion for multiple tables in 
one go, but Hudi supports only a single topic and a single target path, with a 
1-1 mapping. To overcome this, we wrote a wrapper that incorporates the 
following topic/table overrides:

1. Have a separate topic for each table
2. Have a separate record key and partition path for each table
3. Have a separate schema-registry URL for each table
4. Have a separate hive_sync database and hive_sync table for every table
5. Support customised key generators for every table, based on how the 
partition path field is formatted, and specify that format as a config
6. Target base path has a one-to-one mapping with the topic. To achieve this, 
the target path was designed using the format -

// (this 
customization was done in the wrapper after reading the args passed with the 
spark-submit command)

7. A similar format is adopted for target_table_name as well, which is used at 
the time of registering the metrics.

A custom POJO, TableConfig, is used to maintain all table/topic-specific 
properties, and the corresponding JSON objects are written as a list in a 
separate file. That file needs to be passed to the spark-submit command using 
the --files option. The wrapper reads this file and iterates over all the 
TableConfig objects. It creates one HoodieDeltaStreamer instance for each such 
object, and hence does the ingestion for every topic by running in a loop. This 
wrapper is scheduled via oozie. Here is a sample TableConfig object -

{
 "id": 
 "record_key_field": "",
 "partition_key_field": "",
 "kafka_topic": "",
 "partition_input_format": "-MM-dd HH:mm:ss.S"
}

We can keep adding more topic-specific overrides as our needs and use cases 
require. This design assumes that ingestion runs once for each table, not in 
continuous mode. We can discuss further to see how we can support continuous 
mode for all the tables using this wrapper. 

Please let me know if this makes sense. 
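
A minimal sketch of the wrapper loop described above, written in Python for brevity rather than the Java of the actual PoC, which is not shown in this thread. The field names follow the sample TableConfig object. Every value in `SAMPLE` is made up for illustration, and `ingest_once` is a hypothetical stand-in for creating and running one HoodieDeltaStreamer instance.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TableConfig:
    """Per-table overrides; field names mirror the sample object above."""
    id: str
    record_key_field: str
    partition_key_field: str
    kafka_topic: str
    partition_input_format: str


def load_table_configs(text: str) -> List[TableConfig]:
    """Parse a JSON list of per-table override objects."""
    return [TableConfig(**entry) for entry in json.loads(text)]


def ingest_all(
    configs: List[TableConfig],
    ingest_once: Callable[[TableConfig], None],
) -> List[str]:
    """Run one ingestion round per table, in order, and return the topics handled."""
    handled = []
    for config in configs:
        ingest_once(config)  # stand-in for one HoodieDeltaStreamer run
        handled.append(config.kafka_topic)
    return handled


# Hypothetical file contents; none of these values come from the thread.
SAMPLE = """
[
  {"id": "1", "record_key_field": "order_id",
   "partition_key_field": "created_at", "kafka_topic": "orders",
   "partition_input_format": "yyyy-MM-dd"},
  {"id": "2", "record_key_field": "user_id",
   "partition_key_field": "signup_date", "kafka_topic": "users",
   "partition_input_format": "yyyy-MM-dd"}
]
"""
```

Since the loop is sequential and each table is ingested once per invocation, this matches the run-once behaviour described above and not continuous mode.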



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-26 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982492#comment-16982492
 ] 

leesf commented on HUDI-288:


[~Pratyaksh] Of course, and I am glad that you have implemented a similar 
wrapper; it will save a lot of time. Would it be convenient for you to share 
the PoC you have implemented, so that we can see whether it satisfies the needs 
discussed above?



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-26 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16982444#comment-16982444
 ] 

Pratyaksh Sharma commented on HUDI-288:
---

[~xleesf] I have written a similar wrapper for my organisation's use case. That 
wrapper also uses topic-level overrides, as [~vinoth] is suggesting. Please let 
me know if I can work with you on this or if I can be of any help in closing 
this. 



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-20 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978324#comment-16978324
 ] 

leesf commented on HUDI-288:


> I think we can stick to the same whitelist/blacklist that Kafka itself uses? 

It makes sense.

> IIUC, even now, we can specify multiple topics as source but they get 
>written as a single Hudi dataset.

I took a look at the current code and found that the config 
_hoodie.deltastreamer.source.kafka.topic_ identifies the topic to ingest. I 
think it does not support multiple topics, so currently we can configure only 
one topic to ingest. Please correct me if I missed anything or am wrong.

> we want to ingest kafka topics as separate Hudi datasets.  1-1 mapping 
>between a kafka topic and a hudi dataset.. I think the tool can take a 
>`--base-path-prefix` and place each hudi dataset under 
>`/`

It makes sense.

> Also we could allow topic level overrides as needed.. for delta streamer/hudi 
>properties.. Our DFSPropertiesConfiguration class already supports includes as 
>well. 

Sorry, I did not understand this correctly. Could you please share more details?

 

> Are you targeting this for 0.5.1 next release? Or do you think we can pick 
>up some things already labelled for that release.

I would like to get it ready for 0.5.1.

 

 

 

 



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977695#comment-16977695
 ] 

Vinoth Chandar commented on HUDI-288:
-

I think we can stick to the same whitelist/blacklist that Kafka itself uses? 

 

> should we allow users to specify multi targetBasePath while consuming many 
> topics, I think only one targetBasePath is simpler but does it make sense? 

We need to do this. IIUC, even now, we can specify multiple topics as source 
but they get written as a single Hudi dataset. Here, we want to ingest kafka 
topics as separate Hudi datasets, with a 1-1 mapping between a kafka topic and 
a hudi dataset. I think the tool can take a `--base-path-prefix` and place each 
hudi dataset under `/`. Also, we could allow topic 
level overrides as needed for delta streamer/hudi properties. Our 
DFSPropertiesConfiguration class already supports includes as well. 
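
One way to picture the per-topic defaults and overrides is the following Python sketch. It is an illustration under stated assumptions, not Hudi code: the function `resolve_table_settings` and the keys `target_base_path` and `target_table_name` are hypothetical. The exact path placeholder did not survive in the message above, so joining the prefix and the topic name is an assumption, as is the default of naming the table after the topic.

```python
import posixpath
from typing import Dict, Mapping


def resolve_table_settings(
    base_path_prefix: str,
    topic: str,
    common: Mapping[str, str],
    overrides: Mapping[str, Mapping[str, str]],
) -> Dict[str, str]:
    """Build the settings for one topic: shared values, defaults, then overrides."""
    settings = dict(common)
    # Assumed default layout: one dataset directory per topic under the prefix.
    settings["target_base_path"] = posixpath.join(base_path_prefix, topic)
    # Assumed default: the table is named after the topic.
    settings["target_table_name"] = topic
    # Topic-level overrides win over both shared values and defaults.
    settings.update(overrides.get(topic, {}))
    return settings
```

Applying overrides last means a topic whose name is not a suitable table name can still be mapped explicitly, while newly added topics work without any extra configuration.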

 

Are you targeting this for the next release, 0.5.1? Or do you think we can pick 
up some things already labelled for that release?



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-11-18 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976500#comment-16976500
 ] 

leesf commented on HUDI-288:


[~vinoth] Sorry for the late feedback. After a closer look at the code paths, I 
prefer the second solution: we write a new tool that wraps the current 
DeltaStreamer, uses a kafka topic regex to identify all topics that need to be 
ingested, and creates one delta streamer per topic within a SINGLE spark 
application. This solution is easier than the first one.

Two questions. First, if the topics that need to be ingested do not fit a regex 
pattern, should we also allow users to list all topics explicitly? 
Second, in the current data flow, the relationship of kafka topic to 
_targetBasePath_ is one-to-one; should we allow users to specify multiple 
targetBasePath values while consuming many topics? The same applies to the 
config _targetTableName_ in hive.
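
The topic selection described here could look roughly like the Python sketch below. It is only an illustration: `select_topics` and its parameters are hypothetical names, and combining a regex with an explicit list is one possible answer to the first question, not a decided design.

```python
import re
from typing import Iterable, List, Optional


def select_topics(
    available: Iterable[str],
    pattern: Optional[str],
    explicit: Iterable[str] = (),
) -> List[str]:
    """Return the available topics that match the regex or are listed explicitly."""
    regex = re.compile(pattern) if pattern else None
    listed = set(explicit)
    chosen = [
        topic
        for topic in available
        # fullmatch requires the whole topic name to match the pattern.
        if topic in listed or (regex is not None and regex.fullmatch(topic) is not None)
    ]
    return sorted(chosen)
```

The wrapper would then create one delta streamer for each returned topic inside the same spark application.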



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-10-04 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944571#comment-16944571
 ] 

Vinoth Chandar commented on HUDI-288:
-

One additional feature that would be great to support here (pasting from 
Slack): 

??I was experimenting with Marmaray and found it to be very similar to the 
HoodieDeltaStreamer utility in the HUDI project. The only difference I can see 
(in terms of feature) is the error table feature in Marmaray which is currently 
not possible in the HoodieDeltaStreamer??

Hudi hands you back the records that errored out in a batch of writes. If we 
can send those to a special error dataset, that would be really good! (I 
designed that feature back in the day in Marmaray; it has not made its way into 
delta streamer yet.) 
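
As a rough illustration of the error-table idea, the Python sketch below splits the results of one write batch into successes and failures, so that the failures can be routed to a separate error dataset. `WriteResult` and `split_errors` are hypothetical names and do not reflect the actual Hudi or Marmaray APIs.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class WriteResult:
    """Outcome of writing one record; `error` is None when the write succeeded."""
    record_key: str
    error: Optional[str] = None


def split_errors(
    results: Iterable[WriteResult],
) -> Tuple[List[WriteResult], List[WriteResult]]:
    """Partition a batch into (successful, failed) results, preserving order."""
    succeeded: List[WriteResult] = []
    failed: List[WriteResult] = []
    for result in results:
        (failed if result.error is not None else succeeded).append(result)
    return succeeded, failed
```

The failed list, together with each error message, is what would be written to the special error dataset.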



[jira] [Commented] (HUDI-288) Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2019-10-03 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943728#comment-16943728
 ] 

Vinoth Chandar commented on HUDI-288:
-

Great! This will help larger companies with more topics easily adopt delta 
streamer. 
