Re: When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle 5 times, I stop accessing data

2023-04-03 Thread lee


Should we stop SparkContext?
李杰
leedd1...@163.com
 Replied Message 
From: lee
Date: 4/3/2023 11:09
To: Sivabalan
Cc: dev@hudi.apache.org
Subject: Re: When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle 5 times, I stop accessing data
I tried using 'org.apache.hudi.utilities.deltastreamer.NoNewDataTerminationStrategy' to
stop the task, but it did not behave as I expected. I assumed that once it
shut down the ExecutorService, the SparkContext would stop as well, but the
SparkContext keeps running and no further logs are visible.
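
For reference, this is roughly the count-based strategy I was hoping for. It is only a sketch: I am assuming a PostWriteTerminationStrategy-style hook with a boolean shouldShutdown() method that DeltaStreamer consults after every sync round, and the real interface and method signature may differ between Hudi versions.

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: ask DeltaStreamer to shut down after a fixed number of
// write rounds. The hook name is assumed; check the PostWriteTerminationStrategy
// contract in your Hudi version before relying on it.
public class MaxRoundsTerminationStrategy {

  private final int maxRounds;
  private final AtomicInteger completedRounds = new AtomicInteger(0);

  public MaxRoundsTerminationStrategy(int maxRounds) {
    this.maxRounds = maxRounds;
  }

  // Assumed hook: called after each source-fetch -> transform -> write round
  // in continuous mode; returning true should end the loop.
  public boolean shouldShutdown() {
    return completedRounds.incrementAndGet() >= maxRounds;
  }
}

Even with a strategy like this, the open question for me is whether DeltaStreamer also stops the SparkContext once the strategy asks for shutdown, which is exactly what did not seem to happen in my test above.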








李杰
leedd1...@163.com
 Replied Message 
From: Sivabalan
Date: 4/1/2023 01:07
To: 
Subject: Re: When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle 5 times, I stop accessing data
We do have a graceful termination option with deltastreamer
continuous mode. Please check here
for the post-write termination strategy. You can implement your own termination
strategy. Hope that helps.

On Thu, 30 Mar 2023 at 20:16, Vinoth Chandar  wrote:

I believe there is no control today. You could hack a precommit validator
and call System.exit if you want ;) (ugly, I know)

But maybe we could introduce some abstraction to do a check between loops,
or allow users to plug in some logic to decide whether to continue or exit?

Love to understand the use-case more here.

On Wed, Mar 29, 2023 at 7:32 AM lee  wrote:

When I use the HoodieDeltaStreamer, the "--continuous" parameter says: "Delta
Streamer runs in continuous mode running source fetch -> Transform -> Hudi
Write in loop". So I would like to ask whether there is any corresponding
parameter that can control the number of cycles, such as stopping
ingestion after I have cycled 5 times.



李杰
leedd1...@163.com






--
Regards,
-Sivabalan


Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread 孔维
Hi, 


Yes, we can create multiple Spark input partitions per Kafka partition.


I think the write operations can handle potentially out-of-order events,
because before writing we preCombine the incoming events using the
source-ordering-field, and we also combineAndGetUpdateValue with the records
on storage. From a business perspective, we use this combine logic to keep our
data correct. And Hudi does not require any guarantees about the ordering of
Kafka events.
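
To illustrate the combine logic I mean, here is a reduced sketch. It is not our production payload: real payloads implement org.apache.hudi.common.model.HoodieRecordPayload, and the method signatures below are simplified for readability.

import org.apache.avro.generic.GenericRecord;

// Simplified sketch of the two combine steps, keyed on an ordering field
// (e.g. an event timestamp). Signatures are reduced for illustration.
public class LatestByOrderingFieldPayload {

  private final GenericRecord record;
  private final long orderingVal; // value of the source-ordering-field

  public LatestByOrderingFieldPayload(GenericRecord record, long orderingVal) {
    this.record = record;
    this.orderingVal = orderingVal;
  }

  // preCombine: among incoming events with the same key, keep the one with
  // the larger ordering value, so intra-batch Kafka ordering does not matter.
  public LatestByOrderingFieldPayload preCombine(LatestByOrderingFieldPayload other) {
    return other.orderingVal > this.orderingVal ? other : this;
  }

  // combineAndGetUpdateValue: merge the surviving incoming record with the
  // record already on storage, again preferring the larger ordering value.
  public GenericRecord combineAndGetUpdateValue(GenericRecord onStorage, long storageOrderingVal) {
    return storageOrderingVal > this.orderingVal ? onStorage : this.record;
  }
}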


I have already filed a JIRA (https://issues.apache.org/jira/browse/HUDI-6019);
could you help assign it to me?







At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
>Hi,
>
>Does your implementation read out offset ranges from Kafka partitions?
>Which means - we can create multiple Spark input partitions per Kafka
>partition?
>If so, +1 for the overall goals here.
>
>How does this affect ordering? Can you think about how/if Hudi write
>operations can handle potentially out-of-order events being read out?
>It feels like we can add a JIRA for this anyway.
>
>
>
>On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
>
>> Hi team, for the kafka source, when pulling data from kafka, the default
>> parallelism is the number of kafka partitions.
>> There are cases:
>>
>> Pulling a large amount of data from Kafka (eg. maxEvents=1), but the
>> # of Kafka partitions is not enough; the pull will take
>> too much time, or even worse, cause executor OOM
>> There is huge data skew between Kafka partitions; the pull
>> will be blocked by the slowest partition
>>
>> To solve those cases, I want to add a parameter
>> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the maxEvents in
>> one Kafka batch; the default Long.MAX_VALUE means the feature is not turned on.
>> This configuration (hoodie.deltastreamer.kafka.per.batch.maxEvents) will
>> take effect after the hoodie.deltastreamer.kafka.source.maxEvents config.
>>
>>
>> Here is my POC of the improvement:
>> max executor core is 128.
>> Without the feature turned on
>> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>>
>>
>> With the feature turned on (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>>
>>
>> After turning on the feature, the Tagging time drops from 4.4 mins to
>> 1.1 mins, and it could be even faster given more cores.
>>
>> What do you think? Can I file a JIRA issue for this?
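
To make the proposed splitting concrete, here is a standalone sketch of the idea. It is illustrative only: KafkaOffsetRange is a hypothetical holder, and a real implementation would operate on the offset-range/checkpoint structures already used by the Kafka source.

import java.util.ArrayList;
import java.util.List;

// Split one Kafka partition's offset range into several Spark input
// partitions of at most maxEventsPerBatch events each.
public class OffsetRangeSplitter {

  public static class KafkaOffsetRange {
    public final String topic;
    public final int partition;
    public final long fromOffset;  // inclusive
    public final long untilOffset; // exclusive

    public KafkaOffsetRange(String topic, int partition, long fromOffset, long untilOffset) {
      this.topic = topic;
      this.partition = partition;
      this.fromOffset = fromOffset;
      this.untilOffset = untilOffset;
    }
  }

  public static List<KafkaOffsetRange> split(KafkaOffsetRange range, long maxEventsPerBatch) {
    List<KafkaOffsetRange> splits = new ArrayList<>();
    long start = range.fromOffset;
    while (start < range.untilOffset) {
      long end = Math.min(start + maxEventsPerBatch, range.untilOffset);
      splits.add(new KafkaOffsetRange(range.topic, range.partition, start, end));
      start = end;
    }
    return splits;
  }
}

With hoodie.deltastreamer.kafka.per.batch.maxEvents left at the proposed default of Long.MAX_VALUE, every partition yields a single split and the current behaviour is unchanged.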


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Pratyaksh Sharma
Hi Vinoth,

I am aligned with the first reason that you mentioned. Better to have a
separate tool to take care of this.

On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar 
wrote:

> +1
>
> I was thinking that we add a new utility and NOT extend DeltaStreamer by
> adding a Sink interface, for the following reasons
>
> - It will make it look like a generic Source => Sink ETL tool, which is
> actually not our intention to support on Hudi. There are plenty of good
> tools for that out there.
> - the config management can get a bit hard to understand, since we overload
> ingest and reverse ETL into a single tool. So break it off at the use-case
> level?
>
> Thoughts?
>
> David:  PMC does not have control over that. Please see unsubscribe
> instructions here. https://hudi.apache.org/community/get-involved
> Love to keep this thread about reverse streamer discussion. So kindly fork
> another thread if you want to discuss unsubscribing.
>
> On Fri, Mar 31, 2023 at 1:47 AM Davidiam  wrote:
>
> > Hello Vinoth,
> >
> > Can you please unsubscribe me?  I have been trying to unsubscribe for
> > months without success.
> >
> > Kind Regards,
> > David
> >
> > Sent from Outlook for Android
> > 
> > From: Vinoth Chandar 
> > Sent: Friday, March 31, 2023 5:09:52 AM
> > To: dev 
> > Subject: [DISCUSS] Hudi Reverse Streamer
> >
> > Hi all,
> >
> > Any interest in building a reverse streaming tool, that does the reverse
> of
> > what the DeltaStreamer tool does? It will read Hudi table incrementally
> > (only source) and write out the data to a variety of sinks - Kafka, JDBC
> > Databases, DFS.
> >
> > This has come up many times with data warehouse users. Oftentimes, they
> > want to use Hudi to speed up or reduce costs on their data ingestion and
> > ETL (using Spark/Flink), but want to move the derived data back into a
> data
> > warehouse or an operational database for serving.
> >
> > What do you all think?
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Vinoth Chandar
+1

I was thinking that we add a new utility and NOT extend DeltaStreamer by
adding a Sink interface, for the following reasons

- It will make it look like a generic Source => Sink ETL tool, which is
actually not our intention to support on Hudi. There are plenty of good
tools for that out there.
- the config management can get a bit hard to understand, since we overload
ingest and reverse ETL into a single tool. So break it off at the use-case
level?

Thoughts?

David:  PMC does not have control over that. Please see unsubscribe
instructions here. https://hudi.apache.org/community/get-involved
Love to keep this thread about reverse streamer discussion. So kindly fork
another thread if you want to discuss unsubscribing.

On Fri, Mar 31, 2023 at 1:47 AM Davidiam  wrote:

> Hello Vinoth,
>
> Can you please unsubscribe me?  I have been trying to unsubscribe for
> months without success.
>
> Kind Regards,
> David
>
> Sent from Outlook for Android
> 
> From: Vinoth Chandar 
> Sent: Friday, March 31, 2023 5:09:52 AM
> To: dev 
> Subject: [DISCUSS] Hudi Reverse Streamer
>
> Hi all,
>
> Any interest in building a reverse streaming tool, that does the reverse of
> what the DeltaStreamer tool does? It will read Hudi table incrementally
> (only source) and write out the data to a variety of sinks - Kafka, JDBC
> Databases, DFS.
>
> This has come up many times with data warehouse users. Oftentimes, they
> want to use Hudi to speed up or reduce costs on their data ingestion and
> ETL (using Spark/Flink), but want to move the derived data back into a data
> warehouse or an operational database for serving.
>
> What do you all think?
>
> Thanks
> Vinoth
>
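
To make the shape of such a tool concrete, one round of the loop could look roughly like this with today's Spark APIs. This is a sketch only: the incremental-read options are the standard Hudi datasource options, while the table path, JDBC URL and checkpoint handling are made-up examples.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Sketch of one "reverse streaming" round: incrementally read a Hudi table
// since the last processed instant and push the changes to a JDBC sink.
// Persisting the last instant between rounds is omitted for brevity.
public class ReverseStreamerSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-reverse-streamer-sketch")
        .getOrCreate();

    String lastInstant = args.length > 0 ? args[0] : "000"; // resume point

    Dataset<Row> changes = spark.read()
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", lastInstant)
        .load("/data/hudi/derived_table");                  // example table path

    changes.write()
        .format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/serving") // example sink
        .option("dbtable", "public.derived_table")
        .mode(SaveMode.Append)
        .save();

    spark.stop();
  }
}

A dedicated utility would wrap this in checkpointing, sink abstractions and the continuous-mode loop, which is where keeping it separate from DeltaStreamer's config surface helps.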


Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread Vinoth Chandar
Hi,

Does your implementation read out offset ranges from Kafka partitions?
Which means - we can create multiple Spark input partitions per Kafka
partition?
If so, +1 for the overall goals here.

How does this affect ordering? Can you think about how/if Hudi write
operations can handle potentially out-of-order events being read out?
It feels like we can add a JIRA for this anyway.



On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:

> Hi team, for the kafka source, when pulling data from kafka, the default
> parallelism is the number of kafka partitions.
> There are cases:
>
> Pulling a large amount of data from Kafka (eg. maxEvents=1), but the
> # of Kafka partitions is not enough; the pull will take
> too much time, or even worse, cause executor OOM
> There is huge data skew between Kafka partitions; the pull
> will be blocked by the slowest partition
>
> To solve those cases, I want to add a parameter
> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the maxEvents in
> one Kafka batch; the default Long.MAX_VALUE means the feature is not turned on.
> This configuration (hoodie.deltastreamer.kafka.per.batch.maxEvents) will
> take effect after the hoodie.deltastreamer.kafka.source.maxEvents config.
>
>
> Here is my POC of the improvement:
> max executor core is 128.
> Without the feature turned on
> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>
>
> With the feature turned on (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>
>
> After turning on the feature, the Tagging time drops from 4.4 mins to
> 1.1 mins, and it could be even faster given more cores.
>
> What do you think? Can I file a JIRA issue for this?