Re: Mass data batch processing

Emrul Islam Tue, 28 Feb 2017 14:08:19 -0800

Hi, I just came across this thread and realised I had a similar situation 
to deal with which I solved slightly differently.  In my case, I didn't 
need to update the target database until the end of the batch run but I 
also couldn't store everything in memory until the batch was completed. 
 Instead, I ended up using https://github.com/npgall/cqengine


1. I created a persisted disk index with CQEngine
2. As the batch ran I updated the indexes with the results
3. Once the batch was finished I went back and updated the source database 
with the results

Not sure if that helps you but it helped me (especially as it meant that 
the batch jobs became resumable)


On Tuesday, February 28, 2017 at 12:16:21 PM UTC, Jan Lessner wrote:
>
> Hello everyone
> Lukas and I started a discussion on the manual pages which Lukas proposed 
> to move over to the user group instead. The issue was: what is the perfect 
> pattern with JOOQ for a DB batch programm that is supposed to process a 
> significant amount of data - let's say: 1 million rows per night. I added a 
> real-world usecase at the bottom of this post. We assume that there are 
> good reasons why the business logic for this batch operation cannot be 
> ported to a stored procedure, so it must run in Java.
> Due to the large amount of data, you would like to select from an open DB 
> cursor (ResultSet in JDBC) without the need to produce intermediate lists 
> in memory which make your JVM run out of memory. This is perfectly 
> addressed by JOOQ with something like that:
>
> create.selectFrom(EMAILCONTACT)....stream()
>>
>
> Now a got all the source data provided as a stream. Very nice.
> For the inserts and updates we would need batched prepared statements for 
> maximum writing speed. Our Oracle DB makes 3.000 inserts per second with 
> batch prepared inserts, so we are fine with 1 million records per night. 
> And here comes the difficult part: how can I feed multiple batched prepared 
> statements from one source stream. JOOQ provides a few different features 
> for batched prepared statements (batchStore, batch, Loader API) but it 
> seems that there is a missing link to elegantly build a kind of processing 
> pipeline from the single source and the various outputs. Java streams 
> cannot be "plit" somehow, so there is no straight-forward solution. Lukas 
> brought the term "dispatching" into the discussion. Something like that 
> might be the link.
>
> Any suggestions?
>
> Regards,
> Jan
>
>
>
> The real-world usecase
> An external e-mail fullfillment system provides us a DB table about which 
> customers got send an e-mail for advertisment campaigns. Sometimes a few 
> thousand per day, sometimes a few million, depending on the marketing 
> department's actiity. A second table holds information about e-mail 
> bounces. Dependening on this information, our system has to write entries 
> in the corresponding customers' contact history as an input for the service 
> center application. If the e-mail bounced, the history entry has different 
> type and content. Additionally the e-mail bounce leads to an update of the 
> customer's e-mail account status.
> So we have a single source, consisting of
>
>    - The e-mail information records
>    - joined with the e-mail bounce information if present (i.e. outer 
>    join) and
>    - joined with the customer's e-mail account record (inner join). In 
>    case the e-mail status must be changed, we don't want to select the e-mail 
>    account separately. So the join is reasonable
>
> The results of the process are
>
>    - inserts into a customer contact history table (nearly the same 
>    amount as provided e-mail information records) and
>    - updates for e-mail accounts (an unpredictable sub-amount but still 
>    lots of)
>
>
>

-- 
You received this message because you are subscribed to the Google Groups "jOOQ 
User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: Mass data batch processing

Reply via email to