I already expressed my concern - this is a counterintuitive approach. Without
happens-before guarantees, a pure streaming model can be applied only to
independent chunks of data. That means the mentioned ETL use case is not
feasible - ETL always depends on implicit or explicit links between tables,
and hence streaming is not applicable here. My question still stands - what
products, except possibly Ignite, do this kind of JDBC streaming?
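To make the happens-before concern concrete: a streamer buffers writes asynchronously, so a read issued immediately after a write may not see it until the buffer is flushed. The following is a minimal toy sketch of that effect (the `BufferedStore` class is hypothetical illustration, not Ignite code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of streamer-style buffered writes: put() only stages the
// entry; nothing is visible to reads until flush() runs. This models
// the broken read-your-writes guarantee discussed in this thread.
public class BufferedStore {
    private final Map<String, String> store = new HashMap<>();   // "server" state
    private final Map<String, String> buffer = new HashMap<>();  // streamer buffer

    public void put(String key, String val) {
        buffer.put(key, val);          // staged, not yet written
    }

    public String get(String key) {
        return store.get(key);         // reads see only flushed data
    }

    public void flush() {
        store.putAll(buffer);          // buffered batch reaches the "server"
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedStore s = new BufferedStore();
        s.put("1", "John");
        System.out.println(s.get("1")); // SELECT after INSERT: row not visible yet
        s.flush();
        System.out.println(s.get("1")); // visible only after flush
    }
}
```

In a single-threaded JDBC session this is exactly the surprise users would hit: an INSERT routed through a streamer followed by a SELECT on the same connection can return nothing.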
Another problem is that a connection-wide property doesn't fit well with the
JDBC pooling model: users will have to use different connections for streaming
and non-streaming work. Please see how Oracle did it - this is precisely what
I am talking about:

https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232

Two batching modes - one with explicit flush, another with implicit flush,
where Oracle decides on its own when it is best to communicate with the
server. The batching mode can be declared globally or at the per-statement
level. Simple and flexible.

On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> Gents,
>
> As Sergi suggested, batching and streaming are very different semantically.
>
> To use standard JDBC batching, all we need to do is convert it to a
> cache.putAll() method, as semantically a putAll(...) call is identical to a
> JDBC batch. Of course, if we see an UPDATE with a WHERE clause in between,
> then we may have to break a batch into several chunks and execute the
> update in between. The DataStreamer should not be used here.
>
> I believe that for streaming we need to add a special JDBC/ODBC connection
> flag. Whenever this flag is set to true, we should only allow INSERT
> or single-UPDATE operations and use the DataStreamer API internally. All
> operations other than INSERT or single-UPDATE should be prohibited.
>
> I think this design is semantically clear. Any objections?
>
> D.
>
> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vlady...@gmail.com>
> wrote:
>
> > If we use the Streamer, then `happens-before` is always broken. This is
> > ok, because the Streamer is for data loading, not for usual operation.
> >
> > We are not inventing any bicycles, just separating concerns: Batching
> > and Streaming.
> >
> > My point here is that they should not depend on each other at all:
> > Batching can work with or without Streaming, as well as Streaming can
> > work with or without Batching.
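The two Oracle-style batching modes referenced above (explicit flush on demand vs. implicit flush when the driver decides) can be sketched as follows. This is a hypothetical `BatchingModes` helper for illustrating the semantics only, not the Oracle or Ignite driver API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two batching modes: in explicit mode the buffer is sent
// only when the caller flushes; in implicit mode the driver flushes on
// its own once the buffer reaches a threshold. Hypothetical class.
public class BatchingModes {
    private final int threshold;          // <= 0 means explicit mode
    private final List<Object[]> buffer = new ArrayList<>();
    private int sentBatches;

    public BatchingModes(int threshold) { this.threshold = threshold; }

    public void addBatch(Object... params) {
        buffer.add(params);
        if (threshold > 0 && buffer.size() >= threshold)
            flush();                      // implicit flush: driver decides
    }

    public void flush() {                 // explicit flush: caller decides
        if (!buffer.isEmpty()) {
            sentBatches++;                // models one network round trip
            buffer.clear();
        }
    }

    public int sentBatches() { return sentBatches; }
    public int pending()     { return buffer.size(); }
}
```

With an explicit-mode instance, five `addBatch` calls produce one round trip on `flush()`; with a threshold of 2, the same five calls produce two automatic round trips plus one pending row. Allowing the mode per-statement as well as per-connection is what avoids the pooling problem described above.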
> >
> > Your proposal is a set of non-obvious rules for them to work. I see no
> > reason for these complications.
> >
> > Sergi
> >
> >
> > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> >
> > > Sergi,
> > >
> > > If a user calls a single *execute()* operation, then most likely it is
> > > not batching. We should not rely on the strange case where a user
> > > performs batching without using the standard and well-adopted JDBC
> > > batching API. The main problem with the streamer is that it is async
> > > and hence breaks happens-before guarantees in a single thread: a
> > > SELECT after an INSERT might not return the inserted value.
> > >
> > > Honestly, I do not really understand why we are trying to re-invent a
> > > bicycle here. There is a standard API - let's just use it and make it
> > > flexible enough to take advantage of IgniteDataStreamer if needed.
> > >
> > > Is there any use case which is not covered by this solution? Or let me
> > > ask from the opposite side - are there any well-known JDBC drivers
> > > which perform batching/streaming from non-batched update statements?
> > >
> > > Vladimir.
> > >
> > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vlady...@gmail.com>
> > > wrote:
> > >
> > > > Vladimir,
> > > >
> > > > I see no reason to forbid Streamer usage from non-batched statement
> > > > execution. It is common that users already have their ETL tools, and
> > > > you can't be sure whether they use batching or not.
> > > >
> > > > Alex,
> > > >
> > > > I guess we have to decide on Streaming first and then discuss
> > > > Batching separately, ok? Because this decision may become important
> > > > for the batching implementation.
> > > >
> > > > Sergi
> > > >
> > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
> > > >
> > > > > Alex,
> > > > >
> > > > > In most cases JdbcQueryTask should be executed locally on the
> > > > > client node started by the JDBC driver.
> > > > >
> > > > > JdbcQueryTask.QueryResult res =
> > > > >     loc ? qryTask.call() :
> > > > >     ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> > > > >
> > > > > Is this still valid behavior after introducing the DML
> > > > > functionality?
> > > > >
> > > > > In cases when the user wants to execute a query on a specific
> > > > > node, he should fully understand what he wants and what can go
> > > > > wrong.
> > > > >
> > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> > > > > <alexander.a.pasche...@gmail.com> wrote:
> > > > > > Sergi,
> > > > > >
> > > > > > JDBC batching might work quite differently from driver to
> > > > > > driver. Say, MySQL happily rewrites queries as I suggested in
> > > > > > the beginning of this thread (it's not the only strategy, but
> > > > > > one of the possible options) - and, BTW, I would like to hear at
> > > > > > least an opinion about it.
> > > > > >
> > > > > > On your first approach, the section before the streamer: you
> > > > > > suggest that we send a single statement and multiple param sets
> > > > > > as a single query task, am I right? (Just to make sure that I
> > > > > > got you properly.) If so, do you also mean that the API (namely
> > > > > > JdbcQueryTask) between server and client should also change? Or
> > > > > > should new API means be added to facilitate batching tasks?
> > > > > >
> > > > > > - Alex
> > > > > >
> > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vlady...@gmail.com>:
> > > > > >> Guys,
> > > > > >>
> > > > > >> I discussed this feature with Dmitriy and we came to the
> > > > > >> conclusion that batching in JDBC and Data Streaming in Ignite
> > > > > >> have different semantics and performance characteristics. Thus
> > > > > >> they are independent features (they may work together or
> > > > > >> separately, but this is another story).
> > > > > >>
> > > > > >> Let me explain.
> > > > > >>
> > > > > >> This is how JDBC batching works:
> > > > > >> - Add N sets of parameters to a prepared statement.
> > > > > >> - Manually execute the prepared statement.
> > > > > >> - Repeat until all the data is loaded.
> > > > > >>
> > > > > >> This is how the data streamer works:
> > > > > >> - Keep adding data.
> > > > > >> - The streamer will buffer the data and load buffered per-node
> > > > > >>   batches when they are big enough.
> > > > > >> - Close the streamer to make sure that everything is loaded.
> > > > > >>
> > > > > >> As you can see, we have a difference in the semantics of when
> > > > > >> we send data: if in our JDBC we allow sending batches to nodes
> > > > > >> without calling `execute` (and probably we would need to make
> > > > > >> `execute` a no-op here), then we are violating JDBC semantics;
> > > > > >> if we disallow this behavior, then this batching will
> > > > > >> underperform.
> > > > > >>
> > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> > > > > >> Streaming) separate.
> > > > > >>
> > > > > >> As I already said, they can work together: Batching will batch
> > > > > >> parameters, on `execute` they will go to the Streamer in one
> > > > > >> shot, and the Streamer will deal with the rest.
> > > > > >>
> > > > > >> Sergi
> > > > > >>
> > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> > > > > >>
> > > > > >>> Hi Alex,
> > > > > >>>
> > > > > >>> To my understanding there are two possible approaches to
> > > > > >>> batching in the JDBC layer:
> > > > > >>>
> > > > > >>> 1) Rely on the default batching API, specifically
> > > > > >>> *PreparedStatement.addBatch()* [1] and others.
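The per-node buffering behavior listed above can be sketched with a small toy model (hypothetical `PerNodeBuffers` class; node selection by key hash stands in for Ignite's affinity function, and this is not IgniteDataStreamer internals):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of streamer buffering: entries are grouped into per-node
// buffers (node chosen by key hash) and a buffer is shipped as soon as
// it is big enough; close() flushes the remainder. Hypothetical sketch.
public class PerNodeBuffers {
    private final int nodes, bufSize;
    private final Map<Integer, List<String>> bufs = new HashMap<>();
    private int shippedBatches;

    public PerNodeBuffers(int nodes, int bufSize) {
        this.nodes = nodes;
        this.bufSize = bufSize;
    }

    public void add(String key) {
        int node = Math.abs(key.hashCode()) % nodes;   // pick target node
        List<String> buf = bufs.computeIfAbsent(node, n -> new ArrayList<>());
        buf.add(key);
        if (buf.size() >= bufSize) {                   // big enough: ship it
            shippedBatches++;
            buf.clear();
        }
    }

    public void close() {                              // flush what's left
        for (List<String> buf : bufs.values())
            if (!buf.isEmpty()) { shippedBatches++; buf.clear(); }
    }

    public int shippedBatches() { return shippedBatches; }
}
```

The contrast with JDBC batching is visible here: data may leave the client at any `add` call, not at an explicit `execute`, which is exactly the semantic gap discussed in this thread.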
> > > > > >>> This is a nice and clear API, users are used to it, and its
> > > > > >>> adoption will minimize user code changes when migrating from
> > > > > >>> other JDBC sources. We simply accumulate updates locally and
> > > > > >>> then execute them all at once with only a single network hop
> > > > > >>> to the servers. *IgniteDataStreamer* can be used underneath.
> > > > > >>>
> > > > > >>> 2) Or we can have a separate connection flag which will move
> > > > > >>> all INSERT/UPDATE/DELETE statements through the streamer.
> > > > > >>>
> > > > > >>> I prefer the first approach.
> > > > > >>>
> > > > > >>> Also we need to keep in mind that the data streamer has poor
> > > > > >>> performance when adding single key-value pairs due to high
> > > > > >>> overhead on concurrency and other bookkeeping. Instead, it is
> > > > > >>> better to pre-batch key-value pairs before giving them to the
> > > > > >>> streamer.
> > > > > >>>
> > > > > >>> Vladimir.
> > > > > >>>
> > > > > >>> [1]
> > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
> > > > > >>>
> > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
> > > > > >>> alexander.a.pasche...@gmail.com> wrote:
> > > > > >>>
> > > > > >>> > Hello Igniters,
> > > > > >>> >
> > > > > >>> > One of the major improvements to DML has to be support of
> > > > > >>> > batch statements. I'd like to discuss its implementation.
> > > > > >>> > The suggested approach is to rewrite the given query,
> > > > > >>> > turning it from a few INSERTs into a single statement and
> > > > > >>> > processing the arguments accordingly.
> > > > > >>> > I suggest this as long as the whole point of batching is to
> > > > > >>> > make as few interactions with the cluster as possible and to
> > > > > >>> > make operations as condensed as possible; in the case of
> > > > > >>> > Ignite it means that we should send as few JdbcQueryTasks as
> > > > > >>> > possible. And, as long as a query task holds a single query
> > > > > >>> > and its arguments, this approach will not require any
> > > > > >>> > changes to the current design and won't break any backward
> > > > > >>> > compatibility - all the dirty work of rewriting will be done
> > > > > >>> > by the JDBC driver.
> > > > > >>> > Without rewriting, we could introduce some new query task
> > > > > >>> > for batch operations, but that would make it impossible to
> > > > > >>> > send such requests from newer clients to older servers (say,
> > > > > >>> > servers of version 1.8.0, which do not know about batching,
> > > > > >>> > let alone older versions).
> > > > > >>> > I'd like to hear comments and suggestions from the
> > > > > >>> > community. Thanks!
> > > > > >>> >
> > > > > >>> > - Alex

--
Vladimir Ozerov
Senior Software Architect
GridGain Systems
www.gridgain.com
*+7 (960) 283 98 40*
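For reference, the query-rewriting approach discussed in this thread (the MySQL-style rewrite of a batched single-row INSERT into one multi-row INSERT) can be sketched as follows. The `BatchRewriter` helper is a hypothetical illustration, not the actual Ignite driver code, and it deliberately ignores edge cases such as "VALUES" appearing inside literals:

```java
// Sketch of the batch rewrite: one single-row INSERT plus N parameter
// sets becomes one multi-row INSERT, so the driver can ship a single
// statement (one JdbcQueryTask) instead of N. Hypothetical helper.
public class BatchRewriter {
    public static String rewriteInsert(String singleRowInsert, int paramSets) {
        // Locate the VALUES keyword (naive: assumes it appears once).
        int idx = singleRowInsert.toUpperCase().lastIndexOf("VALUES");
        String head = singleRowInsert.substring(0, idx + "VALUES".length());
        String row = singleRowInsert.substring(idx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < paramSets; i++)
            sb.append(i == 0 ? " " : ", ").append(row); // repeat the row template

        return sb.toString();
    }
}
```

For example, `rewriteInsert("INSERT INTO p (id, name) VALUES (?, ?)", 3)` yields `INSERT INTO p (id, name) VALUES (?, ?), (?, ?), (?, ?)`, after which the N parameter sets are concatenated into a single flat argument array.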