OK folks, both data streamer support and batching support have been implemented.
Resulting design fully conforms to what Dima suggested initially: these
two concepts are separated.

Streamed statements are turned on by a connection flag, and the stream
auto flush timeout can be tuned in the same way. These statements support
INSERT and MERGE w/o subquery as well as fast key-bounded DELETE and
UPDATE. Each prepared statement in streamed mode has its own streamer
object, and their lifecycles are the same: on close, the statement closes
its streamer. Streaming mode is available only in "local" mode of
connection between the JDBC driver and the Ignite client (the default
mode, when the JDBC driver creates an Ignite client node by itself):
there would be no sense in streaming if query args had to travel over
the network.

Batched statements are used via the conventional JDBC API (setArgs...
addBatch... executeBatch...). They also support INSERT and MERGE w/o
subquery as well as fast key (and, optionally, value) bounded DELETE and
UPDATE. These work in a similar manner to non-batched statements and
likewise rely on the traditional putAll/invokeAll routines. Essentially,
batching is just a way to pass a bigger map to cache.putAll without
writing a single very long query. This works in "local" as well as
"remote" Ignite JDBC connectivity mode.

More info (details are in the comments):
Batching - https://issues.apache.org/jira/browse/IGNITE-4269
Streaming - https://issues.apache.org/jira/browse/IGNITE-4169

Regards,
Alex

2016-12-10 23:39 GMT+03:00 Dmitriy Setrakyan <[email protected]>:
> Alex,
>
> It seems to me that replace semantics can be implemented with a
> StreamReceiver, no?
>
> D.
>
> On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <
> [email protected]> wrote:
>
>> Sorry, "no relation w/JDBC" in my previous message should read "no
>> relation w/JDBC batching".
>>
>> — Alex
>> On Dec 10, 2016
>> at 1:52 PM, "Alexander Paschenko" <[email protected]> wrote:
>>
>> > Dima,
>> >
>> > I would like to point out that data streamer support had already been
>> > implemented in the course of the work on DML in 1.8 exactly as you are
>> > suggesting now (turned on via a connection flag; allowed only MERGE — the
>> > data streamer can't do putIfAbsent stuff, right?; absolutely no relation
>> > w/JDBC), *but* that patch was reverted — on advice from Vlad which I
>> > believe had been agreed with you, so it didn't make it into 1.8 after all.
>> > Also, while it's possible to maintain INSERT vs MERGE semantics using the
>> > streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
>> > here: the streamer skips the put only in the case when the key is present
>> > AND allowOverwrite is false, while UPDATE should not put anything when the
>> > key is *missing* — i.e., there's no way to emulate the cache's *replace*
>> > operation semantics with the streamer (update the value only if the key is
>> > present, otherwise do nothing).
>> >
>> > — Alex
>> > On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <[email protected]> wrote:
>> >
>> >> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <[email protected]>
>> >> wrote:
>> >>
>> >> > I already expressed my concern - this is a counterintuitive approach.
>> >> > Because without happens-before, the pure streaming model can be
>> >> > applied only to independent chunks of data. It means that the
>> >> > mentioned ETL use case is not feasible - ETL always depends on
>> >> > implicit or explicit links between tables, and hence streaming is not
>> >> > applicable here. My question still stands - what products except
>> >> > possibly Ignite do this kind of JDBC streaming?
>> >> >
>> >>
>> >> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() or
>> >> DataStreamer.addData().
>> >>
>> >> JDBC batching and putAll() are absolutely identical.
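[Editor's note: Alex's argument above is that the streamer offers only unconditional put (allowOverwrite=true) and put-if-absent (allowOverwrite=false), while UPDATE needs a third, "replace" semantic, and Dmitriy's reply at the top of the thread suggests a StreamReceiver for that case. A plain-Java sketch of the gap, where a ConcurrentMap stands in for the cache; none of the names below are Ignite API:]

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: the three write semantics discussed in the thread,
// expressed against a plain ConcurrentMap instead of an Ignite cache.
public class WriteSemantics {
    /** MERGE / streamer with allowOverwrite=true: write unconditionally. */
    static void merge(ConcurrentMap<Integer, String> cache, int key, String val) {
        cache.put(key, val);
    }

    /** INSERT / streamer with allowOverwrite=false: write only if the key is absent. */
    static boolean insert(ConcurrentMap<Integer, String> cache, int key, String val) {
        return cache.putIfAbsent(key, val) == null;
    }

    /** UPDATE: write only if the key is present - the semantic neither streamer mode gives. */
    static boolean update(ConcurrentMap<Integer, String> cache, int key, String val) {
        return cache.replace(key, val) != null;
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        System.out.println(insert(cache, 1, "a")); // true: key was absent
        System.out.println(insert(cache, 1, "b")); // false: key present, "a" kept
        System.out.println(update(cache, 2, "c")); // false: key absent, nothing written
        merge(cache, 2, "c");                      // unconditional write
        System.out.println(update(cache, 2, "d")); // true: key present, value replaced
    }
}
```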
>> >> If you see it as counter-intuitive, I would ask for a concrete example.
>> >>
>> >> As far as links between data go, Ignite does not have foreign-key
>> >> constraints, so DataStreamer can insert data in any order (but again,
>> >> not as part of a JDBC batch).
>> >>
>> >> > Another problem is that a connection-wide property doesn't fit well
>> >> > in the JDBC pooling model. Users will have to use different
>> >> > connections for streaming and non-streaming approaches.
>> >> >
>> >>
>> >> Using DataStreamer is not possible within the JDBC batching paradigm,
>> >> period. I wish we could drop the high-level-feels-good discussions
>> >> altogether, because it seems like we are spinning wheels here.
>> >>
>> >> There is no way to use the streamer in a JDBC context, unless we add a
>> >> connection flag. Again, if you disagree, I would prefer to see a
>> >> concrete example explaining why.
>> >>
>> >> > Please see how Oracle did that, this is precisely what I am talking
>> >> > about:
>> >> > https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>> >> > Two batching modes - one with explicit flush, another one with
>> >> > implicit flush, where Oracle decides on its own when it is better to
>> >> > communicate with the server. Batching mode can be declared globally
>> >> > or on a per-statement level. Simple and flexible.
>> >> >
>> >> > On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <[email protected]> wrote:
>> >> >
>> >> > > Gents,
>> >> > >
>> >> > > As Sergi suggested, batching and streaming are very different
>> >> > > semantically.
>> >> > >
>> >> > > To use standard JDBC batching, all we need to do is convert it to a
>> >> > > cache.putAll() method, as semantically a putAll(...) call is
>> >> > > identical to a JDBC batch.
>> >> > > Of course, if we see an UPDATE with a WHERE clause in between,
>> >> > > then we may have to break a batch into several chunks and execute
>> >> > > the update in between. The DataStreamer should not be used here.
>> >> > >
>> >> > > I believe that for streaming we need to add a special JDBC/ODBC
>> >> > > connection flag. Whenever this flag is set to true, we should only
>> >> > > allow INSERT or single-UPDATE operations and use the DataStreamer
>> >> > > API internally. All operations other than INSERT or single-UPDATE
>> >> > > should be prohibited.
>> >> > >
>> >> > > I think this design is semantically clear. Any objections?
>> >> > >
>> >> > > D.
>> >> > >
>> >> > > On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <[email protected]> wrote:
>> >> > >
>> >> > > > If we use the Streamer, then we always have `happens-before`
>> >> > > > broken. This is OK, because the Streamer is for data loading, not
>> >> > > > for usual operation.
>> >> > > >
>> >> > > > We are not inventing any bicycles, just separating concerns:
>> >> > > > Batching and Streaming.
>> >> > > >
>> >> > > > My point here is that they should not depend on each other at
>> >> > > > all: Batching can work with or without Streaming, just as
>> >> > > > Streaming can work with or without Batching.
>> >> > > >
>> >> > > > Your proposal is a set of non-obvious rules for them to work. I
>> >> > > > see no reason for these complications.
>> >> > > >
>> >> > > > Sergi
>> >> > > >
>> >> > > > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <[email protected]>:
>> >> > > >
>> >> > > > > Sergi,
>> >> > > > >
>> >> > > > > If a user calls a single *execute()* operation, then most
>> >> > > > > likely it is not batching. We should not rely on the strange
>> >> > > > > case where a user performs batching without using the standard
>> >> > > > > and well-adopted batching JDBC API.
>> >> > > > > The main problem with the streamer is that it is async and
>> >> > > > > hence breaks happens-before guarantees in a single thread: a
>> >> > > > > SELECT after an INSERT might not return the inserted value.
>> >> > > > >
>> >> > > > > Honestly, I do not really understand why we are trying to
>> >> > > > > re-invent a bicycle here. There is a standard API - let's just
>> >> > > > > use it and make it flexible enough to take advantage of
>> >> > > > > IgniteDataStreamer if needed.
>> >> > > > >
>> >> > > > > Is there any use case which is not covered by this solution? Or
>> >> > > > > let me ask from the opposite side - are there any well-known
>> >> > > > > JDBC drivers which perform batching/streaming from non-batched
>> >> > > > > update statements?
>> >> > > > >
>> >> > > > > Vladimir.
>> >> > > > >
>> >> > > > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <[email protected]> wrote:
>> >> > > > >
>> >> > > > > > Vladimir,
>> >> > > > > >
>> >> > > > > > I see no reason to forbid Streamer usage from non-batched
>> >> > > > > > statement execution. It is common that users already have
>> >> > > > > > their ETL tools, and you can't be sure if they use batching
>> >> > > > > > or not.
>> >> > > > > >
>> >> > > > > > Alex,
>> >> > > > > >
>> >> > > > > > I guess we have to decide on Streaming first and then discuss
>> >> > > > > > Batching separately, OK? Because this decision may become
>> >> > > > > > important for the batching implementation.
>> >> > > > > >
>> >> > > > > > Sergi
>> >> > > > > >
>> >> > > > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <[email protected]>:
>> >> > > > > >
>> >> > > > > > > Alex,
>> >> > > > > > >
>> >> > > > > > > In most cases JdbcQueryTask should be executed locally on
>> >> > > > > > > the client node started by the JDBC driver.
>> >> > > > > > >
>> >> > > > > > >     JdbcQueryTask.QueryResult res = loc ? qryTask.call() :
>> >> > > > > > >         ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>> >> > > > > > >
>> >> > > > > > > Is it still valid behavior after introducing DML
>> >> > > > > > > functionality?
>> >> > > > > > >
>> >> > > > > > > In cases when a user wants to execute a query on a specific
>> >> > > > > > > node, he should fully understand what he wants and what can
>> >> > > > > > > go wrong.
>> >> > > > > > >
>> >> > > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
>> >> > > > > > > <[email protected]> wrote:
>> >> > > > > > > > Sergi,
>> >> > > > > > > >
>> >> > > > > > > > JDBC batching might work quite differently from driver to
>> >> > > > > > > > driver. Say, MySQL happily rewrites queries as I suggested
>> >> > > > > > > > in the beginning of this thread (it's not the only
>> >> > > > > > > > strategy, but one of the possible options) - and, BTW, I
>> >> > > > > > > > would like to hear at least an opinion about it.
>> >> > > > > > > >
>> >> > > > > > > > On your first approach, the section before the streamer:
>> >> > > > > > > > you suggest that we send a single statement and multiple
>> >> > > > > > > > param sets as a single query task, am I right? (Just to
>> >> > > > > > > > make sure that I got you properly.) If so, do you also
>> >> > > > > > > > mean that the API (namely JdbcQueryTask) between server
>> >> > > > > > > > and client should also change? Or should new API methods
>> >> > > > > > > > be added to facilitate batching tasks?
>> >> > > > > > > >
>> >> > > > > > > > - Alex
>> >> > > > > > > >
>> >> > > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <[email protected]>:
>> >> > > > > > > >> Guys,
>> >> > > > > > > >>
>> >> > > > > > > >> I discussed this feature with Dmitriy and we came to the
>> >> > > > > > > >> conclusion that batching in JDBC and Data Streaming in
>> >> > > > > > > >> Ignite have different semantics and performance
>> >> > > > > > > >> characteristics. Thus they are independent features (they
>> >> > > > > > > >> may work together or separately, but that is another
>> >> > > > > > > >> story).
>> >> > > > > > > >>
>> >> > > > > > > >> Let me explain.
>> >> > > > > > > >>
>> >> > > > > > > >> This is how JDBC batching works:
>> >> > > > > > > >> - Add N sets of parameters to a prepared statement.
>> >> > > > > > > >> - Manually execute the prepared statement.
>> >> > > > > > > >> - Repeat until all the data is loaded.
>> >> > > > > > > >>
>> >> > > > > > > >> This is how the data streamer works:
>> >> > > > > > > >> - Keep adding data.
>> >> > > > > > > >> - The streamer will buffer and load buffered per-node
>> >> > > > > > > >>   batches when they are big enough.
>> >> > > > > > > >> - Close the streamer to make sure everything is loaded.
>> >> > > > > > > >>
>> >> > > > > > > >> As you can see, we have a difference in the semantics of
>> >> > > > > > > >> when we send data: if in our JDBC we allow sending
>> >> > > > > > > >> batches to nodes without calling `execute` (and probably
>> >> > > > > > > >> we will need to make `execute` a no-op here), then we are
>> >> > > > > > > >> violating the semantics of JDBC; if we disallow this
>> >> > > > > > > >> behavior, then this batching will underperform.
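[Editor's note: Sergi's two lists above contrast caller-driven flushing (nothing moves until `execute`) with threshold-driven flushing (data moves whenever a buffer fills, plus on close). A toy plain-Java model of the two disciplines; the class and method names are made up for illustration and are not Ignite or JDBC API:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of the two flush disciplines: a JDBC-style batch flushes only
// when the caller executes it; a streamer-style buffer ships data on its
// own once a size threshold is reached, and flushes the tail on close().
public class FlushModels {
    /** JDBC batching: nothing is sent until executeBatch() is called. */
    static class JdbcBatch<T> {
        private final List<T> buf = new ArrayList<>();
        private final Consumer<List<T>> sink;
        JdbcBatch(Consumer<List<T>> sink) { this.sink = sink; }
        void addBatch(T row) { buf.add(row); }
        void executeBatch() { sink.accept(new ArrayList<>(buf)); buf.clear(); }
    }

    /** Streamer: rows ship automatically once the buffer is full. */
    static class Streamer<T> implements AutoCloseable {
        private final List<T> buf = new ArrayList<>();
        private final int threshold;
        private final Consumer<List<T>> sink;
        Streamer(int threshold, Consumer<List<T>> sink) {
            this.threshold = threshold;
            this.sink = sink;
        }
        void addData(T row) {
            buf.add(row);
            if (buf.size() >= threshold) flush();
        }
        private void flush() {
            if (!buf.isEmpty()) { sink.accept(new ArrayList<>(buf)); buf.clear(); }
        }
        @Override public void close() { flush(); }
    }

    public static void main(String[] args) {
        List<List<Integer>> sent = new ArrayList<>();

        JdbcBatch<Integer> batch = new JdbcBatch<>(sent::add);
        for (int i = 0; i < 5; i++) batch.addBatch(i);
        batch.executeBatch();     // one explicit flush of all 5 rows
        System.out.println(sent); // [[0, 1, 2, 3, 4]]

        sent.clear();
        try (Streamer<Integer> s = new Streamer<>(2, sent::add)) {
            for (int i = 0; i < 5; i++) s.addData(i);
        }                         // auto-flushes [0,1] and [2,3]; close() flushes [4]
        System.out.println(sent); // [[0, 1], [2, 3], [4]]
    }
}
```

Note how the caller of the streamer never decides when data moves, which is exactly the happens-before concern raised elsewhere in the thread.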
>> >> > > > > > > >>
>> >> > > > > > > >> Thus I suggest keeping these features (JDBC Batching and
>> >> > > > > > > >> JDBC Streaming) separate.
>> >> > > > > > > >>
>> >> > > > > > > >> As I already said, they can work together: Batching will
>> >> > > > > > > >> batch parameters, and on `execute` they will go to the
>> >> > > > > > > >> Streamer in one shot and the Streamer will deal with the
>> >> > > > > > > >> rest.
>> >> > > > > > > >>
>> >> > > > > > > >> Sergi
>> >> > > > > > > >>
>> >> > > > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <[email protected]>:
>> >> > > > > > > >>
>> >> > > > > > > >>> Hi Alex,
>> >> > > > > > > >>>
>> >> > > > > > > >>> To my understanding there are two possible approaches to
>> >> > > > > > > >>> batching in the JDBC layer:
>> >> > > > > > > >>>
>> >> > > > > > > >>> 1) Rely on the default batching API, specifically
>> >> > > > > > > >>> *PreparedStatement.addBatch()* [1] and others. This is a
>> >> > > > > > > >>> nice and clear API, users are used to it, and its
>> >> > > > > > > >>> adoption will minimize user code changes when migrating
>> >> > > > > > > >>> from other JDBC sources. We simply copy updates locally
>> >> > > > > > > >>> and then execute them all at once with only a single
>> >> > > > > > > >>> network hop to the servers. *IgniteDataStreamer* can be
>> >> > > > > > > >>> used underneath.
>> >> > > > > > > >>>
>> >> > > > > > > >>> 2) Or we can have a separate connection flag which will
>> >> > > > > > > >>> move all INSERT/UPDATE/DELETE statements through the
>> >> > > > > > > >>> streamer.
>> >> > > > > > > >>>
>> >> > > > > > > >>> I prefer the first approach.
>> >> > > > > > > >>>
>> >> > > > > > > >>> Also we need to keep in mind that the data streamer has
>> >> > > > > > > >>> poor performance when adding single key-value pairs due
>> >> > > > > > > >>> to high overhead on concurrency and other bookkeeping.
>> >> > > > > > > >>> Instead, it is better to pre-batch key-value pairs
>> >> > > > > > > >>> before giving them to the streamer.
>> >> > > > > > > >>>
>> >> > > > > > > >>> Vladimir.
>> >> > > > > > > >>>
>> >> > > > > > > >>> [1]
>> >> > > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>> >> > > > > > > >>>
>> >> > > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
>> >> > > > > > > >>> [email protected]> wrote:
>> >> > > > > > > >>>
>> >> > > > > > > >>> > Hello Igniters,
>> >> > > > > > > >>> >
>> >> > > > > > > >>> > One of the major improvements to DML has to be
>> >> > > > > > > >>> > support of batch statements. I'd like to discuss its
>> >> > > > > > > >>> > implementation. The suggested approach is to rewrite
>> >> > > > > > > >>> > the given query, turning it from a few INSERTs into a
>> >> > > > > > > >>> > single statement, and to process the arguments
>> >> > > > > > > >>> > accordingly. I suggest this because the whole point of
>> >> > > > > > > >>> > batching is to make as few interactions with the
>> >> > > > > > > >>> > cluster as possible and to make operations as
>> >> > > > > > > >>> > condensed as possible, and in the case of Ignite it
>> >> > > > > > > >>> > means that we should send as few JdbcQueryTasks as
>> >> > > > > > > >>> > possible.
>> >> > > > > > > >>> > And, as long as a query task holds a single query and
>> >> > > > > > > >>> > its arguments, this approach will not require any
>> >> > > > > > > >>> > changes to the current design and won't break any
>> >> > > > > > > >>> > backward compatibility - all the dirty work on
>> >> > > > > > > >>> > rewriting will be done by the JDBC driver.
>> >> > > > > > > >>> > Without rewriting, we could introduce some new query
>> >> > > > > > > >>> > task for batch operations, but that would make it
>> >> > > > > > > >>> > impossible to send such requests from newer clients to
>> >> > > > > > > >>> > older servers (say, servers of version 1.8.0, which
>> >> > > > > > > >>> > does not know about batching, let alone older
>> >> > > > > > > >>> > versions).
>> >> > > > > > > >>> > I'd like to hear comments and suggestions from the
>> >> > > > > > > >>> > community. Thanks!
>> >> > > > > > > >>> >
>> >> > > > > > > >>> > - Alex
>> >
>> > --
>> > Vladimir Ozerov
>> > Senior Software Architect
>> > GridGain Systems
>> > www.gridgain.com
>> > *+7 (960) 283 98 40*
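[Editor's note: the query-rewrite approach proposed in the opening message just above (collapsing a batch of identical single-row INSERTs into one multi-row INSERT so that only one JdbcQueryTask travels to the cluster) can be sketched as follows. Illustrative only: the rewriteInsert helper is hypothetical and not part of the driver, and real rewriting would also have to handle MERGE, argument binding order, and SQL corner cases:]

```java
// Sketch of the batch-rewrite idea: repeat the VALUES row template of a
// single-row INSERT once per collected parameter set, producing one
// multi-row statement. (Hypothetical helper, not Ignite driver code.)
public class BatchRewrite {
    /** Repeats the VALUES row template of {@code sql} {@code rows} times. */
    static String rewriteInsert(String sql, int rows) {
        int idx = sql.toUpperCase().lastIndexOf("VALUES");
        if (idx < 0 || rows < 1)
            throw new IllegalArgumentException("Expected a single-row INSERT ... VALUES (...)");
        String head = sql.substring(0, idx + "VALUES".length());
        String row = sql.substring(idx + "VALUES".length()).trim(); // e.g. "(?, ?)"
        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < rows; i++)
            sb.append(i == 0 ? " " : ", ").append(row);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three addBatch() calls against the same prepared statement become
        // one statement; the driver would then bind the three argument sets
        // to the nine parameters in order.
        System.out.println(rewriteInsert("INSERT INTO person (id, name) VALUES (?, ?)", 3));
        // prints: INSERT INTO person (id, name) VALUES (?, ?), (?, ?), (?, ?)
    }
}
```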
