Re: Batch DML queries design discussion

Dmitriy Setrakyan Sat, 10 Dec 2016 12:41:43 -0800

Alex,

It seems to me that replace semantics can be implemented with
StreamReceiver, no?
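
To illustrate the idea, here is a minimal sketch of replace-only semantics
applied to a batch of entries. It is not Ignite code: a plain
ConcurrentHashMap stands in for the cache, and the class name
ReplaceReceiverSketch and its receive() method are hypothetical names chosen
for illustration. The assumption is that a real StreamReceiver, which is
handed the cache and a collection of entries, could apply the cache's
replace operation in the same way.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReplaceReceiverSketch {
    /**
     * Applies each entry with replace semantics: the value is written only
     * if the key is already present; entries for missing keys are skipped.
     * Returns the number of entries that were actually applied.
     * (Hypothetical helper; the map is a stand-in for the cache.)
     */
    public static <K, V> int receive(ConcurrentMap<K, V> cache,
                                     Collection<Map.Entry<K, V>> entries) {
        int updated = 0;
        for (Map.Entry<K, V> e : entries) {
            // ConcurrentMap.replace(k, v) returns the previous value,
            // or null if the key was not mapped (nothing is written then).
            if (cache.replace(e.getKey(), e.getValue()) != null)
                updated++;
        }
        return updated;
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "old");

        List<Map.Entry<Integer, String>> batch = List.of(
            Map.entry(1, "new"),    // key present: value is updated
            Map.entry(2, "ghost")); // key missing: entry is ignored

        int updated = receive(cache, batch);
        System.out.println(updated + " updated, cache = " + cache);
    }
}
```

With this shape, an UPDATE for a missing key is a no-op, which is the
behavior that the allowOverwrite flag alone cannot express.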


D.

On Sat, Dec 10, 2016 at 2:54 AM, Alexander Paschenko <
alexander.a.pasche...@gmail.com> wrote:

> Sorry, "no relation w/JDBC" in my previous message should read "no relation
> w/JDBC batching".
>
> — Alex
> On Dec 10, 2016, at 1:52 PM, "Alexander Paschenko" <
> alexander.a.pasche...@gmail.com> wrote:
>
> > Dima,
> >
> > I would like to point out that data streamer support had already been
> > implemented in the course of work on DML in 1.8 exactly as you are
> > suggesting now (turned on via a connection flag; allowed only MERGE, since
> > the data streamer can't do putIfAbsent stuff, right?; absolutely no
> > relation w/JDBC), *but* that patch was reverted on advice from Vlad,
> > which I believe had been agreed with you, so it didn't make it to 1.8
> > after all.
> > Also, while it's possible to maintain INSERT vs MERGE semantics using the
> > streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
> > here: the streamer skips a put only when the key is present AND
> > allowOverwrite is false, while UPDATE should not put anything when the
> > key is *missing*. I.e., there's no way to emulate the cache's *replace*
> > operation semantics with the streamer (update the value only if the key
> > is present, otherwise do nothing).
> >
> > — Alex
> > On Dec 9, 2016, at 10:00 PM, "Dmitriy Setrakyan" <
> > dsetrak...@apache.org> wrote:
> >
> >> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com>
> >> wrote:
> >>
> >> > I already expressed my concern: this is a counterintuitive approach,
> >> > because without happens-before a pure streaming model can be applied
> >> > only to independent chunks of data. It means that the mentioned ETL use
> >> > case is not feasible: ETL always depends on implicit or explicit links
> >> > between tables, and hence streaming is not applicable here. My question
> >> > still stands: what products, except possibly Ignite, do this kind of
> >> > JDBC streaming?
> >> >
> >>
> >> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() or
> >> DataStreamer.addData().
> >>
> >> JDBC batching and putAll() are absolutely identical. If you see it as
> >> counter-intuitive, I would ask for a concrete example.
> >>
> >> As far as links between data go, Ignite does not have foreign-key
> >> constraints, so DataStreamer can insert data in any order (but again, not
> >> as part of a JDBC batch).
> >>
> >>
> >> >
> >> > Another problem is that a connection-wide property doesn't fit well in
> >> > the JDBC pooling model. Users will have to use different connections for
> >> > streaming and non-streaming approaches.
> >> >
> >>
> >> Using DataStreamer is not possible within the JDBC batching paradigm,
> >> period. I wish we could drop the high-level-feels-good discussions
> >> altogether, because it seems like we are spinning our wheels here.
> >>
> >> There is no way to use the streamer in a JDBC context, unless we add a
> >> connection flag. Again, if you disagree, I would prefer to see a concrete
> >> example explaining why.
> >>
> >>
> >> > Please see how Oracle did that, this is precisely what I am talking
> >> > about:
> >> > https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
> >> > Two batching modes: one with explicit flush, another with implicit
> >> > flush, when Oracle decides on its own when it is better to communicate
> >> > with the server. Batching mode can be declared globally or at the
> >> > per-statement level.
> >> > Simple and flexible.
> >> >
> >> >
> >> > On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <
> >> dsetrak...@apache.org>
> >> > wrote:
> >> >
> >> > > Gents,
> >> > >
> >> > > As Sergi suggested, batching and streaming are very different
> >> > > semantically.
> >> > >
> >> > > To use standard JDBC batching, all we need to do is convert it to a
> >> > > cache.putAll() method, as semantically a putAll(...) call is identical
> >> > > to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in
> >> > > between, then we may have to break a batch into several chunks and
> >> > > execute the update in between. The DataStreamer should not be used here.
> >> > >
> >> > > I believe that for streaming we need to add a special JDBC/ODBC
> >> > > connection flag. Whenever this flag is set to true, we should only allow
> >> > > INSERT or single-UPDATE operations and use the DataStreamer API
> >> > > internally. All operations other than INSERT or single-UPDATE should be
> >> > > prohibited.
> >> > >
> >> > > I think this design is semantically clear. Any objections?
> >> > >
> >> > > D.
> >> > >
> >> > > On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <
> >> sergi.vlady...@gmail.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > If we use Streamer, then we always have `happens-before` broken. This
> >> > > > is ok, because Streamer is for data loading, not for regular operation.
> >> > > >
> >> > > > We are not inventing any bicycles, just separating concerns: Batching
> >> > > > and Streaming.
> >> > > >
> >> > > > My point here is that they should not depend on each other at all:
> >> > > > Batching can work with or without Streaming, and Streaming can work
> >> > > > with or without Batching.
> >> > > >
> >> > > > Your proposal is a set of non-obvious rules for them to work. I see no
> >> > > > reason for these complications.
> >> > > >
> >> > > > Sergi
> >> > > >
> >> > > >
> >> > > > 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com
> >:
> >> > > >
> >> > > > > Sergi,
> >> > > > >
> >> > > > > If the user calls a single *execute()* operation, then most likely it
> >> > > > > is not batching. We should not rely on a strange case where the user
> >> > > > > performs batching without using the standard and well-adopted JDBC
> >> > > > > batching API. The main problem with the streamer is that it is async
> >> > > > > and hence breaks happens-before guarantees in a single thread: a
> >> > > > > SELECT after an INSERT might not return the inserted value.
> >> > > > >
> >> > > > > Honestly, I do not really understand why we are trying to re-invent a
> >> > > > > bicycle here. There is a standard API: let's just use it and make it
> >> > > > > flexible enough to take advantage of IgniteDataStreamer if needed.
> >> > > > >
> >> > > > > Is there any use case which is not covered by this solution? Or let
> >> > > > > me ask from the opposite side: are there any well-known JDBC drivers
> >> > > > > which perform batching/streaming from non-batched update statements?
> >> > > > >
> >> > > > > Vladimir.
> >> > > > >
> >> > > > > On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <
> >> > > sergi.vlady...@gmail.com
> >> > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Vladimir,
> >> > > > > >
> >> > > > > > I see no reason to forbid Streamer usage from non-batched statement
> >> > > > > > execution.
> >> > > > > > It is common that users already have their ETL tools and you can't
> >> > > > > > be sure if they use batching or not.
> >> > > > > >
> >> > > > > > Alex,
> >> > > > > >
> >> > > > > > I guess we have to decide on Streaming first and then we will
> >> > > > > > discuss Batching separately, ok? Because this decision may become
> >> > > > > > important for the batching implementation.
> >> > > > > >
> >> > > > > > Sergi
> >> > > > > >
> >> > > > > > 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
> >> > > > > >
> >> > > > > > > Alex,
> >> > > > > > >
> >> > > > > > > In most cases JdbcQueryTask should be executed locally on the
> >> > > > > > > client node started by the JDBC driver.
> >> > > > > > >
> >> > > > > > > JdbcQueryTask.QueryResult res =
> >> > > > > > >     loc ? qryTask.call() :
> >> > > > > > >     ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
> >> > > > > > >
> >> > > > > > > Is it valid behavior after introducing DML functionality?
> >> > > > > > >
> >> > > > > > > In cases when the user wants to execute a query on a specific node,
> >> > > > > > > he should fully understand what he wants and what can go wrong.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko
> >> > > > > > > <alexander.a.pasche...@gmail.com> wrote:
> >> > > > > > > > Sergi,
> >> > > > > > > >
> >> > > > > > > > JDBC batching might work quite differently from driver to driver.
> >> > > > > > > > Say, MySQL happily rewrites queries as I had suggested in the
> >> > > > > > > > beginning of this thread (it's not the only strategy, but one of
> >> > > > > > > > the possible options) - and, BTW, I would like to hear at least an
> >> > > > > > > > opinion about it.
> >> > > > > > > >
> >> > > > > > > > On your first approach, section before streamer: you suggest that
> >> > > > > > > > we send a single statement and multiple param sets as a single
> >> > > > > > > > query task, am I right? (Just to make sure that I got you
> >> > > > > > > > properly.) If so, do you also mean that the API (namely
> >> > > > > > > > JdbcQueryTask) between server and client should also change? Or
> >> > > > > > > > should new API means be added to facilitate batching tasks?
> >> > > > > > > >
> >> > > > > > > > - Alex
> >> > > > > > > >
> >> > > > > > > > 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <
> >> > > > sergi.vlady...@gmail.com
> >> > > > > >:
> >> > > > > > > >> Guys,
> >> > > > > > > >>
> >> > > > > > > >> I discussed this feature with Dmitriy and we came to the
> >> > > > > > > >> conclusion that batching in JDBC and Data Streaming in Ignite
> >> > > > > > > >> have different semantics and performance characteristics. Thus
> >> > > > > > > >> they are independent features (they may work together or
> >> > > > > > > >> separately, but this is another story).
> >> > > > > > > >>
> >> > > > > > > >> Let me explain.
> >> > > > > > > >>
> >> > > > > > > >> This is how JDBC batching works:
> >> > > > > > > >> - Add N sets of parameters to a prepared statement.
> >> > > > > > > >> - Manually execute prepared statement.
> >> > > > > > > >> - Repeat until all the data is loaded.
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> This is how data streamer works:
> >> > > > > > > >> - Keep adding data.
> >> > > > > > > >> - Streamer will buffer and load buffered per-node batches when
> >> > > > > > > >>   they are big enough.
> >> > > > > > > >> - Close streamer to make sure that everything is loaded.
> >> > > > > > > >>
> >> > > > > > > >> As you can see, there is a difference in the semantics of when we
> >> > > > > > > >> send data: if our JDBC driver allows sending batches to nodes
> >> > > > > > > >> without calling `execute` (and we would probably need to make
> >> > > > > > > >> `execute` a no-op here), then we violate the semantics of JDBC;
> >> > > > > > > >> if we disallow this behavior, then this batching will
> >> > > > > > > >> underperform.
> >> > > > > > > >>
> >> > > > > > > >> Thus I suggest keeping these features (JDBC Batching and JDBC
> >> > > > > > > >> Streaming) as separate features.
> >> > > > > > > >>
> >> > > > > > > >> As I already said, they can work together: Batching will batch
> >> > > > > > > >> parameters, and on `execute` they will go to the Streamer in one
> >> > > > > > > >> shot and the Streamer will deal with the rest.
> >> > > > > > > >>
> >> > > > > > > >> Sergi
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <
> >> > > voze...@gridgain.com
> >> > > > >:
> >> > > > > > > >>
> >> > > > > > > >>> Hi Alex,
> >> > > > > > > >>>
> >> > > > > > > >>> To my understanding there are two possible approaches to batching
> >> > > > > > > >>> in the JDBC layer:
> >> > > > > > > >>>
> >> > > > > > > >>> 1) Rely on the default batching API, specifically
> >> > > > > > > >>> *PreparedStatement.addBatch()* [1] and others. This is a nice and
> >> > > > > > > >>> clear API, users are used to it, and its adoption will minimize
> >> > > > > > > >>> user code changes when migrating from other JDBC sources. We
> >> > > > > > > >>> simply copy updates locally and then execute them all at once
> >> > > > > > > >>> with only a single network hop to servers. *IgniteDataStreamer*
> >> > > > > > > >>> can be used underneath.
> >> > > > > > > >>>
> >> > > > > > > >>> 2) Or we can have a separate connection flag which will move all
> >> > > > > > > >>> INSERT/UPDATE/DELETE statements through the streamer.
> >> > > > > > > >>>
> >> > > > > > > >>> I prefer the first approach.
> >> > > > > > > >>>
> >> > > > > > > >>> Also, we need to keep in mind that the data streamer has poor
> >> > > > > > > >>> performance when adding single key-value pairs, due to high
> >> > > > > > > >>> overhead on concurrency and other bookkeeping. Instead, it is
> >> > > > > > > >>> better to pre-batch key-value pairs before giving them to the
> >> > > > > > > >>> streamer.
> >> > > > > > > >>>
> >> > > > > > > >>> Vladimir.
> >> > > > > > > >>>
> >> > > > > > > >>> [1]
> >> > > > > > > >>> https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
> >> > > > > > > >>>
> >> > > > > > > >>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <
> >> > > > > > > >>> alexander.a.pasche...@gmail.com> wrote:
> >> > > > > > > >>>
> >> > > > > > > >>> > Hello Igniters,
> >> > > > > > > >>> >
> >> > > > > > > >>> > One of the major improvements to DML has to be support for batch
> >> > > > > > > >>> > statements. I'd like to discuss its implementation. The suggested
> >> > > > > > > >>> > approach is to rewrite the given query, turning it from a few
> >> > > > > > > >>> > INSERTs into a single statement and processing arguments
> >> > > > > > > >>> > accordingly. I suggest this since the whole point of batching is
> >> > > > > > > >>> > to make as few interactions with the cluster as possible and to
> >> > > > > > > >>> > make operations as condensed as possible, and in the case of
> >> > > > > > > >>> > Ignite it means that we should send as few JdbcQueryTasks as
> >> > > > > > > >>> > possible. And, since a query task holds a single query and its
> >> > > > > > > >>> > arguments, this approach will not require any changes to the
> >> > > > > > > >>> > current design and won't break backward compatibility - all the
> >> > > > > > > >>> > dirty work of rewriting will be done by the JDBC driver.
> >> > > > > > > >>> > Without rewriting, we could introduce some new query task for
> >> > > > > > > >>> > batch operations, but that would make it impossible to send such
> >> > > > > > > >>> > requests from newer clients to older servers (say, servers of
> >> > > > > > > >>> > version 1.8.0, which do not know about batching, let alone older
> >> > > > > > > >>> > versions).
> >> > > > > > > >>> > I'd like to hear comments and suggestions from the community.
> >> > > > > > > >>> > Thanks!
> >> > > > > > > >>> >
> >> > > > > > > >>> > - Alex
> >> > > > > > > >>> >
> >> > > > > > > >>>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Vladimir Ozerov
> >> > Senior Software Architect
> >> > GridGain Systems
> >> > www.gridgain.com
> >> > *+7 (960) 283 98 40*
> >> >
> >>
> >
>
