Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".
— Alex

On Dec 10, 2016 at 1:52 PM, "Alexander Paschenko" <alexander.a.pasche...@gmail.com> wrote:

> Dima,
>
> I would like to point out that data streamer support had already been
> implemented in the course of the work on DML in 1.8, exactly as you are
> suggesting now (turned on via a connection flag; only MERGE was allowed —
> the data streamer can't do putIfAbsent stuff, right?; absolutely no
> relation w/JDBC), *but* that patch was reverted — on advice from Vlad
> which I believe had been agreed with you — so it didn't make it into 1.8
> after all.
>
> Also, while it's possible to maintain INSERT vs MERGE semantics using the
> streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
> here: the streamer skips a put only when the key is present AND
> allowOverwrite is false, while UPDATE must put nothing when the key is
> *missing* — i.e., there's no way to emulate the cache's *replace*
> operation semantics with the streamer (update the value only if the key
> is present, otherwise do nothing).
>
> — Alex
>
> On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <dsetrak...@apache.org> wrote:
>
>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>
>>> I already expressed my concern - this is a counterintuitive approach,
>>> because without happens-before the pure streaming model can be applied
>>> only to independent chunks of data. It means that the mentioned ETL use
>>> case is not feasible - ETL always depends on implicit or explicit links
>>> between tables, and hence streaming is not applicable here. My question
>>> still stands - what products, except possibly Ignite, do this kind of
>>> JDBC streaming?
>>
>> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
>> DataStreamer.addData().
>>
>> JDBC batching and putAll() are absolutely identical. If you see it as
>> counter-intuitive, I would ask for a concrete example.
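The replace-vs-streamer distinction Alex describes maps directly onto the standard ConcurrentMap primitives. Here is a minimal JDK-only sketch (no Ignite API; the map operations are only analogies for the cache/streamer semantics discussed above):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReplaceSemantics {
    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "old");

        // MERGE ~ put: always writes (analogous to allowOverwrite = true).
        cache.put(1, "merged");

        // INSERT ~ putIfAbsent: writes only when the key is ABSENT
        // (analogous to allowOverwrite = false).
        cache.putIfAbsent(1, "ignored");   // key 1 present -> no-op
        cache.putIfAbsent(2, "inserted");  // key 2 absent  -> written

        // UPDATE ~ replace: writes only when the key is PRESENT.
        // The streamer has no such mode, which is Alex's point.
        cache.replace(1, "updated");       // key 1 present -> written
        cache.replace(3, "never");         // key 3 absent  -> no-op

        System.out.println(cache.get(1));         // updated
        System.out.println(cache.get(2));         // inserted
        System.out.println(cache.containsKey(3)); // false
    }
}
```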
>> As far as links between data, Ignite does not have foreign-key
>> constraints, so the DataStreamer can insert data in any order (but again,
>> not as part of a JDBC batch).
>>
>>> Another problem is that a connection-wide property doesn't fit well into
>>> the JDBC pooling model. Users will have to use different connections for
>>> streaming and non-streaming approaches.
>>
>> Using the DataStreamer is not possible within the JDBC batching paradigm,
>> period. I wish we could drop the high-level-feels-good discussions
>> altogether, because it seems like we are spinning wheels here.
>>
>> There is no way to use the streamer in a JDBC context unless we add a
>> connection flag. Again, if you disagree, I would prefer to see a concrete
>> example explaining why.
>>
>>> Please see how Oracle did that, this is precisely what I am talking about:
>>> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>> Two batching modes - one with explicit flush, the other with implicit
>>> flush, where Oracle decides on its own when it is better to communicate
>>> with the server. The batching mode can be declared globally or at the
>>> per-statement level. Simple and flexible.
>>>
>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>
>>>> Gents,
>>>>
>>>> As Sergi suggested, batching and streaming are very different
>>>> semantically.
>>>>
>>>> To use standard JDBC batching, all we need to do is convert it to a
>>>> cache.putAll() method, as semantically a putAll(...) call is identical
>>>> to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in
>>>> between, then we may have to break the batch into several chunks and
>>>> execute the update in between. The DataStreamer should not be used here.
>>>>
>>>> I believe that for streaming we need to add a special JDBC/ODBC
>>>> connection flag.
>>>> Whenever this flag is set to true, we should only allow INSERT or
>>>> single-UPDATE operations and use the DataStreamer API internally. All
>>>> operations other than INSERT or single-UPDATE should be prohibited.
>>>>
>>>> I think this design is semantically clear. Any objections?
>>>>
>>>> D.
>>>>
>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>
>>>>> If we use the Streamer, then we always have `happens-before` broken.
>>>>> This is OK, because the Streamer is meant for data loading, not for
>>>>> normal operation.
>>>>>
>>>>> We are not inventing any bicycles, just separating concerns: Batching
>>>>> and Streaming.
>>>>>
>>>>> My point here is that they should not depend on each other at all:
>>>>> Batching can work with or without Streaming, just as Streaming can
>>>>> work with or without Batching.
>>>>>
>>>>> Your proposal is a set of non-obvious rules for making them work
>>>>> together. I see no reason for these complications.
>>>>>
>>>>> Sergi
>>>>>
>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>
>>>>>> Sergi,
>>>>>>
>>>>>> If the user calls a single *execute()* operation, then most likely it
>>>>>> is not batching. We should not rely on the strange case where the
>>>>>> user performs batching without using the standard and well-adopted
>>>>>> JDBC batching API. The main problem with the streamer is that it is
>>>>>> async and hence breaks happens-before guarantees within a single
>>>>>> thread: a SELECT after an INSERT might not return the inserted value.
>>>>>>
>>>>>> Honestly, I do not really understand why we are trying to re-invent a
>>>>>> bicycle here. There is a standard API - let's just use it and make it
>>>>>> flexible enough to take advantage of IgniteDataStreamer if needed.
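A rough driver-side sketch of the rule Dmitriy proposes. Everything here is hypothetical (class name, flag handling, and the statement classification are illustrative, not actual Ignite driver code), and distinguishing a "single-UPDATE" from a general UPDATE is deliberately left out:

```java
import java.sql.SQLException;
import java.util.Locale;

// Hypothetical sketch: when the (assumed) streaming flag is on, only
// INSERT/UPDATE statements are routed to the streamer; everything else
// is rejected. Not real Ignite API.
public class StreamingModeRouter {
    private final boolean streamingEnabled;

    public StreamingModeRouter(boolean streamingEnabled) {
        this.streamingEnabled = streamingEnabled;
    }

    /** Returns true if the statement should go through the streamer. */
    public boolean routeToStreamer(String sql) throws SQLException {
        if (!streamingEnabled)
            return false; // normal execution path

        String head = sql.trim().toUpperCase(Locale.ROOT);

        // Note: a real implementation would also have to verify that an
        // UPDATE is a "single" update; that check is omitted here.
        if (head.startsWith("INSERT") || head.startsWith("UPDATE"))
            return true;

        throw new SQLException("Statement not allowed in streaming mode: " + sql);
    }

    public static void main(String[] args) throws SQLException {
        StreamingModeRouter router = new StreamingModeRouter(true);
        System.out.println(router.routeToStreamer("INSERT INTO t VALUES (1)")); // true
    }
}
```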
>>>>>> Is there any use case which is not covered by this solution? Or let
>>>>>> me ask from the opposite side - are there any well-known JDBC drivers
>>>>>> which perform batching/streaming from non-batched update statements?
>>>>>>
>>>>>> Vladimir.
>>>>>>
>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>>>
>>>>>>> Vladimir,
>>>>>>>
>>>>>>> I see no reason to forbid Streamer usage for non-batched statement
>>>>>>> execution. It is common that users already have their ETL tools, and
>>>>>>> you can't be sure whether they use batching or not.
>>>>>>>
>>>>>>> Alex,
>>>>>>>
>>>>>>> I guess we have to decide on Streaming first and then discuss
>>>>>>> Batching separately, OK? Because this decision may become important
>>>>>>> for the batching implementation.
>>>>>>>
>>>>>>> Sergi
>>>>>>>
>>>>>>> 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
>>>>>>>
>>>>>>>> Alex,
>>>>>>>>
>>>>>>>> In most cases JdbcQueryTask should be executed locally on the
>>>>>>>> client node started by the JDBC driver:
>>>>>>>>
>>>>>>>> JdbcQueryTask.QueryResult res = loc
>>>>>>>>     ? qryTask.call()
>>>>>>>>     : ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>>>>>>>>
>>>>>>>> Is this behavior still valid after introducing the DML
>>>>>>>> functionality?
>>>>>>>>
>>>>>>>> In cases when the user wants to execute a query on a specific node,
>>>>>>>> he should fully understand what he wants and what can go wrong.
>>>>>>>>
>>>>>>>> On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sergi,
>>>>>>>>>
>>>>>>>>> JDBC batching might work quite differently from driver to driver.
>>>>>>>>> Say, MySQL happily rewrites queries as I suggested at the
>>>>>>>>> beginning of this thread (it's not the only strategy, but one of
>>>>>>>>> the possible options) - and, BTW, I would like to hear at least an
>>>>>>>>> opinion about it.
>>>>>>>>>
>>>>>>>>> On your first approach, the section before the streamer: you
>>>>>>>>> suggest that we send a single statement and multiple parameter
>>>>>>>>> sets as a single query task, am I right? (Just to make sure that I
>>>>>>>>> got you properly.) If so, do you also mean that the API (namely
>>>>>>>>> JdbcQueryTask) between server and client should also change? Or
>>>>>>>>> should new API means be added to facilitate batching tasks?
>>>>>>>>>
>>>>>>>>> - Alex
>>>>>>>>>
>>>>>>>>> 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vlady...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Guys,
>>>>>>>>>>
>>>>>>>>>> I discussed this feature with Dmitriy and we came to the
>>>>>>>>>> conclusion that batching in JDBC and data streaming in Ignite
>>>>>>>>>> have different semantics and performance characteristics. Thus
>>>>>>>>>> they are independent features (they may work together or
>>>>>>>>>> separately, but that is another story).
>>>>>>>>>>
>>>>>>>>>> Let me explain.
>>>>>>>>>>
>>>>>>>>>> This is how JDBC batching works:
>>>>>>>>>> - Add N sets of parameters to a prepared statement.
>>>>>>>>>> - Manually execute the prepared statement.
>>>>>>>>>> - Repeat until all the data is loaded.
>>>>>>>>>>
>>>>>>>>>> This is how the data streamer works:
>>>>>>>>>> - Keep adding data.
>>>>>>>>>> - The streamer buffers the data and loads the buffered per-node
>>>>>>>>>>   batches when they are big enough.
>>>>>>>>>> - Close the streamer to make sure that everything is loaded.
>>>>>>>>>>
>>>>>>>>>> As you can see, there is a difference in the semantics of when we
>>>>>>>>>> send data: if our JDBC driver allows sending batches to nodes
>>>>>>>>>> without `execute` being called (and we would probably need to
>>>>>>>>>> make `execute` a no-op here), then we violate JDBC semantics; if
>>>>>>>>>> we disallow this behavior, then this batching will underperform.
>>>>>>>>>>
>>>>>>>>>> Thus I suggest keeping these features (JDBC Batching and JDBC
>>>>>>>>>> Streaming) separate.
>>>>>>>>>>
>>>>>>>>>> As I already said, they can work together: Batching will batch
>>>>>>>>>> parameters, on `execute` they will go to the Streamer in one
>>>>>>>>>> shot, and the Streamer will deal with the rest.
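The timing difference Sergi describes (batching sends nothing until `execute`, a streamer flushes as soon as a buffer fills and again on close) can be modeled in a few lines of plain Java. This toy buffer is purely an illustration of when data leaves the client; nothing here is Ignite code, and real streamer buffers are per-node and configurable:

```java
import java.util.ArrayList;
import java.util.List;

public class SendTiming {
    /** Toy model of a streamer: flushes whenever the buffer fills,
     *  and flushes the remainder on close(). */
    static class ToyStreamer {
        private final int bufSize;
        private final List<String> buf = new ArrayList<>();
        final List<List<String>> flushed = new ArrayList<>();

        ToyStreamer(int bufSize) { this.bufSize = bufSize; }

        void addData(String row) {
            buf.add(row);
            if (buf.size() >= bufSize)
                flush(); // data is sent WITHOUT any execute() call
        }

        void close() { // final flush, mirroring "close to make sure all is loaded"
            if (!buf.isEmpty())
                flush();
        }

        private void flush() {
            flushed.add(new ArrayList<>(buf));
            buf.clear();
        }
    }

    public static void main(String[] args) {
        ToyStreamer streamer = new ToyStreamer(2);
        streamer.addData("r1");
        streamer.addData("r2"); // buffer full -> first batch already sent here
        streamer.addData("r3");
        streamer.close();       // remainder sent on close
        System.out.println(streamer.flushed); // [[r1, r2], [r3]]
    }
}
```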
>>>>>>>>>>
>>>>>>>>>> Sergi
>>>>>>>>>>
>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> To my understanding, there are two possible approaches to
>>>>>>>>>>> batching in the JDBC layer:
>>>>>>>>>>>
>>>>>>>>>>> 1) Rely on the default batching API, specifically
>>>>>>>>>>> *PreparedStatement.addBatch()* [1] and others. This is a nice
>>>>>>>>>>> and clear API, users are used to it, and its adoption will
>>>>>>>>>>> minimize user code changes when migrating from other JDBC
>>>>>>>>>>> sources. We simply accumulate updates locally and then execute
>>>>>>>>>>> them all at once, with only a single network hop to the servers.
>>>>>>>>>>> *IgniteDataStreamer* can be used underneath.
>>>>>>>>>>>
>>>>>>>>>>> 2) Or we can have a separate connection flag which will route
>>>>>>>>>>> all INSERT/UPDATE/DELETE statements through the streamer.
>>>>>>>>>>>
>>>>>>>>>>> I prefer the first approach.
>>>>>>>>>>>
>>>>>>>>>>> Also, we need to keep in mind that the data streamer performs
>>>>>>>>>>> poorly when adding single key-value pairs, due to the high
>>>>>>>>>>> overhead of concurrency and other bookkeeping. Instead, it is
>>>>>>>>>>> better to pre-batch key-value pairs before giving them to the
>>>>>>>>>>> streamer.
>>>>>>>>>>>
>>>>>>>>>>> Vladimir.
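Vladimir's pre-batching advice amounts to chunking key-value pairs first, so the streamer receives whole batches per call instead of one pair at a time (the real streamer does accept collections via `IgniteDataStreamer.addData`). A JDK-only sketch with an illustrative chunk size:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreBatch {
    /** Splits a map into insertion-ordered chunks of at most chunkSize
     *  entries. Each resulting map would be handed to the streamer in
     *  one call rather than entry by entry. */
    static <K, V> List<Map<K, V>> chunk(Map<K, V> src, int chunkSize) {
        List<Map<K, V>> chunks = new ArrayList<>();
        Map<K, V> cur = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : src.entrySet()) {
            cur.put(e.getKey(), e.getValue());
            if (cur.size() == chunkSize) {
                chunks.add(cur);
                cur = new LinkedHashMap<>();
            }
        }
        if (!cur.isEmpty())
            chunks.add(cur); // trailing partial chunk
        return chunks;
    }

    public static void main(String[] args) {
        Map<Integer, String> data = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++)
            data.put(i, "v" + i);
        System.out.println(chunk(data, 2).size()); // 3 chunks: 2 + 2 + 1
    }
}
```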
>>>>>>>>>>>
>>>>>>>>>>> [1] https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>
>>>>>>>>>>>> One of the major improvements to DML has to be support for
>>>>>>>>>>>> batch statements. I'd like to discuss its implementation. The
>>>>>>>>>>>> suggested approach is to rewrite the given query, turning it
>>>>>>>>>>>> from a few INSERTs into a single statement and processing the
>>>>>>>>>>>> arguments accordingly. I suggest this because the whole point
>>>>>>>>>>>> of batching is to interact with the cluster as little as
>>>>>>>>>>>> possible and to make operations as condensed as possible, and
>>>>>>>>>>>> in the case of Ignite this means that we should send as few
>>>>>>>>>>>> JdbcQueryTasks as possible. And, as long as a query task holds
>>>>>>>>>>>> a single query and its arguments, this approach will not
>>>>>>>>>>>> require any changes to the current design and won't break
>>>>>>>>>>>> backward compatibility - all the dirty work of rewriting will
>>>>>>>>>>>> be done by the JDBC driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Without rewriting, we could introduce some new query task for
>>>>>>>>>>>> batch operations, but that would make it impossible to send
>>>>>>>>>>>> such requests from newer clients to older servers (say, servers
>>>>>>>>>>>> of version 1.8.0, which do not know about batching, let alone
>>>>>>>>>>>> older versions).
>>>>>>>>>>>> I'd like to hear comments and suggestions from the community.
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> - Alex

>>> --
>>> Vladimir Ozerov
>>> Senior Software Architect
>>> GridGain Systems
>>> www.gridgain.com
>>> *+7 (960) 283 98 40*
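For reference, the query rewrite Alexander proposes at the start of the thread (collapsing N executions of a single-row INSERT into one multi-row INSERT so that only one query task is sent) could look roughly like this. The helper is a naive illustration: it assumes a well-formed `INSERT ... VALUES (...)` statement and ignores quoting and other edge cases:

```java
public class BatchRewrite {
    /** Repeats the VALUES row list of a single-row INSERT paramSets times,
     *  turning N batched executions into one statement. */
    static String rewrite(String singleRowInsert, int paramSets) {
        int valuesIdx = singleRowInsert.toUpperCase().indexOf("VALUES");
        String head = singleRowInsert.substring(0, valuesIdx + "VALUES".length());
        String row = singleRowInsert.substring(valuesIdx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < paramSets; i++)
            sb.append(i == 0 ? " " : ", ").append(row);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The flat argument list would then carry paramSets * rowWidth values.
        System.out.println(rewrite("INSERT INTO t (a, b) VALUES (?, ?)", 3));
        // INSERT INTO t (a, b) VALUES (?, ?), (?, ?), (?, ?)
    }
}
```

This keeps the existing one-query-one-task wire format intact, which is exactly why the approach needs no protocol changes.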