Sorry, "no relation w/JDBC" in my previous message should read "no relation w/JDBC batching".
— Alex

On Dec 10, 2016 at 1:52 PM, "Alexander Paschenko" <alexander.a.pasche...@gmail.com> wrote:

> Dima,
>
> I would like to point out that data streamer support had already been
> implemented in the course of the work on DML in 1.8, exactly as you are
> suggesting now (turned on via a connection flag; only MERGE was allowed —
> the data streamer can't do putIfAbsent stuff, right?; absolutely no
> relation w/JDBC), *but* that patch was reverted — on advice from Vlad
> which I believe had been agreed with you — so it didn't make it into 1.8
> after all.
>
> Also, while it's possible to maintain INSERT vs MERGE semantics using the
> streamer's allowOverwrite flag, I can't see how we could mimic UPDATE
> here: the streamer skips a put only when the key is present AND
> allowOverwrite is false, while UPDATE must put nothing when the key is
> *missing* — i.e., there's no way to emulate the cache's *replace*
> operation semantics with the streamer (update the value only if the key
> is present, otherwise do nothing).
>
> — Alex
>
> On Dec 9, 2016 at 10:00 PM, "Dmitriy Setrakyan" <dsetrak...@apache.org> wrote:
>
>> On Fri, Dec 9, 2016 at 12:45 AM, Vladimir Ozerov <voze...@gridgain.com> wrote:
>>
>>> I already expressed my concern - this is a counterintuitive approach,
>>> because without happens-before the pure streaming model can be applied
>>> only to independent chunks of data. It means that the mentioned ETL use
>>> case is not feasible - ETL always depends on implicit or explicit links
>>> between tables, and hence streaming is not applicable here. My question
>>> still stands - what products, except possibly Ignite, do this kind of
>>> JDBC streaming?
>>
>> Vova, we have 2 mechanisms in the product: IgniteCache.putAll() and
>> DataStreamer.addData().
>>
>> JDBC batching and putAll() are absolutely identical. If you see it as
>> counter-intuitive, I would ask for a concrete example.
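The replace-vs-streamer distinction Alex describes maps directly onto the standard ConcurrentMap primitives. Here is a minimal JDK-only sketch (no Ignite API; the map operations are only analogies for the cache/streamer semantics discussed above):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ReplaceSemantics {
    public static void main(String[] args) {
        ConcurrentMap<Integer, String> cache = new ConcurrentHashMap<>();
        cache.put(1, "old");

        // MERGE ~ put: always writes (analogous to allowOverwrite = true).
        cache.put(1, "merged");

        // INSERT ~ putIfAbsent: writes only when the key is ABSENT
        // (analogous to allowOverwrite = false).
        cache.putIfAbsent(1, "ignored");   // key 1 present -> no-op
        cache.putIfAbsent(2, "inserted");  // key 2 absent  -> written

        // UPDATE ~ replace: writes only when the key is PRESENT.
        // The streamer has no such mode, which is Alex's point.
        cache.replace(1, "updated");       // key 1 present -> written
        cache.replace(3, "never");         // key 3 absent  -> no-op

        System.out.println(cache.get(1));         // updated
        System.out.println(cache.get(2));         // inserted
        System.out.println(cache.containsKey(3)); // false
    }
}
```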
>> As far as links between data, Ignite does not have foreign-key
>> constraints, so the DataStreamer can insert data in any order (but again,
>> not as part of a JDBC batch).
>>
>>> Another problem is that a connection-wide property doesn't fit well into
>>> the JDBC pooling model. Users will have to use different connections for
>>> streaming and non-streaming approaches.
>>
>> Using the DataStreamer is not possible within the JDBC batching paradigm,
>> period. I wish we could drop the high-level-feels-good discussions
>> altogether, because it seems like we are spinning wheels here.
>>
>> There is no way to use the streamer in a JDBC context unless we add a
>> connection flag. Again, if you disagree, I would prefer to see a concrete
>> example explaining why.
>>
>>> Please see how Oracle did that, this is precisely what I am talking about:
>>> https://docs.oracle.com/cd/B28359_01/java.111/b31224/oraperf.htm#i1056232
>>> Two batching modes - one with explicit flush, the other with implicit
>>> flush, where Oracle decides on its own when it is better to communicate
>>> with the server. The batching mode can be declared globally or at the
>>> per-statement level. Simple and flexible.
>>>
>>> On Fri, Dec 9, 2016 at 4:40 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:
>>>
>>>> Gents,
>>>>
>>>> As Sergi suggested, batching and streaming are very different
>>>> semantically.
>>>>
>>>> To use standard JDBC batching, all we need to do is convert it to a
>>>> cache.putAll() method, as semantically a putAll(...) call is identical
>>>> to a JDBC batch. Of course, if we see an UPDATE with a WHERE clause in
>>>> between, then we may have to break the batch into several chunks and
>>>> execute the update in between. The DataStreamer should not be used here.
>>>>
>>>> I believe that for streaming we need to add a special JDBC/ODBC
>>>> connection flag.
>>>> Whenever this flag is set to true, we should only allow INSERT or
>>>> single-UPDATE operations and use the DataStreamer API internally. All
>>>> operations other than INSERT or single-UPDATE should be prohibited.
>>>>
>>>> I think this design is semantically clear. Any objections?
>>>>
>>>> D.
>>>>
>>>> On Thu, Dec 8, 2016 at 5:02 AM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>
>>>>> If we use the Streamer, then we always have `happens-before` broken.
>>>>> This is OK, because the Streamer is meant for data loading, not for
>>>>> normal operation.
>>>>>
>>>>> We are not inventing any bicycles, just separating concerns: Batching
>>>>> and Streaming.
>>>>>
>>>>> My point here is that they should not depend on each other at all:
>>>>> Batching can work with or without Streaming, just as Streaming can
>>>>> work with or without Batching.
>>>>>
>>>>> Your proposal is a set of non-obvious rules for making them work
>>>>> together. I see no reason for these complications.
>>>>>
>>>>> Sergi
>>>>>
>>>>> 2016-12-08 15:49 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>
>>>>>> Sergi,
>>>>>>
>>>>>> If the user calls a single *execute()* operation, then most likely it
>>>>>> is not batching. We should not rely on the strange case where the
>>>>>> user performs batching without using the standard and well-adopted
>>>>>> JDBC batching API. The main problem with the streamer is that it is
>>>>>> async and hence breaks happens-before guarantees within a single
>>>>>> thread: a SELECT after an INSERT might not return the inserted value.
>>>>>>
>>>>>> Honestly, I do not really understand why we are trying to re-invent a
>>>>>> bicycle here. There is a standard API - let's just use it and make it
>>>>>> flexible enough to take advantage of IgniteDataStreamer if needed.
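A rough driver-side sketch of the rule Dmitriy proposes. Everything here is hypothetical (class name, flag handling, and the statement classification are illustrative, not actual Ignite driver code), and distinguishing a "single-UPDATE" from a general UPDATE is deliberately left out:

```java
import java.sql.SQLException;
import java.util.Locale;

// Hypothetical sketch: when the (assumed) streaming flag is on, only
// INSERT/UPDATE statements are routed to the streamer; everything else
// is rejected. Not real Ignite API.
public class StreamingModeRouter {
    private final boolean streamingEnabled;

    public StreamingModeRouter(boolean streamingEnabled) {
        this.streamingEnabled = streamingEnabled;
    }

    /** Returns true if the statement should go through the streamer. */
    public boolean routeToStreamer(String sql) throws SQLException {
        if (!streamingEnabled)
            return false; // normal execution path

        String head = sql.trim().toUpperCase(Locale.ROOT);

        // Note: a real implementation would also have to verify that an
        // UPDATE is a "single" update; that check is omitted here.
        if (head.startsWith("INSERT") || head.startsWith("UPDATE"))
            return true;

        throw new SQLException("Statement not allowed in streaming mode: " + sql);
    }

    public static void main(String[] args) throws SQLException {
        StreamingModeRouter router = new StreamingModeRouter(true);
        System.out.println(router.routeToStreamer("INSERT INTO t VALUES (1)")); // true
    }
}
```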
>>>>>> Is there any use case which is not covered by this solution? Or let
>>>>>> me ask from the opposite side - are there any well-known JDBC drivers
>>>>>> which perform batching/streaming from non-batched update statements?
>>>>>>
>>>>>> Vladimir.
>>>>>>
>>>>>> On Thu, Dec 8, 2016 at 3:38 PM, Sergi Vladykin <sergi.vlady...@gmail.com> wrote:
>>>>>>
>>>>>>> Vladimir,
>>>>>>>
>>>>>>> I see no reason to forbid Streamer usage for non-batched statement
>>>>>>> execution. It is common that users already have their ETL tools, and
>>>>>>> you can't be sure whether they use batching or not.
>>>>>>>
>>>>>>> Alex,
>>>>>>>
>>>>>>> I guess we have to decide on Streaming first and then discuss
>>>>>>> Batching separately, OK? Because this decision may become important
>>>>>>> for the batching implementation.
>>>>>>>
>>>>>>> Sergi
>>>>>>>
>>>>>>> 2016-12-08 15:31 GMT+03:00 Andrey Gura <ag...@apache.org>:
>>>>>>>
>>>>>>>> Alex,
>>>>>>>>
>>>>>>>> In most cases JdbcQueryTask should be executed locally on the
>>>>>>>> client node started by the JDBC driver:
>>>>>>>>
>>>>>>>> JdbcQueryTask.QueryResult res = loc
>>>>>>>>     ? qryTask.call()
>>>>>>>>     : ignite.compute(ignite.cluster().forNodeId(nodeId)).call(qryTask);
>>>>>>>>
>>>>>>>> Is this behavior still valid after introducing the DML
>>>>>>>> functionality?
>>>>>>>>
>>>>>>>> In cases when the user wants to execute a query on a specific node,
>>>>>>>> he should fully understand what he wants and what can go wrong.
>>>>>>>>
>>>>>>>> On Thu, Dec 8, 2016 at 3:20 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sergi,
>>>>>>>>>
>>>>>>>>> JDBC batching might work quite differently from driver to driver.
>>>>>>>>> Say, MySQL happily rewrites queries as I suggested at the
>>>>>>>>> beginning of this thread (it's not the only strategy, but one of
>>>>>>>>> the possible options) - and, BTW, I would like to hear at least an
>>>>>>>>> opinion about it.
>>>>>>>>>
>>>>>>>>> On your first approach, the section before the streamer: you
>>>>>>>>> suggest that we send a single statement and multiple parameter
>>>>>>>>> sets as a single query task, am I right? (Just to make sure that I
>>>>>>>>> got you properly.) If so, do you also mean that the API (namely
>>>>>>>>> JdbcQueryTask) between server and client should also change? Or
>>>>>>>>> should new API means be added to facilitate batching tasks?
>>>>>>>>>
>>>>>>>>> - Alex
>>>>>>>>>
>>>>>>>>> 2016-12-08 15:05 GMT+03:00 Sergi Vladykin <sergi.vlady...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> Guys,
>>>>>>>>>>
>>>>>>>>>> I discussed this feature with Dmitriy and we came to the
>>>>>>>>>> conclusion that batching in JDBC and data streaming in Ignite
>>>>>>>>>> have different semantics and performance characteristics. Thus
>>>>>>>>>> they are independent features (they may work together or
>>>>>>>>>> separately, but that is another story).
>>>>>>>>>>
>>>>>>>>>> Let me explain.
>>>>>>>>>>
>>>>>>>>>> This is how JDBC batching works:
>>>>>>>>>> - Add N sets of parameters to a prepared statement.
>>>>>>>>>> - Manually execute the prepared statement.
>>>>>>>>>> - Repeat until all the data is loaded.
>>>>>>>>>>
>>>>>>>>>> This is how the data streamer works:
>>>>>>>>>> - Keep adding data.
>>>>>>>>>> - The streamer buffers the data and loads the buffered per-node
>>>>>>>>>>   batches when they are big enough.
>>>>>>>>>> - Close the streamer to make sure that everything is loaded.
>>>>>>>>>>
>>>>>>>>>> As you can see, there is a difference in the semantics of when we
>>>>>>>>>> send data: if our JDBC driver allows sending batches to nodes
>>>>>>>>>> without `execute` being called (and we would probably need to
>>>>>>>>>> make `execute` a no-op here), then we violate JDBC semantics; if
>>>>>>>>>> we disallow this behavior, then this batching will underperform.
>>>>>>>>>>
>>>>>>>>>> Thus I suggest keeping these features (JDBC Batching and JDBC
>>>>>>>>>> Streaming) separate.
>>>>>>>>>>
>>>>>>>>>> As I already said, they can work together: Batching will batch
>>>>>>>>>> parameters, on `execute` they will go to the Streamer in one
>>>>>>>>>> shot, and the Streamer will deal with the rest.
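The timing difference Sergi describes (batching sends nothing until `execute`, a streamer flushes as soon as a buffer fills and again on close) can be modeled in a few lines of plain Java. This toy buffer is purely an illustration of when data leaves the client; nothing here is Ignite code, and real streamer buffers are per-node and configurable:

```java
import java.util.ArrayList;
import java.util.List;

public class SendTiming {
    /** Toy model of a streamer: flushes whenever the buffer fills,
     *  and flushes the remainder on close(). */
    static class ToyStreamer {
        private final int bufSize;
        private final List<String> buf = new ArrayList<>();
        final List<List<String>> flushed = new ArrayList<>();

        ToyStreamer(int bufSize) { this.bufSize = bufSize; }

        void addData(String row) {
            buf.add(row);
            if (buf.size() >= bufSize)
                flush(); // data is sent WITHOUT any execute() call
        }

        void close() { // final flush, mirroring "close to make sure all is loaded"
            if (!buf.isEmpty())
                flush();
        }

        private void flush() {
            flushed.add(new ArrayList<>(buf));
            buf.clear();
        }
    }

    public static void main(String[] args) {
        ToyStreamer streamer = new ToyStreamer(2);
        streamer.addData("r1");
        streamer.addData("r2"); // buffer full -> first batch already sent here
        streamer.addData("r3");
        streamer.close();       // remainder sent on close
        System.out.println(streamer.flushed); // [[r1, r2], [r3]]
    }
}
```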
>>>>>>>>>>
>>>>>>>>>> Sergi
>>>>>>>>>>
>>>>>>>>>> 2016-12-08 14:16 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> To my understanding, there are two possible approaches to
>>>>>>>>>>> batching in the JDBC layer:
>>>>>>>>>>>
>>>>>>>>>>> 1) Rely on the default batching API, specifically
>>>>>>>>>>> *PreparedStatement.addBatch()* [1] and others. This is a nice
>>>>>>>>>>> and clear API, users are used to it, and its adoption will
>>>>>>>>>>> minimize user code changes when migrating from other JDBC
>>>>>>>>>>> sources. We simply accumulate updates locally and then execute
>>>>>>>>>>> them all at once, with only a single network hop to the servers.
>>>>>>>>>>> *IgniteDataStreamer* can be used underneath.
>>>>>>>>>>>
>>>>>>>>>>> 2) Or we can have a separate connection flag which will route
>>>>>>>>>>> all INSERT/UPDATE/DELETE statements through the streamer.
>>>>>>>>>>>
>>>>>>>>>>> I prefer the first approach.
>>>>>>>>>>>
>>>>>>>>>>> Also, we need to keep in mind that the data streamer performs
>>>>>>>>>>> poorly when adding single key-value pairs, due to the high
>>>>>>>>>>> overhead of concurrency and other bookkeeping. Instead, it is
>>>>>>>>>>> better to pre-batch key-value pairs before giving them to the
>>>>>>>>>>> streamer.
>>>>>>>>>>>
>>>>>>>>>>> Vladimir.
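Vladimir's pre-batching advice amounts to chunking key-value pairs first, so the streamer receives whole batches per call instead of one pair at a time (the real streamer does accept collections via `IgniteDataStreamer.addData`). A JDK-only sketch with an illustrative chunk size:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PreBatch {
    /** Splits a map into insertion-ordered chunks of at most chunkSize
     *  entries. Each resulting map would be handed to the streamer in
     *  one call rather than entry by entry. */
    static <K, V> List<Map<K, V>> chunk(Map<K, V> src, int chunkSize) {
        List<Map<K, V>> chunks = new ArrayList<>();
        Map<K, V> cur = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : src.entrySet()) {
            cur.put(e.getKey(), e.getValue());
            if (cur.size() == chunkSize) {
                chunks.add(cur);
                cur = new LinkedHashMap<>();
            }
        }
        if (!cur.isEmpty())
            chunks.add(cur); // trailing partial chunk
        return chunks;
    }

    public static void main(String[] args) {
        Map<Integer, String> data = new LinkedHashMap<>();
        for (int i = 0; i < 5; i++)
            data.put(i, "v" + i);
        System.out.println(chunk(data, 2).size()); // 3 chunks: 2 + 2 + 1
    }
}
```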
>>>>>>>>>>>
>>>>>>>>>>> [1] https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedStatement.html#addBatch--
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 8, 2016 at 1:21 PM, Alexander Paschenko <alexander.a.pasche...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>
>>>>>>>>>>>> One of the major improvements to DML has to be support for
>>>>>>>>>>>> batch statements. I'd like to discuss its implementation. The
>>>>>>>>>>>> suggested approach is to rewrite the given query, turning it
>>>>>>>>>>>> from a few INSERTs into a single statement and processing the
>>>>>>>>>>>> arguments accordingly. I suggest this because the whole point
>>>>>>>>>>>> of batching is to interact with the cluster as little as
>>>>>>>>>>>> possible and to make operations as condensed as possible, and
>>>>>>>>>>>> in the case of Ignite this means that we should send as few
>>>>>>>>>>>> JdbcQueryTasks as possible. And, as long as a query task holds
>>>>>>>>>>>> a single query and its arguments, this approach will not
>>>>>>>>>>>> require any changes to the current design and won't break
>>>>>>>>>>>> backward compatibility - all the dirty work of rewriting will
>>>>>>>>>>>> be done by the JDBC driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Without rewriting, we could introduce some new query task for
>>>>>>>>>>>> batch operations, but that would make it impossible to send
>>>>>>>>>>>> such requests from newer clients to older servers (say, servers
>>>>>>>>>>>> of version 1.8.0, which do not know about batching, let alone
>>>>>>>>>>>> older versions).
>>>>>>>>>>>> I'd like to hear comments and suggestions from the community.
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> - Alex

>>> --
>>> Vladimir Ozerov
>>> Senior Software Architect
>>> GridGain Systems
>>> www.gridgain.com
>>> *+7 (960) 283 98 40*
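For reference, the query rewrite Alexander proposes at the start of the thread (collapsing N executions of a single-row INSERT into one multi-row INSERT so that only one query task is sent) could look roughly like this. The helper is a naive illustration: it assumes a well-formed `INSERT ... VALUES (...)` statement and ignores quoting and other edge cases:

```java
public class BatchRewrite {
    /** Repeats the VALUES row list of a single-row INSERT paramSets times,
     *  turning N batched executions into one statement. */
    static String rewrite(String singleRowInsert, int paramSets) {
        int valuesIdx = singleRowInsert.toUpperCase().indexOf("VALUES");
        String head = singleRowInsert.substring(0, valuesIdx + "VALUES".length());
        String row = singleRowInsert.substring(valuesIdx + "VALUES".length()).trim();

        StringBuilder sb = new StringBuilder(head);
        for (int i = 0; i < paramSets; i++)
            sb.append(i == 0 ? " " : ", ").append(row);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The flat argument list would then carry paramSets * rowWidth values.
        System.out.println(rewrite("INSERT INTO t (a, b) VALUES (?, ?)", 3));
        // INSERT INTO t (a, b) VALUES (?, ?), (?, ?), (?, ?)
    }
}
```

This keeps the existing one-query-one-task wire format intact, which is exactly why the approach needs no protocol changes.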