On 21 February 2017 at 13:20, Radim Vansa <rva...@redhat.com> wrote:
> On 02/21/2017 09:39 AM, Sanne Grinovero wrote:
>> On 21 February 2017 at 07:37, Dan Berindei <dan.berin...@gmail.com> wrote:
>>> On Mon, Feb 20, 2017 at 8:02 PM, Sanne Grinovero <sa...@infinispan.org> wrote:
>>>>
>>>> -1 to batch removal
>>>>
>>>> Frankly I've always been extremely negative about the fact that
>>>> batches are built on top of transactions. It's easy to find several
>>>> iterations of rants of mine on this mailing list, especially fierce
>>>> debates with Mircea. So I'd welcome a separation of these features.
>>>>
>>>> Yet, removing batching seems very wrong. I disagree that people should
>>>> use Transactions instead; for many use cases it's overkill, and for
>>>> others - and this is the main pain point I always had with the current
>>>> design - it might make sense to have a batch (or more than one) within
>>>> a transaction.
>>>> I have had practical problems with needing to "flush" a single batch
>>>> within a transaction as the size of the combined elements was getting
>>>> too large. That doesn't imply at all that I'm ready to commit.
>>>>
>>> WDYM by "flush" here? I have a feeling this is nothing like our
>>> batching ever was...
>>
>> I'm just listing points about why incorporating the batch concept with
>> transactions is not practical.
>>
>> I mean that when one has to write very large amounts of data, one needs to
>> break it up into smaller batches; *essentially* to flush the first batch
>> before starting the second one.
>> So that is unrelated to atomicity and consistency, as in practice one has
>> to split one business operation into multiple batches.
>
> You haven't explained what "flush" means. Since you separate that from
> atomicity/consistency, I assume that batches on a non-tx cache are just
> ordered putOrRemoveAll operations, immediately visible on flush without
> any atomicity.
> If you want batches in transactions, you probably mean
> that the changes are not visible until tx commits. So you are worried
> that all data modified within this transaction won't fit into single
> node memory? Or is there more to that? What's a "nested batch", then?
> I could see some benefits in removing repeatable-reads cached values.
Ok my bad, I thought "flush" was quite well understood as a concept, but I
guess you're right that it deserves a bit of elaboration on a non-typical
storage engine.

In my mind, it means "make sure that buffers are sent to their destination",
and this doesn't have to relate directly to a commit. I just want the data
entry to be shipped over to the heap which will ultimately contain it.

We have real use cases in various pieces of Infinispan code I wrote over the
years in which we need to write to multiple keys. For the sake of example,
let's use the Lucene Directory case, in which each Lucene chunk is encoded
as 3 different Infinispan entries (let's not debate why or we get heavily
off topic; trust me that it's a reasonable example of a business use case -
it's probably simpler than most other cases).

So I want to write a first chunk; in our code that looks like:

   startBatch
   put(chunk1/A, [some large value])
   put(chunk1/B, [some small metadata])
   put(chunk1/C, [some small metadata])
   endBatch

There is no reason to use a transaction; in fact we had to disable
transactions as some of these entries could be large. There also is no
reason for the batch, other than optimising the latency.

You also want to be able to write 2 chunks, and still be mindful of latency:

   startBatch
   put(chunk1/A,..)
   put(chunk1/B,..)
   put(chunk1/C,..)
   endBatch
   startBatch
   put(chunk2/A,..)
   put(chunk2/B,..)
   put(chunk2/C,..)
   endBatch

Again, you don't want to combine entries from chunk1 with chunk2 as they are
very large. The data belonging to chunk2 is not produced yet when we write
chunk1, so the explicit "flush" is helpful to keep the total used heap size
below some threshold; chunk sizes are configurable.

Also, it's not important in which order A, B and C are written, but we
definitely want to make sure that no entry from chunk2 is visible before any
entry of chunk1, so we prefer these batches to be separate and sequential.
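To make the "flush" semantics described above concrete, here is a minimal, self-contained sketch. This is a toy model I wrote for illustration, not Infinispan's API or implementation: a store where puts issued inside a batch are buffered locally and only shipped together on endBatch(), so splitting a large write into sequential batches bounds the local buffer to one chunk at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (assumption: simplified for illustration, not Infinispan code):
// puts between startBatch() and endBatch() are buffered locally and shipped
// to the backing store together on endBatch(). Each endBatch() releases the
// buffer, so sequential batches bound local memory to one chunk's worth.
class BatchingStore {
    private final Map<String, byte[]> store = new HashMap<>();
    private Map<String, byte[]> buffer; // non-null while a batch is open

    void startBatch() {
        if (buffer != null) throw new IllegalStateException("batch already open");
        buffer = new HashMap<>();
    }

    void put(String key, byte[] value) {
        if (buffer != null) buffer.put(key, value); // deferred until endBatch()
        else store.put(key, value);                 // immediate outside a batch
    }

    // "Flush": ship the buffered entries together and release the buffer.
    void endBatch() {
        store.putAll(buffer);
        buffer = null;
    }

    boolean contains(String key) { return store.containsKey(key); }
    int bufferedEntries() { return buffer == null ? 0 : buffer.size(); }
}
```

The point of the model is latency and memory, not atomicity: nothing here is transactional, yet the three entries of a chunk still travel together.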
Finally, such operations are usually *internal*, driven by some adaptation
framework which might need to be coupled to user transactions. So you
actually want to be able to do:

 - Start Tx1
   startBatch
   put(chunk1/A,..)
   put(chunk1/B,..)
   put(chunk1/C,..)
   endBatch
   startBatch
   put(chunk2/A,..)
   put(chunk2/B,..)
   put(chunk2/C,..)
   endBatch
 - Commit Tx1

Which is a very reasonable expectation from anyone using a database, yet
Infinispan can't do this as it would be "nesting".

In the specific case of the Lucene Directory, I don't expect people to want
to do this with transactions, and we definitely need an efficient way to do
batching w/o the performance penalty of transactions. Hibernate OGM would
be a better example of a framework which needs to write a sequence of
(hopefully batched) operations, and wrap them all as a single transaction.

>
>>>> @Pedro: the fact that some code is broken today is not relevant, when
>>>> there's need for such features. Like you suggest, it had bad premises
>>>> (build it on TX) so we should address that, but not throw it all out.
>>>>
>>> Infinispan never created nested batches:
>>
>> I know. I'm not describing the current implementation, I'm describing use
>> cases which are not addressed, or in which Infinispan is clumsy to use,
>> to suggest improvements.
>> And I'm not asking to have nested batches. Again, just pointing
>> out practical problems with the current batch design.
>>
>> It should be possible to run an efficient batch of operations within a
>> transaction.
>> Or a sequence of batches, all within the same transaction.
>> N.B. you could see that as a "nested batch" in the current twisted idea that
>> a batch is a transaction - which is what I'm arguing against.
>>
>>
>>> calling startBatch() when a
>>> batch was already associated with the current thread just incremented
>>> a refcount, and only the final endBatch() did any work.
>>> OTOH running a
>>> batch within a transaction always worked very much like suspending the
>>> current transaction, starting a new one, and committing it on
>>> endBatch(). So the only real difference between batching and using
>>> DummyTransactionManager is that batching is limited to one cache's
>>> operations, while DummyTransactionManager supports multiple resources.
>>>
>>>
>>>> @Bela is making spot-on objections based on use cases, which need to
>>>> be addressed in some way. The "atomical" operations still don't work
>>>> right[1] in Infinispan and that's a big usability problem.
>>>>
>>> Batching never was about sending updates asynchronously. We have
>>> putAllAsync() for that, which doesn't need transactions, and it's even
>>> slightly more efficient without transactions.
>>
>> If you think this way, it sounds like you give no value to application
>> performance: people need more than a couple of put operations.
>
> Removes?
>
> I am really asking out of curiosity, I hope these questions don't sound
> ironic.
>
> R.
>
>
>>> And atomical operations have bugs, yes, but I'm not sure how
>>> implementing a new kind of batching that isn't based on transactions
>>> would help with that.
>>>
>>>
>>>> +1 to remove async TX
>>>>
>>>> I actually like the concept but the API should be different; might as
>>>> well remove this for now.
>>>>
>>>> +1 to remove the Tree module
>>>>
>>>> I personally never used it as you all advised against, yet it's been
>>>> often requested by users; sometimes explicitly and sometimes less so,
>>>> yet people often have needs which would be solved by a good
>>>> Tree module.
>>>> I understand the reasons you all want to remove it: it's buggy. But I
>>>> believe the real reason is that it should have been built on top of a
>>>> proper batch API, and using real MVCC. In that case it wouldn't have
>>>> been buggy, nor too hard to maintain, and might have attracted way
>>>> more interest as well.
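Dan's description earlier in the thread - that a batch started inside a transaction behaves like suspending the transaction, running the batch as its own inner transaction, and committing it on endBatch() - can be sketched with a toy model. This is my illustration of the described semantics, not Infinispan code; it also makes the objection concrete: the batch's writes commit on their own even if the outer transaction later rolls back.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the described behaviour: a batch inside a transaction is an
// independent inner "transaction" committed on endBatch(), regardless of
// what happens to the outer transaction afterwards.
class Store {
    final Map<String, String> data = new HashMap<>();
    private Map<String, String> txWrites;     // outer transaction write set
    private Map<String, String> batchWrites;  // inner batch write set

    void begin()      { txWrites = new HashMap<>(); }
    void startBatch() { batchWrites = new HashMap<>(); } // outer tx "suspended"

    void put(String k, String v) {
        if (batchWrites != null) batchWrites.put(k, v);
        else if (txWrites != null) txWrites.put(k, v);
        else data.put(k, v);
    }

    // The batch commits on its own, independently of the outer transaction.
    void endBatch() { data.putAll(batchWrites); batchWrites = null; }

    void commit()   { data.putAll(txWrites); txWrites = null; }
    void rollback() { txWrites = null; } // does NOT undo committed batches
}
```

Running this, a rollback of the outer transaction discards its own writes but leaves the batch's writes in place - exactly the coupling of batches to transaction semantics being argued about above.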
>>> I think the fact that we haven't been able to build a "proper" batch
>>> API using real MVCC yet is proof to the contrary...
>>>
>>>
>>>> I'd remove it as a temporary measure: delete the bad stuff, but
>>>> hopefully it could be reintroduced, built on better principles, at
>>>> some point in the future?
>>>>
>>>> Thanks,
>>>> Sanne
>>>>
>>>> [1] - "right" as in user expectations and actual practical use, which
>>>> is currently different from the twisted definition of "right" that
>>>> the team has been using as an excuse ;-) I'll also prove that it
>>>> doesn't work right according to your own twisted specs, when I find
>>>> the time to finish some tests..
>>>>
>>>>
>>>> On 20 February 2017 at 16:48, Pedro Ruivo <pe...@infinispan.org> wrote:
>>>>>
>>>>> On 20-02-2017 16:12, Bela Ban wrote:
>>>>>>
>>>>>> On 20/02/17 17:06, Tristan Tarrant wrote:
>>>>>>> Hi guys, we discussed this a little bit in the past and this
>>>>>>> morning on IRC. Here are some proposed removals:
>>>>>>>
>>>>>>> - Remove the async transactional modes, as they are quite pointless
>>>>>>> - Remove batching: users should use transactions
>>>>>>
>>>>>> How do you make a bunch of modifications and send them asynchronously if
>>>>>> both batching *and* async TXs are getting removed?
>>>>>
>>>>> We are not removing features, we are removing broken code.
>>>>>
>>>>> Batching uses transactions, and async transactions don't make sense
>>>>> since Infinispan has to report to the TransactionManager.
>>>>>
>>>>> Our current async-tx is broken in that it starts to commit and
>>>>> reports OK to the transaction manager. If you have a topology change or
>>>>> a conflict, you will end up with inconsistent data.
>>>>>
>>>>> So, why are we keeping this code around?
>>>>>
>>>>> You can still move a transaction commit to another thread if you don't
>>>>> want to wait for its commit:
>>>>>
>>>>>    tm.begin()
>>>>>    doWork()
>>>>>    tx = tm.suspend()
>>>>>    new_thread {
>>>>>        tm.resume(tx)
>>>>>        tm.commit()
>>>>>    }
>>>>>
>>>>> The best thing I can think of is to keep the batching API and
>>>>> re-implement it to provide an endBatchAsync() that will do the above.
>>>>>
>>>>>> So if someone wants to apply a unit of work *atomically* (either all
>>>>>> modifications within that unit of work are applied, or none), what
>>>>>> alternatives exist?
>>>>>>
>>>>>>> - Remove the tree module: it doesn't work properly, and uses batching
>>>>>>>
>>>>>>> Please cast your votes
>>>>>>>
>>>>>>> Tristan
>
> --
> Radim Vansa <rva...@redhat.com>
> JBoss Performance Team

_______________________________________________
infinispan-dev mailing list
infinispan-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev
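[Editor's note] Pedro's suspend/resume pseudocode above can be modelled as runnable Java. The "ToyTm" below is a stand-in thread-local transaction manager invented for illustration - it is NOT javax.transaction - and only shows the shape of the pattern: the commit work moves to a worker thread while the caller continues.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy stand-in transaction, tracking only whether it was committed.
class Tx {
    final AtomicBoolean committed = new AtomicBoolean(false);
}

// Toy thread-local "transaction manager" (illustration only): begin()
// associates a Tx with the calling thread, suspend() detaches it,
// resume() re-attaches it on another thread, commit() finishes it there.
class ToyTm {
    private final ThreadLocal<Tx> current = new ThreadLocal<>();

    void begin()       { current.set(new Tx()); }
    Tx suspend()       { Tx tx = current.get(); current.remove(); return tx; }
    void resume(Tx tx) { current.set(tx); }
    void commit()      { current.get().committed.set(true); current.remove(); }
}

class AsyncCommitDemo {
    public static void main(String[] args) {
        ToyTm tm = new ToyTm();
        tm.begin();
        // doWork() would go here
        Tx tx = tm.suspend();                    // detach from the calling thread
        CompletableFuture<Void> done = CompletableFuture.runAsync(() -> {
            tm.resume(tx);                       // attach to the worker thread
            tm.commit();                         // caller does not wait for this
        });
        done.join();
    }
}
```

A hypothetical endBatchAsync(), as Pedro suggests, would package exactly this suspend-hand-off-commit sequence behind the existing batching API.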