vmarquez edited a comment on pull request #10546: URL: https://github.com/apache/beam/pull/10546#issuecomment-904204549
> Can we expect any breaking user API changes? No breaking API changes. >Does this PR contain all necessary tests to test a new functionality and corner cases? Yes, the two tests I added test both using readAll with a supplied RingRange (specified in the `Read<A>` passed in) and Queries (in the `Read<A>`). >@vmarquez Did you try to run it on a real distributed cluster with a real data? If yes, could you provide any details on this (if possible)? Sure! The company I'm working for now has been running this current version (of my fork) in production for ~six months. We've used it in various dataflow pipelines hitting a cluster of 5 Scylla servers. For batch runs we've gotten up to ~3M elements per second running with hundreds of dataflow worker nodes (not a common occurence but it was done!), for streaming we've only used it on a much smaller scale, but been doing up to hundreds per second on two nodes. >Do we expect any performance degradation (or improvement!) for CassandraIO.read() since a Read part was mainly rewritten? It would be great to compare. For improvements, I think this basically makes some things possible that just weren't before. Reading an entire table when the table might have billions of rows and then filtering in Ram when you only want a subset is impractical at best sometimes almost impossible. With readAll you can programatically generate thousands or tens of thousands of queries (or RingRange) for your subest and send it to readAll instead. https://github.com/vmarquez/beam/blob/feature/BEAM-9008/cassandraio_readall/sdks/java/io/cassandra/src/test/java/org/apache/beam/sdk/io/cassandra/CassandraIOTest.java#L198 I don't expect any performance degredation (after I make one final change Ismeal has asked for): Previously the group of ring ranges that were queried were done with async calls. This doesn't make a lot of sense to me since in essence it 'undoes' the grouping of the ring ranges imo. I talk about this in the final chance @iemejia asked for. One of the ways we are able to achieve such good performance for large batch jobs with this current change is due to being able to linearize the queries and not overload a specific shard (or core, in Scylla's case). Because we can pass in a Set of RingRanges to a single Read<A> that might hit the readAll, you can (with a bit more code not provided here) in essence 'group' multiple Read<A> 'queries' into a single Read<A> query (by deriving the ringrange a query hits). This is useful as if you are programatically generating 10,000 queries for instance, you may want to group by what shard they are hitting for optimal performance. >Have been all PR's comments addressed? If not, what is missing? https://github.com/vmarquez/beam/blob/feature/BEAM-9008/cassandraio_readall/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java#L418 @iemejia wanted me to flatten the results here so to mitgiate the change from async to sync DB calls: The idea being that instead of a single `processElement` call having a number of ring ranges to query (but async), instead we'll have multiple calls to `processElement` each of which is a sync call. I agree that is probably most similar to the current behavior and I'm fine making that change. Happy to hear other thoughts or suggestions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
