[GitHub] [beam] vmarquez commented on pull request #10546: [BEAM-9008] Add CassandraIO readAll method

GitBox Mon, 23 Aug 2021 16:35:27 -0700


vmarquez commented on pull request #10546:
URL: https://github.com/apache/beam/pull/10546#issuecomment-904204549

> Can we expect any breaking user API changes?

No breaking API changes.

>Does this PR contain all necessary tests to test a new functionality and
corner cases?

Yes, the two tests I added test both using RingRanges and Queries.

>@vmarquez Did you try to run it on a real distributed cluster with a real
data? If yes, could you provide any details on this (if possible)?

Sure! The company I'm working for now has been running this current
version (of my fork) in production for six months. We've used it in various
dataflow pipelines on a cluster of 5 Scylla servers. For batch runs we've
gotten up to ~3M elements per second running with hundreds of dataflow worker
nodes, for streaming we've only used it on a much smaller sacle, but been doing
up to hundreds per second on two nodes.

>Do we expect any performance degradation (or improvement!) for
CassandraIO.read() since a Read part was mainly rewritten? It would be great to
compare.

For improvements, I think this basically makes some things possible that
just weren't before. If you have billions of rows it's almost untenable to
have to read the entire DB into ram and then filter if you only want a subset.
With readAll you can programatically generate thousands or tens of thousands of
queries and send it to readAll instead.

I don't expect any performance degredation (after I make one final change
Ismeal has asked for): Previously the group of ring ranges that were queried
were done with async calls. This doesn't make a lot of sense since in essence
it 'undoes' the splitting of the ring ranges (why not just make more splits and
have more sync calls?)

One of the ways we are able to achieve such good performance for large batch
jobs with this current change is due to being able to linearize the queries and
not overload a specific shard (or core, in Scylla's case).

>Have been all PR's comments addressed? If not, what is missing?

https://github.com/vmarquez/beam/blob/feature/BEAM-9008/cassandraio_readall/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java#L418

@iemejia wanted me to flatten the results here so to mitgiate the change
from async to sync DB calls: The idea being that instead of a single
`processElement` call having a number of ring ranges to query (but async),
instead we'll have multiple calls to `processElement` each of which is a sync
call.

I agree that is probably most similar to the current behavior and I'm fine
making that change. Happy to hear other thoughts or suggestions.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] vmarquez commented on pull request #10546: [BEAM-9008] Add CassandraIO readAll method

Reply via email to