Thanks for the response, Pablo. We're currently using our own custom ParDo connector for Cassandra (specialized to Scylla's sharding algorithm) that has a 'readAll' type option, and we're getting great results. Would you be up for taking an outside contribution that refactors the current CassandraIO connector to be of the PTransform/ParDo kind? I'm happy to give it a shot in the next week or so and send a PR on GitHub. My username is vmarquez on both ASF and GitHub. I'm also fine with writing up a JIRA describing how I'd want the more flexible connector to look.
--Vincent

On Wed, Oct 16, 2019 at 11:20 AM Pablo Estrada <pabl...@google.com> wrote:

> Hi Vincent,
> I think it makes sense to have some sort of `readAll` for CassandraIO that
> can receive multiple queries, and execute each one of them. This would also
> be consistent with other IOs that we have such as FileIOs.
> I suspect that doing this may require rearchitecting the whole IO from a
> BoundedSource-based one to a ParDo-based one - so a large change; and we'd
> need to make sure that we don't lose scalability due to that change.
>
> Adding Ismael/JB/Etienne who've done a lot of the work on CassandraIO.
> Thoughts?
> -P.
>
>
> On Mon, Oct 14, 2019 at 3:32 PM Vincent Marquez <vincent.marq...@gmail.com>
> wrote:
>
>> Hello Pablo, thank you for the response, and apologies for the delay. I
>> had some work and also wanted to prove out what I was proposing with our
>> own code at my workplace.
>>
>> Here is a small gist of what I'm proposing.
>>
>> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>>
>> I'm happy to explain more or even write up an official design doc if you
>> think that would be helpful explaining things.
>>
>> --Vincent
>>
>> On 2019/10/04 18:03:23, Pablo Estrada <p...@google.com> wrote:
>> > Hi Vincent!
>> > Do you think you could add some code snippets / pseudocode as to what
>> > this looks like? Feel free to do it on email, gist, google doc, etc?
>> > Best
>> > -P.
>> >
>> > On Thu, Oct 3, 2019 at 4:16 PM Vincent Marquez <vi...@gmail.com>
>> > wrote:
>> >
>> > > Currently the CassandraIO connector allows a user to specify a table,
>> > > and the CassandraSource object generates a list of queries based on
>> > > token ranges of the table, along with grouping them by the token
>> > > ranges.
>> > >
>> > > I often need to run (generated, sometimes a million+) queries against
>> > > a subset of a table. Instead of providing a filter, it is easier and
>> > > much more performant to supply a collection of queries along with
>> > > their tokens to both partition and group by, instead of letting
>> > > CassandraIO naively run over the entire table or with a simple filter.
>> > >
>> > > I propose in addition to the current method of supplying a table and
>> > > filter, also allowing the user to pass in a collection of queries and
>> > > tokens. The current way CassandraSource breaks up the table could be
>> > > modified to build on top of the proposed implementation to reduce
>> > > code duplication as well. If this sounds like an acceptable
>> > > alternative way of using the CassandraIO connector, I don't mind
>> > > giving it a shot with a pull request.
>> > >
>> > > If there is a better way of doing this, I'm eager to hear and learn.
>> > > Thanks for reading!
>> > >
>> > > > -- *-Vincent*
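
For readers following the thread, here is a rough sketch of the "queries plus tokens" idea from the original proposal. It is not the code in the linked gist and uses no Beam or Cassandra driver API. It is plain Python that assumes each query arrives already paired with a partition token, and that tokens span the signed 64-bit range used by Cassandra's Murmur3 partitioner. The names `TokenQuery`, `split_points`, and `group_by_token_range`, and the table `ks.events`, are hypothetical and exist only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, NamedTuple

# Assumption: tokens lie in the signed 64-bit range (Murmur3 partitioner).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


class TokenQuery(NamedTuple):
    """A user-supplied query paired with the token of the partition it hits."""
    token: int
    cql: str


def split_points(num_splits: int) -> List[int]:
    """Return the num_splits - 1 boundaries that cut the token range evenly."""
    span = MAX_TOKEN - MIN_TOKEN + 1
    return [MIN_TOKEN + (span * i) // num_splits for i in range(1, num_splits)]


def group_by_token_range(
    queries: List[TokenQuery], boundaries: List[int]
) -> Dict[int, List[TokenQuery]]:
    """Bucket queries by token range.

    Bucket i holds tokens t with boundaries[i - 1] <= t < boundaries[i];
    the first and last buckets are open toward MIN_TOKEN and MAX_TOKEN.
    """
    groups: Dict[int, List[TokenQuery]] = defaultdict(list)
    for query in queries:
        groups[bisect_right(boundaries, query.token)].append(query)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical queries; in practice there could be a million or more.
    queries = [
        TokenQuery(-2**63, "SELECT * FROM ks.events WHERE id = 'a'"),
        TokenQuery(-1, "SELECT * FROM ks.events WHERE id = 'b'"),
        TokenQuery(5, "SELECT * FROM ks.events WHERE id = 'c'"),
    ]
    for bucket, items in sorted(group_by_token_range(queries, split_points(4)).items()):
        print(bucket, [q.cql for q in items])
```

In a ParDo-style connector, each bucket would be the unit of work handed to a worker, so queries that hit neighbouring token ranges are executed together rather than scattered across a scan of the whole table. How buckets map to nodes or shards (for example, Scylla's sharding) is outside the scope of this sketch.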