Hi Vincent,
I think it makes sense to have some sort of `readAll` for CassandraIO that
can accept multiple queries and execute each of them. This would also be
consistent with other IOs that we have, such as FileIO.
I suspect that doing this may require rearchitecting the whole IO from a
BoundedSource-based one to a ParDo-based one. That would be a large change,
and we'd need to make sure that we don't lose scalability because of it.
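
To make the idea concrete, here is a minimal sketch of the per-element shape
such a `readAll` might take. It is plain Python rather than Beam code, and
`read_all`, `fake_execute`, and the query strings are made up for
illustration; none of this is the existing CassandraIO API.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical types: a "query" is just a CQL string and a row is a dict.
Query = str
Row = Dict[str, int]


def read_all(queries: Iterable[Query],
             execute: Callable[[Query], Iterable[Row]]) -> Iterator[Row]:
    """Run each incoming query and emit its rows, one input element at a time.

    This mirrors the shape of a ParDo: the work for each element is
    independent of the others, so a runner could spread the queries
    across workers.
    """
    for query in queries:
        yield from execute(query)


def fake_execute(query: Query) -> List[Row]:
    """Stand-in for a Cassandra session; returns canned rows for the demo."""
    canned = {
        "SELECT * FROM t WHERE id = 1": [{"id": 1}],
        "SELECT * FROM t WHERE id = 2": [{"id": 2}, {"id": 2}],
    }
    return canned.get(query, [])


if __name__ == "__main__":
    demo_queries = [
        "SELECT * FROM t WHERE id = 1",
        "SELECT * FROM t WHERE id = 2",
    ]
    for row in read_all(demo_queries, fake_execute):
        print(row)
```

The executor is passed in so that the per-element logic stays separate from
connection handling. In a real connector that part would presumably live in
the DoFn's setup and teardown.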

Adding Ismael/JB/Etienne who've done a lot of the work on CassandraIO.
Thoughts?
-P.


On Mon, Oct 14, 2019 at 3:32 PM Vincent Marquez <[email protected]>
wrote:

> Hello Pablo, thank you for the response, and apologies for the delay.  I
> had some work and also wanted to prove out what I was proposing with our
> own code at my workplace.
>
> Here is a small gist of what I'm proposing.
>
> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>
> I'm happy to explain more or even write up an official design doc if you
> think that would be helpful explaining things.
>
> --Vincent
>
> On 2019/10/04 18:03:23, Pablo Estrada <[email protected]> wrote:
> > Hi Vincent!
> > Do you think you could add some code snippets / pseudocode as to what this
> > looks like? Feel free to do it on email, gist, google doc, etc?
> > Best
> > -P.
> >
> > On Thu, Oct 3, 2019 at 4:16 PM Vincent Marquez <[email protected]>
> > wrote:
> >
> > > Currently the CassandraIO connector allows a user to specify a table, and
> > > the CassandraSource object generates a list of queries based on token
> > > ranges of the table, along with grouping them by the token ranges.
> > >
> > > I often need to run (generated, sometimes a million+) queries against a
> > > subset of a table.  Instead of providing a filter, it is easier and much
> > > more performant to supply a collection of queries along with their tokens
> > > to both partition and group by, instead of letting CassandraIO naively run
> > > over the entire table or with a simple filter.
> > >
> > > I propose in addition to the current method of supplying a table and
> > > filter, also allowing the user to pass in a collection of queries and
> > > tokens.   The current way CassandraSource breaks up the table could be
> > > modified to build on top of the proposed implementation to reduce code
> > > duplication as well.  If this sounds like an acceptable alternative way of
> > > using the CassandraIO connector, I don't mind giving it a shot with a pull
> > > request.
> > >
> > > If there is a better way of doing this, I'm eager to hear and learn.
> > > Thanks for reading!
> > >
> >
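
For reference, the token-range splitting and grouping described in the quoted
message can be sketched roughly as below. This is an illustrative sketch and
not the CassandraSource implementation: it assumes the Murmur3Partitioner
token bounds, splits the ring evenly instead of following the cluster's real
ownership ranges, and `split_ring`, `range_query`, and `group_by_range` are
hypothetical names.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_TOKEN = -(2 ** 63)     # lower bound of the Murmur3Partitioner ring
MAX_TOKEN = 2 ** 63 - 1    # upper bound of the Murmur3Partitioner ring

TokenRange = Tuple[int, int]   # interpreted as (start, end]


def split_ring(num_splits: int) -> List[TokenRange]:
    """Cut the token ring into contiguous (start, end] ranges of similar size."""
    width = (MAX_TOKEN - MIN_TOKEN) // num_splits
    bounds = [MIN_TOKEN + i * width for i in range(num_splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table: str, partition_key: str, start: int, end: int) -> str:
    """Build the full-scan query for one token range of a table."""
    return (f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) > {start} "
            f"AND token({partition_key}) <= {end}")


def group_by_range(tagged_queries: Iterable[Tuple[int, str]],
                   ranges: List[TokenRange]) -> Dict[TokenRange, List[str]]:
    """Bucket user-supplied (token, query) pairs by the range owning the token.

    Assumes every token lies within [MIN_TOKEN, MAX_TOKEN].
    """
    ends = [end for _, end in ranges]
    groups: Dict[TokenRange, List[str]] = defaultdict(list)
    for token, query in tagged_queries:
        # First range whose inclusive end is >= token.
        groups[ranges[bisect_left(ends, token)]].append(query)
    return dict(groups)
```

With helpers like these, the table-scan path would feed `range_query` output
into the same grouping step as user-supplied queries, which is the code reuse
the proposal mentions.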
