vincent marquez created BEAM-9008:
-------------------------------------

             Summary: add readAll() method to CassandraIO
                 Key: BEAM-9008
                 URL: https://issues.apache.org/jira/browse/BEAM-9008
             Project: Beam
          Issue Type: New Feature
          Components: io-java-cassandra
    Affects Versions: 2.16.0
            Reporter: vincent marquez


When querying a large cassandra database, it's often *much* more useful to 
programatically generate the queries needed to to be run rather than reading 
all partitions and attempting some filtering.  

As an example:
{code:java}
public class Event { 
   @PartitionKey(0) public UUID accountId;
   @PartitionKey(1)public String yearMonthDay; 
   @ClusteringKey public UUID eventId;  
   //other data...
}{code}
If there is ten years worth of data, you may want to only query one year's 
worth.  Here each token range would represent one 'token' but all events for 
the day. 
{code:java}
Set<UUID> accounts = getRelevantAccounts();
Set<String> dateRange = generateDateRange("2018-01-01", "2019-01-01");
PCollection<TokenRange> tokens = generateTokens(accounts, dateRange); 
{code}
 

 I propose an additional _readAll()_ PTransform that can take a PCollection of 
token ranges and can return a PCollection<T> of what the query would return. 

*Question: How much code should be in common between both methods?* 
Currently the read connector already groups all partitions into a List of Token 
Ranges, so it would be simple to refactor the current read() based method to a 
'ParDo' based one and have them both share the same function.  Reasons against 
sharing code between read and readAll
 * Not having the read based method return a BoundedSource connector would mean 
losing the ability to know the size of the data returned

 * Currently the CassandraReader executes all the grouped TokenRange queries 
*asynchronously* which is (maybe?) fine when all that's happening is splitting 
up all the partition ranges but terrible for executing potentially millions of 
queries. 

 Reasons _for_ sharing code would be simplified code base and that both of the 
above issues would most likely have a negligable performance impact. 



 

 






 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to