Alexander Dejanovski commented on BEAM-3485:

Hi [~aromanenko],

Yes, both BEAM-3424 and BEAM-3425 will be fixed by this PR.

For BEAM-3424, we will have at least one split per token range, which in most 
"modern" installs means 256 splits per node in the cluster. For clusters that 
do not use vnodes, we'll fall back to one split per node in the cluster if the 
size can't be estimated.
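
To illustrate "one split per token range", here is a rough sketch (not the 
code in the PR). TokenRangeSketch and TokenRange are hypothetical names, and 
the sketch assumes the ring tokens have already been fetched from the cluster 
metadata as BigInteger values:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TokenRangeSketch {

  /** A token range (start, end]; hypothetical type used only for illustration. */
  static final class TokenRange {
    final BigInteger start;
    final BigInteger end;

    TokenRange(BigInteger start, BigInteger end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "(" + start + ", " + end + "]";
    }
  }

  /** Builds one range per ring token: each token closes the range opened by its predecessor. */
  static List<TokenRange> rangesFromRingTokens(List<BigInteger> ringTokens) {
    List<BigInteger> sorted = new ArrayList<>(ringTokens);
    Collections.sort(sorted);
    List<TokenRange> ranges = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
      // The predecessor of the smallest token is the largest one (the ring wraps around).
      BigInteger previous = sorted.get((i + sorted.size() - 1) % sorted.size());
      ranges.add(new TokenRange(previous, sorted.get(i)));
    }
    return ranges;
  }
}
```

With vnodes enabled, the list of ring tokens holds one entry per vnode, so the 
number of ranges (and therefore splits) grows with the vnode count of each 
node.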

We could refine this and add:
 * a heuristic default of... 10 to 20 splits per node at least (for what it's 
 * and a way of enforcing a number of splits in the reader via .withSplits(...)

What do you think?
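
To make the proposal above more concrete, here is a minimal sketch of how the 
split count could be chosen. Everything in it is an assumption for discussion: 
SplitCountSketch, chooseSplitCount and the default of 10 are hypothetical, and 
nothing here is existing CassandraIO API.

```java
final class SplitCountSketch {

  // Assumed default: the lower end of the proposed 10 to 20 splits per node.
  static final int DEFAULT_SPLITS_PER_NODE = 10;

  /**
   * Picks the number of splits.
   *
   * @param requestedSplits value the user would pass to the proposed withSplits(...), or null
   * @param nodeCount number of nodes in the cluster
   * @param tokenRangeCount number of token ranges in the ring
   */
  static int chooseSplitCount(Integer requestedSplits, int nodeCount, int tokenRangeCount) {
    if (requestedSplits != null) {
      if (requestedSplits <= 0) {
        throw new IllegalArgumentException("requestedSplits must be positive");
      }
      // An explicit user setting always wins.
      return requestedSplits;
    }
    // Otherwise keep at least one split per token range, and at least the per-node default.
    return Math.max(tokenRangeCount, nodeCount * DEFAULT_SPLITS_PER_NODE);
  }
}
```

For a 3-node cluster without vnodes this would give 30 splits instead of 3, 
while a vnode cluster with 768 token ranges would keep its 768 splits.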

For BEAM-3425, the fix is included here: the size_estimates table stores 
start_token and end_token as strings, not longs, so those strings need to be 
converted to BigInteger.
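
As an illustration of that conversion (a sketch only, not the code in the 
PR): TokenParseSketch and its methods are hypothetical names, and the ring 
size of 2^64 assumes the Murmur3 partitioner.

```java
import java.math.BigInteger;

final class TokenParseSketch {

  // Assumption: Murmur3 partitioner, whose tokens span 2^64 values.
  static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

  /** Parses a token stored as text into a BigInteger. */
  static BigInteger parseToken(String token) {
    return new BigInteger(token.trim());
  }

  /** Width of the range (start, end], handling ranges that wrap around the ring. */
  static BigInteger rangeWidth(String startToken, String endToken) {
    BigInteger width = parseToken(endToken).subtract(parseToken(startToken));
    // A wrapping range has end <= start, so its raw difference is not positive.
    return width.signum() > 0 ? width : width.add(RING_SIZE);
  }
}
```

Working with BigInteger avoids the long overflow that subtracting two tokens 
near opposite ends of the ring would otherwise cause.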



> CassandraIO.read() splitting produces invalid queries
> -----------------------------------------------------
>                 Key: BEAM-3485
>                 URL: https://issues.apache.org/jira/browse/BEAM-3485
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-cassandra
>            Reporter: Eugene Kirpichov
>            Assignee: Alexey Romanenko
>            Priority: Major
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.

This message was sent by Atlassian JIRA