[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-14 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438364#comment-16438364
 ] 

Alexander Dejanovski commented on BEAM-3485:


I've just pushed a new version that allows the user to specify the minimum 
number of desired splits.
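
As a minimal sketch of the intended behaviour (not the code in the PR), the user-supplied minimum can be combined with the computed estimate by taking the larger of the two. The class and method names below are hypothetical, and treating a non-positive estimate as "could not be computed" is an assumption.

```java
final class SplitCount {
    private SplitCount() {}

    /**
     * Returns the number of splits to generate.
     *
     * @param estimatedSplits splits derived from cluster metadata; a value
     *     of 0 or less is assumed to mean the estimate could not be computed
     * @param minSplits minimum number of splits requested by the user
     */
    static int effectiveSplits(int estimatedSplits, int minSplits) {
        if (minSplits < 1) {
            throw new IllegalArgumentException("minSplits must be >= 1");
        }
        // A failed or low estimate never pushes the result below the
        // user-provided minimum.
        return Math.max(estimatedSplits, minSplits);
    }
}
```

With this shape, a cluster that already yields many token ranges is unaffected, while a failed estimate falls back to the user's value.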

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> [https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra/48131264?noredirect=1#comment83548442_48131264]
> As the question author points out, the error is likely that token($pk) should 
> be token(pk). This was likely masked by BEAM-3424 and BEAM-3425, and the 
> splitting code path effectively was never invoked, and was broken from the 
> first PR - so there are likely other bugs.
> When testing this issue, we must ensure good code coverage in an IT against a 
> real Cassandra instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437588#comment-16437588
 ] 

Alexander Dejanovski commented on BEAM-3485:


 1. From experience, most clusters run with 16 to 256 vnodes per node. 
Multiplied by the number of nodes, that will generate a lot of splits. Still, 
it would be good to be able to enforce a minimum number of splits if needed, 
so I'd be in favor of adding it as an optional input. If the computed number of 
splits is lower (or if Beam fails to compute it), then we should fall back to 
the user input. Tell me if you agree and I'll add it.
 2. It is enough for Murmur3, but it would be good to support the 
RandomPartitioner, which uses tokens between 0 and 2^127-1 and therefore falls 
outside the range of a Long.
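
To make the range argument concrete, here is a small sketch comparing the two token ranges. The Murmur3 bounds are those of a signed 64-bit long; the RandomPartitioner bounds are the ones quoted above. The class and constant names are illustrative only.

```java
import java.math.BigInteger;

final class PartitionerRanges {
    private PartitionerRanges() {}

    // Murmur3Partitioner tokens span the signed 64-bit range.
    static final BigInteger MURMUR3_MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MURMUR3_MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // RandomPartitioner tokens, as quoted in the comment: 0 to 2^127 - 1.
    static final BigInteger RANDOM_MIN = BigInteger.ZERO;
    static final BigInteger RANDOM_MAX =
        BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE);

    /** True if the token can be stored in a Java long without loss. */
    static boolean fitsInLong(BigInteger token) {
        // bitLength() excludes the sign bit, so 63 is the limit for a long.
        return token.bitLength() <= 63;
    }
}
```

Every Murmur3 token passes `fitsInLong`, while the upper RandomPartitioner bound does not, which is why a BigInteger-based representation covers both partitioners.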



[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437513#comment-16437513
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] 
 1. BEAM-3424 I agree with what you suggested as a split strategy. My only 
concern, and it was the original cause of the StackOverflow question, is that 
if a user runs a pipeline from a local machine and the Cassandra instance is 
located in a different network, then we can't estimate the number of splits 
reliably. For this case, we could perhaps provide an option to set the number 
of splits manually, though Beam discourages additional tuning knobs that are 
not strictly necessary. Do you think there could be another solution for this? 
Default options?
 2. BEAM-3425 I'm just curious whether _Long_ was not enough for that?



[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437344#comment-16437344
 ] 

Alexander Dejanovski commented on BEAM-3485:


Neat !

It's fixed.



[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437332#comment-16437332
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] Could you run this command locally? 
{{./gradlew -p sdks/java/io/cassandra build}}
I think it was caused by the unused field 
CassandraServiceImpl$CassandraReaderImpl.partitioner



[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437322#comment-16437322
 ] 

Alexander Dejanovski commented on BEAM-3485:


Apparently I have test failures triggered by findbugs here: 
[https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4104/]

Any pointers on how to run the same tests locally? Or on how to check the 
findbugs report from Jenkins?



[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437309#comment-16437309
 ] 

Alexander Dejanovski commented on BEAM-3485:


Hi [~aromanenko],

Yes, both BEAM-3424 and BEAM-3425 will be fixed by this PR.

For BEAM-3424, we will have at least one split per token range, which in most 
"modern" installs means 256 splits per node in the cluster. For clusters that 
do not use vnodes, we'll have one split per node in the cluster in case the 
size can't be estimated.

We could refine this and add: 
 * a heuristic default of at least 10 to 20 splits per node (for what it's 
worth)
 * a way of enforcing a number of splits in the reader, e.g. .withSplits(...)

What do you think ?

For BEAM-3425, it's fixed here: the size_estimates table stores start_token 
and end_token as strings, not longs, so those strings need to be converted to 
BigInteger.
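
To illustrate why BigInteger helps here, a minimal sketch (not the PR's code) that parses two token strings and computes how many tokens the range (start, end] covers on the Murmur3 ring. The class and method names are hypothetical, and treating start == end as the full ring is an assumption.

```java
import java.math.BigInteger;

final class TokenRangeSize {
    private TokenRangeSize() {}

    // The Murmur3 ring holds 2^64 distinct tokens.
    static final BigInteger MURMUR3_RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /**
     * Number of tokens covered by (startToken, endToken], with both bounds
     * given as decimal strings.
     */
    static BigInteger span(String startToken, String endToken) {
        BigInteger start = new BigInteger(startToken);
        BigInteger end = new BigInteger(endToken);
        BigInteger diff = end.subtract(start);
        // A range whose end is not after its start wraps around the ring;
        // start == end is treated here as covering the whole ring.
        return diff.signum() > 0 ? diff : diff.add(MURMUR3_RING_SIZE);
    }
}
```

For example, `span("-9223372036854775808", "9223372036854775807")` yields 2^64 - 1, a value that the same subtraction on Java longs would overflow.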

 

 

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexey Romanenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437245#comment-16437245
 ] 

Alexey Romanenko commented on BEAM-3485:


[~adejanovski] Thank you for working on this. I assume that your PR fixes the 
other two issues as well, BEAM-3424 and BEAM-3425. Is that correct?

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-04-13 Thread Alexander Dejanovski (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437089#comment-16437089
 ] 

Alexander Dejanovski commented on BEAM-3485:


I've created a PR to fix the split generation: 
[https://github.com/apache/beam/pull/5124]

There are other issues with how connections are established, and more precisely 
with how many are created: there should be a single Cluster object per physical 
cluster and JVM, while currently we're creating a new Cluster object each time 
one is needed.
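
One possible shape for this, sketched under assumptions: a registry that creates one instance per cluster key and hands the same instance to every caller. The type parameter C stands in for the driver's Cluster type, and the class name, the string key (for instance a normalised contact point list), and the factory are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SharedClusters<C> {
    private final ConcurrentHashMap<String, C> byKey = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    /** @param factory builds a new cluster handle for a given key */
    SharedClusters(Function<String, C> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance for the key, creating it on first use. */
    C get(String clusterKey) {
        // computeIfAbsent runs the factory at most once per key, even when
        // several threads ask for the same cluster concurrently.
        return byKey.computeIfAbsent(clusterKey, factory);
    }
}
```

Holding one such registry in a static field would give the "one per physical cluster and JVM" behaviour; closing the shared instances at shutdown is left out of the sketch.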

I'll create follow-up tickets to handle this and to expand the capabilities of 
both the reader (ability to add a custom where clause) and the writer (allow 
the use of PreparedStatements instead of relying on the mapper).

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-cassandra
>Reporter: Eugene Kirpichov
>Assignee: Alexey Romanenko
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-01-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328291#comment-16328291
 ] 

Jean-Baptiste Onofré commented on BEAM-3485:


I will fix that. Thanks. I'm surprised, as we provided an IT (but it was 
probably not used).

> CassandraIO.read() splitting produces invalid queries
> -
>
> Key: BEAM-3485
> URL: https://issues.apache.org/jira/browse/BEAM-3485
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Eugene Kirpichov
>Assignee: Jean-Baptiste Onofré
>Priority: Major


[jira] [Commented] (BEAM-3485) CassandraIO.read() splitting produces invalid queries

2018-01-16 Thread Aleksandr Sosenko (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328252#comment-16328252
 ] 

Aleksandr Sosenko commented on BEAM-3485:
-

Should it really be token(pk)? I made it work by hardcoding the real primary 
key column name instead of $pk (the question on StackOverflow is mine). Is there 
some magic that substitutes the real primary key column name for pk?

I can't create a pull request since I don't know how to elegantly retrieve the 
primary key column name in the context of the split function. I could make an 
additional request to Cassandra for the table description, but it seems 
overcomplicated to me. Is there another way to get the primary key column name 
in the context of the function?
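
For illustration, a minimal sketch of building the range query once the partition key column names are known. The class and method names are hypothetical. How the column names are obtained is left open; the DataStax Java driver's table metadata should be able to supply them without a separate query, but that is an assumption to check against the driver version in use.

```java
import java.math.BigInteger;
import java.util.List;

final class TokenRangeQuery {
    private TokenRangeQuery() {}

    /**
     * Builds a CQL query selecting the rows whose token falls in
     * (startToken, endToken], using the real partition key column names.
     */
    static String build(
            String keyspace,
            String table,
            List<String> partitionKeyColumns,
            BigInteger startToken,
            BigInteger endToken) {
        if (partitionKeyColumns.isEmpty()) {
            throw new IllegalArgumentException("at least one partition key column is required");
        }
        // token() takes every partition key column, in declaration order.
        String pk = String.join(", ", partitionKeyColumns);
        return String.format(
            "SELECT * FROM %s.%s WHERE token(%s) > %s AND token(%s) <= %s",
            keyspace, table, pk, startToken, pk, endToken);
    }
}
```

With a single partition key column `id`, this produces a query of the form `... WHERE token(id) > -100 AND token(id) <= 100`, with no `$pk` placeholder or literal `pk` left in the statement.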
