[
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380010#comment-17380010
]
Niel Markwick commented on BEAM-11722:
--------------------------------------
I did some experiments on this, comparing:
# Reading a table using a single thread
# Reading using the pre-supplied partitions from the Spanner Partitioned Read
API
(in my test, this supplied 38 partitions, but only 3 had data)
# Splitting the key into 1000 key-ranges, shuffling them, and then reading
these 1000 in parallel using multiple workers.
1) was obviously the slowest.
2) and 3) were about equally fast to read the actual data - implying that
despite the relatively low number of populated partitions, this was the most
efficient way to read the data from the database.
Background information: Spanner stores the tables by creating blocks of
key-ranges and assigning ownership of these key-ranges (splits) to specific
Spanner nodes.
The Partitioned Read API returns partitions that correspond, roughly, to these
splits, so when reading a partition, a single Spanner node is tasked with
reading the data.
Of course only having very few partition elements with actual rows may cause
issues for the downstream pipeline if the work done on these rows takes a long
time, and would benefit from additional parallelization.
I don't believe putting a reshuffle on the output of SpannerIO.Read by default
is a good idea.
For very large reads (in the TB's), this will delay the time until the first
record is emitted significantly, and may cause additional costs to shuffle the
data.
If the end user needs better parallelization of the SpannerIO.Read output, then
they can add a Reshuffle themselves.
> SpannerIO.read() parallelism limited by partition count
> -------------------------------------------------------
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp
> Reporter: Udi Meiri
> Priority: P3
> Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible:
> {code} Note: This hint is currently ignored by sessions.partitionQuery and
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource
> usage.
> cc: [~chamikara][~nielm]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)