[ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380010#comment-17380010
 ] 

Niel Markwick commented on BEAM-11722:
--------------------------------------

I did some experiments on this, comparing: 

 
 # Reading a table using a single thread
 # Reading using the pre-supplied partitions from the Spanner Partitioned Read 
API
(in my test, this supplied 38 partitions, but only 3 had data)
 # Splitting the key into 1000 key-ranges, shuffling them, and then reading 
these 1000 in parallel using multiple workers. 

1) was obviously the slowest. 

2) and 3) were about equally fast to read the actual data - implying that 
despite the relatively low number of populated partitions, this was the most 
efficient way to read the data from the database.

 

Background information: Spanner stores the tables by creating blocks of 
key-ranges and assigning ownership of these key-ranges (splits) to specific 
Spanner nodes. 

The Partitioned Read API returns partitions that correspond, roughly, to these 
splits, so when reading a partition, a single Spanner node is tasked with 
reading the data. 

 

Of course only having very few partition elements with actual rows may cause  
issues for the downstream pipeline if the work done on these rows takes a long 
time, and would benefit from additional parallelization.

 

I don't believe putting a reshuffle on the output of SpannerIO.Read by default 
is a good idea.
For very large reads (in the TB's), this will delay the time until the first 
record is emitted significantly, and may cause additional costs to shuffle the 
data. 

If the end user needs better parallelization of the SpannerIO.Read output, then 
they can add a Reshuffle themselves.

 

 

> SpannerIO.read() parallelism limited by partition count
> -------------------------------------------------------
>
>                 Key: BEAM-11722
>                 URL: https://issues.apache.org/jira/browse/BEAM-11722
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>            Reporter: Udi Meiri
>            Priority: P3
>              Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to