[
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Niel Markwick updated BEAM-6887:
--------------------------------
Description:
At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an
entire bundle of elements, sorting them by table/key, splitting the sorted list
into batches (by size and by number of cells modified), and then writing each
batch to Spanner in a single transaction.
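A minimal sketch of how this batch-oriented transform is applied today (the builder methods are existing SpannerIO options; the table name and the project/instance/database IDs are placeholders):
{code:java}
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerWriteResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// A small bounded input; in practice the Mutations come from upstream steps.
PCollection<Mutation> mutations = p.apply("MakeMutations",
    Create.of(
        Mutation.newInsertOrUpdateBuilder("users")  // "users" is a placeholder table
            .set("id").to(1L)
            .set("name").to("alice")
            .build()));

// The existing transform: sorts the bundle by table/key, splits it into
// batches, and commits each batch in a single transaction.
SpannerWriteResult result = mutations.apply("WriteBatched",
    SpannerIO.write()
        .withProjectId("my-project")        // placeholder IDs
        .withInstanceId("my-instance")
        .withDatabaseId("my-database")
        .withBatchSizeBytes(1024 * 1024)    // upper bound on batch size
        .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES));
{code}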
It returns a SpannerWriteResult (SpannerWriteResult.java) containing (a sketch
of consuming both outputs follows this list):
# a PCollection<Void> (the main output), which carries no elements but is
closed to signal when all the input elements have been written (which never
happens in streaming mode, because the input is unbounded);
# a PCollection<MutationGroup> of elements that failed to write.
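Continuing the snippet above, a sketch of consuming those two outputs, assuming the SpannerWriteResult accessors getOutput() and getFailedMutations(); the step names are illustrative:
{code:java}
import org.apache.beam.sdk.io.gcp.spanner.MutationGroup;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// (1) The main output is an empty PCollection<Void>; its only use is as a
// completion signal. Wait.on() holds back a downstream step until all input
// elements have been written -- which, with unbounded input, never happens.
PCollection<String> afterWrites = p
    .apply("DoneMarker", Create.of("all-writes-done"))
    .apply("WaitForWrites", Wait.on(result.getOutput()));

// (2) The failed-mutations output carries the MutationGroups that could not
// be committed (populated under FailureMode.REPORT_FAILURES).
PCollection<String> failedTables = result.getFailedMutations()
    .apply("FailedTables",
        MapElements.into(TypeDescriptors.strings())
            .via((MutationGroup g) -> g.primary().getTable()));
{code}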
This transform is useful as a bulk sink because it writes large amounts of
data efficiently.
It is not useful as an intermediate step in a streaming pipeline, because it
produces no useful output in streaming mode.
I propose adding a separate Spanner write transform that simply writes each
input Mutation to the database and then emits the successfully written
Mutations on its output.
This would allow use in the middle of a streaming pipeline, where the flow
would be (see the sketch after this list):
* Some data streamed in
* Converted to Spanner Mutations
* Written to Spanner Database
* Further processing where the values written to the Spanner Database are used.
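A hypothetical end-to-end sketch of that flow. Note that SpannerIO.writeStreaming() does *not* exist today; it is a placeholder name for the proposed transform, and the Pub/Sub source, the "events" table and its columns are illustrative only:
{code:java}
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// 1. Some data streamed in (Pub/Sub chosen as an illustrative source).
PCollection<String> messages = p.apply("ReadStream",
    PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

// 2. Converted to Spanner Mutations ("events" is a placeholder table).
PCollection<Mutation> mutations = messages.apply("ToMutations",
    MapElements.into(TypeDescriptor.of(Mutation.class))
        .via((String msg) -> Mutation.newInsertOrUpdateBuilder("events")
            .set("id").to(msg.hashCode())
            .set("payload").to(msg)
            .build()));

// 3. Written to Spanner one Mutation at a time, emitting the successfully
//    written Mutations downstream. NOTE: writeStreaming() is the
//    HYPOTHETICAL transform proposed by this issue, not an existing API.
PCollection<Mutation> written = mutations.apply("WriteEach",
    SpannerIO.writeStreaming()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withDatabaseId("my-database"));

// 4. Further processing that relies on the values now being in the database.
written.apply("FurtherProcessing",
    MapElements.into(TypeDescriptors.strings())
        .via((Mutation m) -> "committed to table: " + m.getTable()));

p.run();
{code}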
was:
At present, the
[SpannerIO.Write(Grouped)|http://go/gh/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L892]
transform works by collecting an entire bundle of elements, sorting them by
table/key, splitting the sorted list into batches (by size and by number of
cells modified), and then writing each batch to Spanner in a single
transaction.
It returns an
[object|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteResult.java]
containing:
# a PCollection<Void> (the main output), which carries no elements but is
closed to signal when all the input elements have been written (which never
happens in streaming mode, because the input is unbounded);
# a PCollection<MutationGroup> of elements that failed to write.
This transform is useful as a bulk sink because it writes large amounts of
data efficiently.
It is not useful as an intermediate step in a streaming pipeline, because it
produces no useful output in streaming mode.
I propose adding a separate Spanner write transform that simply writes each
input Mutation to the database and then emits the successfully written
Mutations on its output.
This would allow use in the middle of a streaming pipeline, where the flow
would be:
* Some data streamed in
* Converted to Spanner Mutations
* Written to Spanner Database
* Further processing where the values written to the Spanner Database are used.
> Streaming Spanner Writer transform
> ----------------------------------
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Niel Markwick
> Assignee: Niel Markwick
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)