[
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799060#comment-16799060
]
Niel Markwick commented on BEAM-6887:
-------------------------------------
Implementation discussion:
This would have to be a separate transform from the existing SpannerIO.Write.
I propose renaming SpannerIO.Write to SpannerIO.BulkWrite.
A new class SpannerIO.Write would inherit from SpannerIO.BulkWrite in order to
preserve backward compatibility, but would be marked as deprecated.
Similarly, the builder function SpannerIO.write() would be renamed to
SpannerIO.bulkWrite(), and SpannerIO.write() would become a deprecated wrapper.
A new transform SpannerIO.SimpleWrite (or something with a better name) would
be added, along with a SpannerIO.simpleWrite() function.
This new transform would receive Mutations as input, attempt to write them to
Spanner, and then produce two output streams:
* Successfully written Mutations would go to the main output so that
processing can continue.
* Failed Mutations would go to a secondary output so that they can be
handled separately.
(This would all be duplicated for WriteGrouped)
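As a rough illustration, the per-element write with two outputs could be built on Beam's multi-output ParDo pattern (TupleTag / withOutputTags). This is only a hedged sketch: SimpleWriteFn and the tag names are made up for this comment, and SpannerAccessor stands in for whatever helper ends up managing the DatabaseClient; none of this is existing SpannerIO API.

```java
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.SpannerException;
import com.google.common.collect.ImmutableList;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

// Sketch of the DoFn that could sit behind the proposed SpannerIO.SimpleWrite.
// Each input Mutation is written in its own small transaction, then routed to
// one of two outputs depending on success.
class SimpleWriteFn extends DoFn<Mutation, Mutation> {
  // Main output: mutations that were written successfully.
  static final TupleTag<Mutation> SUCCESSFUL = new TupleTag<Mutation>() {};
  // Secondary output: mutations that failed, to be handled separately.
  static final TupleTag<Mutation> FAILED = new TupleTag<Mutation>() {};

  // Assumed helper that owns the Spanner client; initialized in @Setup (omitted).
  private transient SpannerAccessor accessor;

  @ProcessElement
  public void processElement(ProcessContext c) {
    Mutation m = c.element();
    try {
      // Write this single mutation at-least-once.
      accessor.getDatabaseClient().writeAtLeastOnce(ImmutableList.of(m));
      c.output(m); // main output: downstream processing can continue
    } catch (SpannerException e) {
      c.output(FAILED, m); // secondary output
    }
  }
}
```

The transform would then apply this with something like ParDo.of(new SimpleWriteFn()).withOutputTags(SimpleWriteFn.SUCCESSFUL, TupleTagList.of(SimpleWriteFn.FAILED)) and expose the two resulting PCollections.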
> Streaming Spanner Writer transform
> ----------------------------------
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Niel Markwick
> Assignee: Niel Markwick
> Priority: Minor
>
> At present, the
> [SpannerIO.Write(Grouped)|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L892]
> transform works by collecting an entire bundle of elements, sorting them by
> table/key, splitting the sorted list into batches (by size and number of
> cells modified), and then writing each batch to Spanner in a single
> transaction.
> It returns an
> [object|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteResult.java]
> containing:
> # a PCollection<Void> (the main output), which will have no elements but
> will be closed to signal that all the input elements have been written (which
> never happens in streaming mode, because the input is unbounded)
> # a PCollection<MutationGroup> of elements that failed to write.
>
> This transform is useful as a bulk sink for data because it efficiently
> writes large amounts of data.
> It is not at all useful as an intermediate step in a streaming pipeline,
> because it has no useful output in streaming mode.
> I propose that we have a separate Spanner Write transform which simply writes
> each input Mutation to the database, and then pushes successful Mutations
> onto its output.
> This would allow use in the middle of a streaming pipeline, where the flow
> would be
> * Some data streamed in
> * Converted to Spanner Mutations
> * Written to Spanner Database
> * Further processing where the values written to the Spanner Database are
> used.
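The streaming flow described in the quoted issue could look roughly like the fragment below. This is a hedged sketch only: SpannerIO.simpleWrite() is the transform proposed in this issue (it does not exist yet), and ToMutationFn, UseWrittenValuesFn, and the topic/instance/database names are placeholders.

```java
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Some data streamed in, converted to Mutations, written to Spanner, and the
// successfully written values used downstream.
PCollection<Mutation> written =
    pipeline
        .apply("ReadStream", PubsubIO.readStrings().fromTopic("projects/p/topics/t"))
        .apply("ToMutations", ParDo.of(new ToMutationFn()))
        .apply("WriteToSpanner",
            SpannerIO.simpleWrite()          // proposed in this issue, not real API
                .withInstanceId("my-instance")
                .withDatabaseId("my-db"));

// Further processing that can rely on the values already being in Spanner.
written.apply("Downstream", ParDo.of(new UseWrittenValuesFn()));
```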
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)