[jira] [Work logged] (BEAM-4359) String encoding for a spanner mutation assumes that string length equals bytes length

ASF GitHub Bot (JIRA) Wed, 06 Feb 2019 05:02:24 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-4359?focusedWorklogId=195056&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-195056
 ]


ASF GitHub Bot logged work on BEAM-4359:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Feb/19 13:01
            Start Date: 06/Feb/19 13:01
    Worklog Time Spent: 10m 
      Work Description: nielm commented on pull request #7747: [BEAM-4359] 
Check for null values when encoding Key columns.
URL: https://github.com/apache/beam/pull/7747#discussion_r254260659
 
 

 ##########
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/MutationKeyEncoder.java
 ##########
 @@ -77,7 +77,7 @@ private void encodeKey(OrderedCode orderedCode, Mutation m) {
     Map<String, Value> mutationMap = mutationAsMap(m);
     for (SpannerSchema.KeyPart part : schema.getKeyParts(m.getTable())) {
       Value val = mutationMap.get(part.getField());
-      if (val.isNull()) {
+      if (val == null || val.isNull()) {
 
 Review comment:
   The database engine treats missing values as NULL; this code mimics that 
behavior. 
   
   Note also that this code is only concerned with Primary Key columns, and the 
encoded value is only used for sorting mutations in a bundle by table and 
primary key -- the value encoded here is never written to the database.
   
   NOT NULL constraints can be set on any column. If the Mutation is an insert 
and omits a column which is NOT NULL (or sets it to NULL), then the insert 
will fail, and the Mutation is appended to a PCollection of failed 
MutationGroups, which is output by the SpannerIO.Write transform.
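
   The missing-or-NULL check in the diff above can be illustrated in isolation. 
This is only a sketch: the Value class below is a hypothetical stand-in for 
com.google.cloud.spanner.Value, and treatAsNull is an illustrative helper name 
that does not exist in MutationKeyEncoder.

```java
import java.util.HashMap;
import java.util.Map;

public class MissingKeyAsNull {

    // Hypothetical stand-in for com.google.cloud.spanner.Value (assumption):
    // it wraps a possibly-null string and reports whether it holds NULL.
    static final class Value {
        private final String contents;

        Value(String contents) {
            this.contents = contents;
        }

        boolean isNull() {
            return contents == null;
        }
    }

    // A key column that is absent from the map is handled the same way as a
    // column explicitly set to NULL. Checking val == null first avoids the
    // NullPointerException that val.isNull() alone would throw.
    static boolean treatAsNull(Map<String, Value> mutationMap, String keyColumn) {
        Value val = mutationMap.get(keyColumn);
        return val == null || val.isNull();
    }

    public static void main(String[] args) {
        Map<String, Value> mutationMap = new HashMap<>();
        mutationMap.put("id", new Value("1"));
        mutationMap.put("name", new Value(null));

        System.out.println(treatAsNull(mutationMap, "id"));      // false
        System.out.println(treatAsNull(mutationMap, "name"));    // true
        System.out.println(treatAsNull(mutationMap, "missing")); // true
    }
}
```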
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 195056)
    Time Spent: 40m  (was: 0.5h)

> String encoding for a spanner mutation assumes that string length equals 
> bytes length
> -------------------------------------------------------------------------------------
>
>                 Key: BEAM-4359
>                 URL: https://issues.apache.org/jira/browse/BEAM-4359
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.4.0
>            Reporter: Sivanand
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: 2.7.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The bug is here:
> [https://github.com/apache/beam/blob/3ba96003d31ce98a54c0c51c1c0a9cf7c06e2fa2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/MutationGroupEncoder.java#L231-L235]
> {code:java}
> case STRING: {
>         String str = value.getString();
>         VarInt.encode(str.length(), bos);
>         bos.write(str.getBytes(StandardCharsets.UTF_8));
>         break;
> }
> {code}
>  
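> The code writes {{str.length()}} (the number of UTF-16 code units) as the 
> length prefix, but then writes the UTF-8 bytes of the string. It therefore 
> assumes that the UTF-8 byte count equals the string length. This is not true 
> because a single code point can take 1 to 4 bytes in UTF-8.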
> From wikipedia: [https://en.wikipedia.org/wiki/UTF-8]
> {quote}UTF-8 is a variable width character encoding capable of encoding all 
> 1,112,064 valid code points in Unicode using one to four 8-bit bytes
> {quote}
> Code to recreate the issue:
> {code:java}
> /*
> Schema in spanner
> CREATE TABLE test (
>   id INT64,
>   testString STRING(MAX),
>   number INT64,
> ) PRIMARY KEY (id)
> */
>     import com.google.cloud.spanner.Mutation;
>     import com.google.common.collect.Lists;
>     import org.apache.beam.runners.direct.DirectRunner;
>     import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
>     import org.apache.beam.sdk.testing.TestPipeline;
>     import org.apache.beam.sdk.transforms.Create;
>     import org.apache.beam.sdk.transforms.DoFn;
>     import org.apache.beam.sdk.transforms.ParDo;
>     import org.junit.Rule;
>     import org.junit.Test;
>     
>     import java.io.Serializable;
>     import java.util.List;
>     
>     public class BeamSpannerTest implements Serializable {
>     
>         @Rule
>         public transient TestPipeline pipeline = TestPipeline.create();
>     
>         @Test
>         public void testSpanner() {
>             pipeline.getOptions().setRunner(DirectRunner.class);
>     
>             List<String> strdata = Lists.newArrayList("၃7");
>     
>     
>             pipeline.apply(
>                 Create.of(strdata)
>             ).apply(ParDo.of(new DoFn<String, Mutation>() {
>                 @ProcessElement
>                 public void processElement(ProcessContext c) {
>                     String value = c.element();
>                     c.output(Mutation.newInsertOrUpdateBuilder("test")
>                         .set("id").to(1)
>                         .set("testString").to(value)
>                         .set("number").to(10)
>                         .build());
>                 }
>             })
>            ).apply("Write to Spanner", SpannerIO.write()
>                     .withProjectId("my-project")
>                     .withInstanceId("spanner-instance")
>                     .withDatabaseId("test")
>             );
>     
>             pipeline.run();
>         }
>     }
> {code}
> After running the code, the value in the column {{number}} will be {{7043}} 
> and not {{10}} because the bytes from the previous column {{testString}} have 
> spilled into the {{number}} column.
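
The length mismatch described in the issue can be checked without Spanner or 
Beam. The sketch below is an illustration only, not the fix that was applied 
in Beam: the class Utf8LengthPrefix and its encode/decode helpers are 
hypothetical, and a single length byte stands in for Beam's VarInt encoding. 
It uses the same two-character string as the test above, written with a 
Unicode escape.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class Utf8LengthPrefix {

    // Writes a one-byte length prefix followed by the UTF-8 bytes.
    // The prefix is the UTF-8 BYTE count, not String.length().
    // Assumption: a single prefix byte replaces VarInt, so the sketch
    // only supports strings of up to 127 UTF-8 bytes.
    static byte[] encode(String str) {
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("string too long for this sketch");
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(utf8.length);
        bos.write(utf8, 0, utf8.length);
        return bos.toByteArray();
    }

    // Reads the prefix and decodes exactly that many bytes as UTF-8.
    static String decode(byte[] encoded) {
        int byteLength = encoded[0];
        return new String(encoded, 1, byteLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1043 (MYANMAR DIGIT THREE) followed by the ASCII digit 7.
        String str = "\u10437";

        System.out.println(str.length());                                // 2
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 4

        // With the byte count as prefix, the string round-trips intact.
        System.out.println(decode(encode(str)).equals(str));             // true
    }
}
```

With {{str.length()}} as the prefix, a reader would consume only 2 of the 4 
bytes and interpret the remaining 2 as the start of the next field.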



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
