[jira] [Work logged] (BEAM-4359) String encoding for a spanner mutation assumes that string length equals bytes length

ASF GitHub Bot (JIRA) Fri, 08 Feb 2019 09:24:10 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-4359?focusedWorklogId=196322&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-196322
 ]


ASF GitHub Bot logged work on BEAM-4359:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Feb/19 17:23
            Start Date: 08/Feb/19 17:23
    Worklog Time Spent: 10m 
      Work Description: nielm commented on issue #7747: [BEAM-4359] Check for 
null values when encoding Key columns.
URL: https://github.com/apache/beam/pull/7747#issuecomment-461794598
 
 
   > LGTM. Will merge after post-commit tests pass.
   
   The failing post-commit tests have nothing to do with this PR -- the failure is in 
.testGcsWriteWithKmsKey, and it is also failing at master.
   
   It looks like the failure is caused by PR #7780, which uses a `--kmsKey` option for 
the pipeline runner. The test error message is:
   
   ```
   java.lang.IllegalArgumentException: Class interface 
org.apache.beam.sdk.testing.TestPipelineOptions missing a property named 
'kmsKey'.
   ```
   
   I suspect that the option should be `--dataflowKmsKey` based on 
[GcsOptions.java in PR 
#7682](https://github.com/apache/beam/commit/8150d3be73e31595d757972c36e3b05fb6589fe8#diff-2c4be02e3be9f0d2cbfc0866bd1de5baR405)
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 196322)
    Time Spent: 2.5h  (was: 2h 20m)

> String encoding for a spanner mutation assumes that string length equals 
> bytes length
> -------------------------------------------------------------------------------------
>
>                 Key: BEAM-4359
>                 URL: https://issues.apache.org/jira/browse/BEAM-4359
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-gcp
>    Affects Versions: 2.4.0
>            Reporter: Sivanand
>            Assignee: Chamikara Jayalath
>            Priority: Major
>             Fix For: 2.7.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The bug is here:
> [https://github.com/apache/beam/blob/3ba96003d31ce98a54c0c51c1c0a9cf7c06e2fa2/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/MutationGroupEncoder.java#L231-L235]
> {code:java}
> case STRING: {
>         String str = value.getString();
>         VarInt.encode(str.length(), bos);
>         bos.write(str.getBytes(StandardCharsets.UTF_8));
>         break;
> }
> {code}
>  
> The code writes {{str.length()}} (the number of UTF-16 code units in the 
> Java String) as the length prefix, but then writes the string's UTF-8 bytes. 
> The two counts are not equal in general, because UTF-8 encodes a single code 
> point using 1 to 4 bytes.
> From Wikipedia: [https://en.wikipedia.org/wiki/UTF-8]
> {quote}UTF-8 is a variable width character encoding capable of encoding all 
> 1,112,064 valid code points in Unicode using one to four 8-bit bytes
> {quote}
> Code to recreate the issue:
> {code:java}
> /*
> Schema in spanner
> CREATE TABLE test (
>   id INT64,
>   testString STRING(MAX),
>   number INT64,
> ) PRIMARY KEY (id)
> */
>     import com.google.cloud.spanner.Mutation;
>     import com.google.common.collect.Lists;
>     import org.apache.beam.runners.direct.DirectRunner;
>     import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
>     import org.apache.beam.sdk.testing.TestPipeline;
>     import org.apache.beam.sdk.transforms.Create;
>     import org.apache.beam.sdk.transforms.DoFn;
>     import org.apache.beam.sdk.transforms.ParDo;
>     import org.junit.Rule;
>     import org.junit.Test;
>     
>     import java.io.Serializable;
>     import java.util.List;
>     
>     public class BeamSpannerTest implements Serializable {
>     
>         @Rule
>         public transient TestPipeline pipeline = TestPipeline.create();
>     
>         @Test
>         public void testSpanner() {
>             pipeline.getOptions().setRunner(DirectRunner.class);
>     
>             List<String> strdata = Lists.newArrayList("၃7");
>     
>     
>             pipeline.apply(
>                 Create.of(strdata)
>             ).apply(ParDo.of(new DoFn<String, Mutation>() {
>                 @ProcessElement
>                 public void processElement(ProcessContext c) {
>                     String value = c.element();
>                     c.output(Mutation.newInsertOrUpdateBuilder("test")
>                         .set("id").to(1)
>                         .set("testString").to(value)
>                         .set("number").to(10)
>                         .build());
>                 }
>             })
>            ).apply("Write to Spanner", SpannerIO.write()
>                     .withProjectId("my-project")
>                     .withInstanceId("spanner-instance")
>                     .withDatabaseId("test")
>             );
>     
>             pipeline.run();
>         }
>     }
> {code}
> After running the code, the value in the column {{number}} will be {{7043}} 
> and not {{10}}, because the bytes from the previous column {{testString}} have 
> spilled into {{number}}.
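
The value 7043 reported above can be reproduced by hand. The sketch below is not Beam's code: `writeVarInt`, `readVarInt`, and the two `encode...` helpers are hypothetical stand-ins, and it assumes the INT64 column is read back as a base-128 varint (7 data bits per byte, high bit as continuation flag). Under that assumption, the string "၃7" (U+1043 followed by "7") has length 2 but 4 UTF-8 bytes (E1 81 83 37), so a reader that trusts the prefix consumes only E1 81 and then decodes the leftover 83 37 as the next number: 3 + 55 * 128 = 7043.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {

    // Hypothetical base-128 varint writer (not Beam's VarInt class):
    // 7 data bits per byte, high bit set on every byte except the last.
    static void writeVarInt(long v, ByteArrayOutputStream out) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Decodes one varint that starts at the given offset.
    static long readVarInt(byte[] buf, int offset) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[offset++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Mirrors the reported bug: the prefix counts UTF-16 code units.
    static byte[] encodeWithCharLength(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(s.length(), out);
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }

    // One possible correction: the prefix counts the UTF-8 bytes actually written.
    static byte[] encodeWithByteLength(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(bytes.length, out);
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u10437"; // U+1043 (Myanmar digit three) followed by '7'
        System.out.println("chars: " + s.length());                                // 2
        System.out.println("bytes: " + s.getBytes(StandardCharsets.UTF_8).length); // 4

        byte[] buggy = encodeWithCharLength(s);
        int claimed = (int) readVarInt(buggy, 0); // prefix says 2
        // A reader skips the 1-byte prefix plus 'claimed' bytes, then decodes
        // whatever follows as the next column.
        System.out.println("spilled value: " + readVarInt(buggy, 1 + claimed));    // 7043

        byte[] fixed = encodeWithByteLength(s);
        System.out.println("fixed prefix: " + readVarInt(fixed, 0));               // 4
    }
}
```

With the byte-length prefix, a reader consumes all four string bytes before moving on, so nothing is left over to be misread as the next column.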



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
