[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465238705



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   Done! `nested` is no longer specified.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465238156



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false
+examples:
+  "\x01\x00\x00\x00\x00\x00": {f_map:{}}
+  
"\x01\x00\x00\x00\x00\x02\x03foo\x01\xa9F\x03bar\x01\xff\xff\xff\xff\xff\xff\xff\xff\x7f":
 {f_map:{"foo": 9001, "bar": 9223372036854775807}}
+  
"\x01\x00\x00\x00\x00\x04\neverything\x00\x02is\x00\x05null!\x00\r\xc2\xaf\\_(\xe3\x83\x84)_/\xc2\xaf\x00":
 {f_map:{"everything":null, "is": null, "null!": null, "¯\\_(ツ)_/¯": null}}

Review comment:
   Done!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465216120



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   I bet I ran into this way back when and I just opted to pick a (bad) 
value for `nested` rather than uncovering this issue.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465215557



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   Found the issue in Java. It's a bug in CommonCodersTest that occurs when 
testing both nested and un-nested. It happens because we share an object for 
the expected value:
   
https://github.com/apache/beam/blob/1bf60b99741ff4d0a8b88e93f5241379e1f3962f/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java#L193-L196
   
   and it get's mutated when parsing Rows:
   
https://github.com/apache/beam/blob/1bf60b99741ff4d0a8b88e93f5241379e1f3962f/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java#L380
   
   It's a pretty easy fix I can apply here. Is it ok if the row test case just 
doesn't specify `nested`, or are you thinking it should be `nested: true`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465204223



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false
+examples:
+  "\x01\x00\x00\x00\x00\x00": {f_map:{}}
+  
"\x01\x00\x00\x00\x00\x02\x03foo\x01\xa9F\x03bar\x01\xff\xff\xff\xff\xff\xff\xff\xff\x7f":
 {f_map:{"foo": 9001, "bar": 9223372036854775807}}
+  
"\x01\x00\x00\x00\x00\x04\neverything\x00\x02is\x00\x05null!\x00\r\xc2\xaf\\_(\xe3\x83\x84)_/\xc2\xaf\x00":
 {f_map:{"everything":null, "is": null, "null!": null, "¯\\_(ツ)_/¯": null}}

Review comment:
   Ack, thanks





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465204097



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   > So, I re-iterate: Why is nested: false, instead of nested true if the 
coding is going to be identical in both context?
   
   You're not re-iterating. You're asking the same question with a lot of new 
and different context.
   
   I don't think there's a good reason. If I had to guess - when I added the 
first row coder test I was new to Beam and coders and I didn't really 
understand nested vs un-nested, so I just picked one. I'm trying to change it 
now and it looks like Java is failing some of the nested=true cases - that's 
probably why. Looking into the failures now.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465194296



##
File path: sdks/python/apache_beam/coders/coder_impl.py
##
@@ -530,6 +530,88 @@ def estimate_size(self, unused_value, nested=False):
 return 1
 
 
+class MapCoderImpl(StreamCoderImpl):
+  """For internal use only; no backwards-compatibility guarantees.
+
+  A coder for typing.Mapping objects."""
+  def __init__(
+  self,
+  key_coder,  # type: CoderImpl
+  value_coder  # type: CoderImpl
+  ):
+self._key_coder = key_coder
+self._value_coder = value_coder
+
+  def encode_to_stream(self, value, out, nested):
+size = len(value)
+out.write_bigendian_int32(size)
+for i, kv in enumerate(value.items()):
+  key, value = kv
+  last = i == size - 1

Review comment:
   Good point, done.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465180802



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   Robert was saying that the coders used to encode each attribute value 
are always nested. In the Java implementation we always call the version of 
`encode` which does not accept a context (as I understand it most 
implementations default to the nested encoding in this case):
   
https://github.com/apache/beam/blob/1bf60b99741ff4d0a8b88e93f5241379e1f3962f/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoderGenerator.java#L267-L272
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-04 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r465180802



##
File path: 
model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
##
@@ -384,3 +384,31 @@ nested: false
 examples:
   "\x02\x01\x02\x01": {f_bool: True, f_bytes: null}
   "\x02\x00\x00\x04ab\x00c": {f_bool: False, f_bytes: "ab\0c"}
+
+---
+
+# Binary data generated with the python SDK:
+#
+# import typing
+# import apache_beam as beam
+# class Test(typing.NamedTuple):
+#   f_map: typing.Mapping[str,int]
+# schema = beam.typehints.schemas.named_tuple_to_schema(Test)
+# coder = beam.coders.row_coder.RowCoder(schema)
+# print("payload = %s" % schema.SerializeToString())
+# examples = (Test(f_map={}),
+# Test(f_map={"foo": 9001, "bar": 9223372036854775807}),
+# Test(f_map={"everything": None, "is": None, "null!": None, 
"¯\_(ツ)_/¯": None}))
+# for example in examples:
+#   print("example = %s" % coder.encode(example))
+coder:
+  urn: "beam:coder:row:v1"
+  # f_map: map
+  payload: 
"\n\x15\n\x05f_map\x1a\x0c*\n\n\x02\x10\x07\x12\x04\x08\x01\x10\x04\x12$d8c8f969-14e6-457f-a8b5-62a1aec7f1cd"
+  # map ordering is non-deterministic
+  non_deterministic: True
+nested: false

Review comment:
   Robert was saying that the coders used to encode each attribute value 
are always nested. In the Java implementation we always call the version of 
`encode` which does not accept a context (as I understand it most 
implementations default to the nested encoding in this case):
   ```
   
https://github.com/apache/beam/blob/1bf60b99741ff4d0a8b88e93f5241379e1f3962f/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoderGenerator.java#L267-L272
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-08-03 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r464732307



##
File path: model/pipeline/src/main/proto/beam_runner_api.proto
##
@@ -855,10 +855,21 @@ message StandardCoders {
 // BOOLEAN:   beam:coder:bool:v1
 // BYTES: beam:coder:bytes:v1
 //   ArrayType:   beam:coder:iterable:v1 (always has a known length)
-//   MapType: not yet a standard coder (BEAM-7996)
+//   MapType: not a standard coder, specification defined below.
 //   RowType: beam:coder:row:v1
 //   LogicalType: Uses the coder for its representation.
 //
+// The MapType is encoded by:
+//   - An INT32 representing the size of the map (N)
+//   - Followed by N interleaved keys and values, encoded with their
+// corresponding coder.
+//
+// Nullable types in container types (ArrayType, MapType) are encoded by:
+//   - A one byte null indicator, 0x00 for null values, or 0x01 for present
+// values.
+//   - For present values the null indicator is followed by the value
+// encoded with it's corresponding coder.
+//

Review comment:
   That's right. Nullability is represented on the FieldType: 
https://github.com/apache/beam/blob/d5a6c2f0ceb7b3e609ba8d97e523b3bded7d0459/model/pipeline/src/main/proto/schema.proto#L47
   
   We _could_ document that schemas should not accept null keys/values in 
`MapType` and reject them in the SDKs. But there would still be an unused 
nullable field on the key and value FieldType:
   
https://github.com/apache/beam/blob/d5a6c2f0ceb7b3e609ba8d97e523b3bded7d0459/model/pipeline/src/main/proto/schema.proto#L79-L82





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463311197



##
File path: model/pipeline/src/main/proto/beam_runner_api.proto
##
@@ -855,10 +855,21 @@ message StandardCoders {
 // BOOLEAN:   beam:coder:bool:v1
 // BYTES: beam:coder:bytes:v1
 //   ArrayType:   beam:coder:iterable:v1 (always has a known length)
-//   MapType: not yet a standard coder (BEAM-7996)
+//   MapType: not a standard coder, specification defined below.
 //   RowType: beam:coder:row:v1
 //   LogicalType: Uses the coder for its representation.
 //
+// The MapType is encoded by:
+//   - An INT32 representing the size of the map (N)
+//   - Followed by N interleaved keys and values, encoded with their
+// corresponding coder.
+//
+// Nullable types in container types (ArrayType, MapType) are encoded by:
+//   - A one byte null indicator, 0x00 for null values, or 0x01 for present
+// values.
+//   - For present values the null indicator is followed by the value
+// encoded with it's corresponding coder.
+//

Review comment:
   This is specifically for nullable types that are elements of an Array or 
keys/values of a Map. For rows we encode nulls in a separate bitmask:
   
https://github.com/apache/beam/blob/587dde57cbb2b0095a1fa04b59798d1b62c66f18/model/pipeline/src/main/proto/beam_runner_api.proto#L837-L843
   
   Java schemas will already let you use nullable types for keys and values in 
Maps. I doubt anyone is relying on it.. but there's a risk if we disallow it 
now.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463311197



##
File path: model/pipeline/src/main/proto/beam_runner_api.proto
##
@@ -855,10 +855,21 @@ message StandardCoders {
 // BOOLEAN:   beam:coder:bool:v1
 // BYTES: beam:coder:bytes:v1
 //   ArrayType:   beam:coder:iterable:v1 (always has a known length)
-//   MapType: not yet a standard coder (BEAM-7996)
+//   MapType: not a standard coder, specification defined below.
 //   RowType: beam:coder:row:v1
 //   LogicalType: Uses the coder for its representation.
 //
+// The MapType is encoded by:
+//   - An INT32 representing the size of the map (N)
+//   - Followed by N interleaved keys and values, encoded with their
+// corresponding coder.
+//
+// Nullable types in container types (ArrayType, MapType) are encoded by:
+//   - A one byte null indicator, 0x00 for null values, or 0x01 for present
+// values.
+//   - For present values the null indicator is followed by the value
+// encoded with it's corresponding coder.
+//

Review comment:
   This is specifically for nullable types that are elements of an Array or 
keys/values of a Map. For rows we nulls in a separate bitmask:
   
https://github.com/apache/beam/blob/587dde57cbb2b0095a1fa04b59798d1b62c66f18/model/pipeline/src/main/proto/beam_runner_api.proto#L837-L843
   
   Java schemas will already let you use nullable types for keys and values in 
Maps. I doubt anyone is relying on it.. but there's a risk if we disallow it 
now.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463308693



##
File path: sdks/python/apache_beam/coders/coder_impl.py
##
@@ -530,6 +530,88 @@ def estimate_size(self, unused_value, nested=False):
 return 1
 
 
+class MapCoderImpl(StreamCoderImpl):
+  """For internal use only; no backwards-compatibility guarantees.
+
+  A coder for typing.Mapping objects."""
+  def __init__(
+  self,
+  key_coder,  # type: CoderImpl
+  value_coder  # type: CoderImpl
+  ):
+self._key_coder = key_coder
+self._value_coder = value_coder
+
+  def encode_to_stream(self, value, out, nested):
+size = len(value)
+out.write_bigendian_int32(size)

Review comment:
   Yeah I thought about that too. I was just trying to exactly replicate 
what we do in [MapCoder.java] 
(https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/MapCoder.java)
 for now though. Do you think I should switch this and MapCoder.java over to 
using varint here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463308917



##
File path: sdks/python/apache_beam/coders/coder_impl.py
##
@@ -530,6 +530,88 @@ def estimate_size(self, unused_value, nested=False):
 return 1
 
 
+class MapCoderImpl(StreamCoderImpl):
+  """For internal use only; no backwards-compatibility guarantees.
+
+  A coder for typing.Mapping objects."""
+  def __init__(
+  self,
+  key_coder,  # type: CoderImpl
+  value_coder  # type: CoderImpl
+  ):
+self._key_coder = key_coder
+self._value_coder = value_coder
+
+  def encode_to_stream(self, value, out, nested):
+size = len(value)
+out.write_bigendian_int32(size)
+for i, kv in enumerate(value.items()):
+  key, value = kv
+  last = i == size - 1

Review comment:
   Same comment as above, this was just an attempt to replicate the Java 
logic. I could change them both if you think we should





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463308693



##
File path: sdks/python/apache_beam/coders/coder_impl.py
##
@@ -530,6 +530,88 @@ def estimate_size(self, unused_value, nested=False):
 return 1
 
 
+class MapCoderImpl(StreamCoderImpl):
+  """For internal use only; no backwards-compatibility guarantees.
+
+  A coder for typing.Mapping objects."""
+  def __init__(
+  self,
+  key_coder,  # type: CoderImpl
+  value_coder  # type: CoderImpl
+  ):
+self._key_coder = key_coder
+self._value_coder = value_coder
+
+  def encode_to_stream(self, value, out, nested):
+size = len(value)
+out.write_bigendian_int32(size)

Review comment:
   Yeah I thought about that too. I was just trying to exactly replicate 
what we do in 
[MapCoder.java](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/MapCoder.java)
 for now though. Do you think I should switch this and MapCoder.java over to 
using varint here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463194957



##
File path: model/pipeline/src/main/proto/beam_runner_api.proto
##
@@ -855,10 +855,21 @@ message StandardCoders {
 // BOOLEAN:   beam:coder:bool:v1
 // BYTES: beam:coder:bytes:v1
 //   ArrayType:   beam:coder:iterable:v1 (always has a known length)
-//   MapType: not yet a standard coder (BEAM-7996)
+//   MapType: not a standard coder, specification defined below.
 //   RowType: beam:coder:row:v1
 //   LogicalType: Uses the coder for its representation.
 //
+// The MapType is encoded by:
+//   - An INT32 representing the size of the map (N)
+//   - Followed by N interleaved keys and values, encoded with their
+// corresponding coder.
+//
+// Nullable types in container types (ArrayType, MapType) are encoded by:
+//   - A one byte null indicator, 0x00 for null values, or 0x01 for present
+// values.
+//   - For present values the null indicator is followed by the value
+// encoded with it's corresponding coder.
+//

Review comment:
   FYI @lostluck - previously the encoding for MapType and for nullable 
types within containers was not documented.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [beam] TheNeuralBit commented on a change in pull request #12426: [BEAM-7996] Add support for MapType and Nulls in container types for Python RowCoder

2020-07-30 Thread GitBox


TheNeuralBit commented on a change in pull request #12426:
URL: https://github.com/apache/beam/pull/12426#discussion_r463194527



##
File path: 
runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java
##
@@ -366,12 +368,15 @@ private static Object parseField(Object value, 
Schema.FieldType fieldType) {
 .map((element) -> parseField(element, 
fieldType.getCollectionElementType()))
 .collect(toImmutableList());
   case MAP:
-Map kvMap = (Map) value;
-return kvMap.entrySet().stream()
-.collect(
-toImmutableMap(
-(pair) -> parseField(pair.getKey(), 
fieldType.getMapKeyType()),
-(pair) -> parseField(pair.getValue(), 
fieldType.getMapValueType(;
+Map kvMap = new HashMap<>();
+((Map) value)
+.entrySet().stream()
+.forEach(
+(entry) ->
+kvMap.put(
+parseField(entry.getKey(), 
fieldType.getMapKeyType()),
+parseField(entry.getValue(), 
fieldType.getMapValueType(;
+return kvMap;

Review comment:
   This change is necessary because `ImmutableMap` (as well as 
`Collectors.toMap`) does not allow null values, so it errors on the new test 
cases.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org