We're working with rows that are at least 50-100 fields easily, so the savings are non-trivial, especially over a few billion rows that a job would process. :)
> existing overhead of checking each field for nullness when writing the bit set coincidentally I recently optimized this on the java side for exactly this reason! On Tue, Oct 12, 2021 at 5:15 PM Robert Burke <[email protected]> wrote: > At present the Go SDK has not been validated for updating streaming > pipelines etc for coders and such. > > I don't think it would be difficult to expand what is acceptable on the > encoded side, and then also encode values more compactly if possible. > > I do question if the compactness is even worthwhile: the type would need > 8+ trailing nil fields for the benefit to even exist and the benefit is a > single byte. One would need very wide rows indeed to drop 8bytes or more, > and i suspect the existing overhead of checking each field for nullness > when writing the bit set is already more than dropping a handful of bytes > per copy on the wire. > > On Tue, Oct 12, 2021, 2:04 PM Brian Hulette <[email protected]> wrote: > >> I wrote that spec, my intention was to document what the Java >> implementation was doing. I'm sorry my poor description has gotten us in >> this situation. I agree with Reuven that we should avoid changing the Java >> implementation for a few reasons: >> - As Reuven pointed out, we support update compatibility for row coder in >> Java. This isn't true for Python, Go (is it?) >> - As Steve pointed out, it's more efficient. >> - Personally, my intention from the outset was to match the Java >> implementation. I'd prefer to fix my mistake. >> >> I can also provide a few pointers for the original questions: >> - Regarding Python support for additional types, see BEAM-7996 [1] >> - Regarding cross-language testing, see tests for beam:coder:row:v1 in >> standard_coders.yaml [2] >> >> Brian >> >> [1] https://issues.apache.org/jira/browse/BEAM-7996 >> [2] >> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml >> >> >> On Tue, Oct 12, 2021 at 1:55 PM Steve Niemitz <[email protected]> >> wrote: >> >>> imo supporting stripping trailing 0 bytes from the bitset encoding is >>> the way to go. It's both backwards compatible with existing serialized >>> rows, as well as more space efficient. >>> >>> On Tue, Oct 12, 2021 at 4:52 PM Reuven Lax <[email protected]> wrote: >>> >>>> Of course. But if we write a spec that doesn't agree with the existing >>>> Java coders, that is also futile. We can't easily change Java coders due to >>>> update compatibility concerns, and we definitely need portability to work >>>> with Java! >>>> >>>> On Tue, Oct 12, 2021 at 1:47 PM Robert Burke <[email protected]> >>>> wrote: >>>> >>>>> If it's not in the spec it's not Beam, because any alternative is >>>>> Anti Portability ;) >>>>> >>>>> On Tue, Oct 12, 2021, 1:45 PM Reuven Lax <[email protected]> wrote: >>>>> >>>>>> Row just uses the existing Java BitSetCoder, which predates the >>>>>> writing of that spec :) >>>>>> >>>>>> On Tue, Oct 12, 2021 at 1:42 PM Robert Burke <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> The null fields bitset encoder is defines in the pipeline runner >>>>>>> proto here: >>>>>>> https://github.com/apache/beam/blob/4b11efdf96ea4a471e078ec49906c40ef033aafb/model/pipeline/src/main/proto/beam_runner_api.proto#L976 >>>>>>> >>>>>>> Per my reading of the spec, the bit set must include the ceiling of >>>>>>> num_fields/8 bytes, as it doesn't say "trailing bytes for non-nil in >>>>>>> fields >>>>>>> may be dropped". However it might be interpreted as that by the that an >>>>>>> empty byte array indicating no nils. This is what go implements in the >>>>>>> coder.WriteRowHeader and coder.ReadRowHeader functions. >>>>>>> >>>>>>> But that strikes me as a special case for fully populated rows, not >>>>>>> a natural extension of a poorly phrased general rule. >>>>>>> >>>>>>> On Tue, Oct 12, 2021, 1:31 PM Reuven Lax <[email protected]> wrote: >>>>>>> >>>>>>>> Do you think that BitSetCoder is incorrect? >>>>>>>> >>>>>>>> On Tue, Oct 12, 2021 at 1:27 PM Steve Niemitz <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Yeah I believe they're all bugs/missing features in the python >>>>>>>>> implementation. The nullable BitSet one is arguably a bug in the java >>>>>>>>> implementation, but since there's no low-level spec on how Rows are >>>>>>>>> actually encoded it's hard to say who's right. I think Go might have >>>>>>>>> the >>>>>>>>> same bug there, in which case that's two languages doing it "wrong" >>>>>>>>> and one >>>>>>>>> doing it "right". :P >>>>>>>>> >>>>>>>>> On Tue, Oct 12, 2021 at 4:20 PM Reuven Lax <[email protected]> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> These are bugs in Python, correct? >>>>>>>>>> >>>>>>>>>> On Tue, Oct 12, 2021 at 1:18 PM Steve Niemitz < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> It seems like there's a good amount of incompatibility between >>>>>>>>>>> java and python wrt beam Rows. For example the following are >>>>>>>>>>> unsupported >>>>>>>>>>> in python (that I've noticed so far) >>>>>>>>>>> - BYTE >>>>>>>>>>> - INT16 >>>>>>>>>>> - OneOf >>>>>>>>>>> >>>>>>>>>>> Additionally, it seems like nullable fields don't really work >>>>>>>>>>> correctly, the java BitSetCoder won't encoding trailing empty bytes >>>>>>>>>>> in the >>>>>>>>>>> BitSet, but the python side is expecting every num_fields / 8 bytes >>>>>>>>>>> to be >>>>>>>>>>> present. [1] >>>>>>>>>>> >>>>>>>>>>> Certainly these are bugs, but in general it seems to point to a >>>>>>>>>>> lack of integration testing for xlang interop in general. I plan on >>>>>>>>>>> submitting PRs to fix the bugs above (or at least some of them), >>>>>>>>>>> are there >>>>>>>>>>> tests I can change to better exercise these paths? >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/row_coder.py#L198 >>>>>>>>>>> >>>>>>>>>>
