AviKndr opened a new issue, #39048:
URL: https://github.com/apache/beam/issues/39048
### What happened?
`apache_beam.coders.VarIntCoder` fails to encode Python integers in the
unsigned 64-bit range `[2**63, 2**64)`. When the Cython-compiled coders are in
use, encoding such a value raises `OverflowError`, even though the value has a
well-defined 64-bit VarInt representation.
This commonly bites users who pass values that are naturally unsigned 64-bit
— hashes, IDs, values read from systems that expose `uint64`/`unsigned long`
columns, bit masks, etc. — through a pipeline that uses `VarIntCoder`
(directly, or indirectly via `TupleCoder`, `IterableCoder`, key coders, etc.).
**Minimal reproduction:**
```python
from apache_beam.coders import VarIntCoder
c = VarIntCoder()
c.encode(2**63 - 1) # OK -> max signed int64
c.encode(2**63) # OverflowError on compiled coders (first value above the int64 maximum)
c.encode(2**64 - 1) # OverflowError (max uint64)
```
### Root cause
In the Cython build, the stream methods declare a **signed** `int64_t`
parameter:
```cython
# sdks/python/apache_beam/coders/stream.pyx
cpdef write_var_int64(self, libc.stdint.int64_t signed_v):
    # The body already operates on the unsigned bit pattern.
    cdef libc.stdint.uint64_t v = signed_v
    ...
```
The method body immediately re-casts to `uint64_t` and would encode the bit
pattern correctly. But Cython converts the incoming Python int to the signed
`int64_t` parameter **at the call boundary, before the body runs**, so any
value `> 2**63 - 1` is rejected. `get_varint_size` (used by `estimate_size`)
has the same signature and the same problem.
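The ordering described above can be emulated in plain Python. This is only an illustrative sketch: `as_int64_param` and `write_var_int64_emulated` are hypothetical names standing in for the conversion Cython performs at the call boundary and for the first line of the method body, and the error message text is invented for the example.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def as_int64_param(value: int) -> int:
    """Hypothetical stand-in for the Python int -> int64_t argument conversion."""
    if not INT64_MIN <= value <= INT64_MAX:
        # Illustrative message only; not the exact text raised by Cython.
        raise OverflowError("value does not fit in a signed 64-bit parameter")
    return value


def write_var_int64_emulated(value: int) -> int:
    """Return the unsigned bit pattern the method body would go on to encode."""
    signed_v = as_int64_param(value)  # runs first and may raise
    return signed_v & UINT64_MASK     # the uint64_t re-cast inside the body
```

With this model, `write_var_int64_emulated(-1)` reaches the body and yields `2**64 - 1`, while `write_var_int64_emulated(2**64 - 1)` raises before the body is ever entered, even though both describe the same 64 bits.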
The encoding itself is unambiguous: a uint64 value `v` and the signed int64
`v - 2**64` (its two's-complement twin) produce **identical** VarInt bytes —
which is exactly how Java's signed `VarIntCoder` already behaves on the wire.
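To make the "identical bytes" claim concrete, here is a minimal standalone base-128 VarInt encoder in pure Python. It is a sketch written for this issue, not Beam's implementation; the function name `encode_var_int64` and the range check are assumptions chosen to match the range proposed in this report.

```python
def encode_var_int64(v: int) -> bytes:
    """Encode v as a VarInt using its 64-bit two's-complement bit pattern."""
    if not -2**63 <= v < 2**64:
        raise OverflowError("value has no 64-bit representation")
    bits = v & 0xFFFFFFFFFFFFFFFF  # a negative value maps to its unsigned twin
    out = bytearray()
    while True:
        low7 = bits & 0x7F
        bits >>= 7
        if bits:
            out.append(low7 | 0x80)  # continuation bit: more groups follow
        else:
            out.append(low7)         # final group has no continuation bit
            return bytes(out)


# A uint64 value and its signed twin share one encoding.
assert encode_var_int64(2**64 - 1) == encode_var_int64(-1)
assert encode_var_int64(2**63) == encode_var_int64(-2**63)
```

Because the mask is applied before any bytes are emitted, the encoder never needs to know whether the caller thought of the value as signed or unsigned.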
### What you expected to happen
`VarIntCoder` should encode any value whose 64-bit two's-complement
representation is well-defined (range `[-2**63, 2**64)`), producing bytes that
are wire-compatible with the signed Java `VarIntCoder`, rather than raising
`OverflowError` for the unsigned half of the 64-bit range.
### Notes / scope
- Decoding is and remains **signed** (`read_var_int64` returns `int64_t`),
matching Java — so `decode(encode(2**64 - 1)) == -1`. This issue is about
removing the spurious encode-time crash and guaranteeing correct wire bytes,
not about adding an unsigned round-trip (that would be a separate coder).
- Values `>= 2**64` are genuinely out of range and should still raise.
- The pure-Python (`slow_stream`) fallback does not hit the `OverflowError`
because it uses arbitrary-precision ints, so the symptom is specific to the
compiled coders — but a fix should keep both paths byte-identical.
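Until a fix lands, a user-side workaround consistent with the notes above is to convert values to their signed twin before encoding and to reinterpret decoded values as unsigned afterwards. The helpers below are a hypothetical sketch (`to_signed_twin` and `to_unsigned` are not Beam APIs) and use only integer arithmetic.

```python
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def to_signed_twin(v: int) -> int:
    """Map a value in [0, 2**64) to the int64 with the same bit pattern."""
    if not 0 <= v < 2**64:
        raise OverflowError("value is outside the uint64 range")
    return v - 2**64 if v >= 2**63 else v


def to_unsigned(v: int) -> int:
    """Reinterpret a decoded int64 in [-2**63, 2**63) as a uint64."""
    if not -2**63 <= v < 2**63:
        raise OverflowError("value is outside the int64 range")
    return v & UINT64_MASK


# Round trip on the two ends of the affected range.
assert to_unsigned(to_signed_twin(2**63)) == 2**63
assert to_unsigned(to_signed_twin(2**64 - 1)) == 2**64 - 1
```

Assuming the compiled coder accepts the full signed range as described in this report, `VarIntCoder().encode(to_signed_twin(v))` should avoid the `OverflowError`, and applying `to_unsigned` to the decoded result recovers the original value.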
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [x] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Infrastructure
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Prism Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.