Ma77Ball opened a new issue, #5598:
URL: https://github.com/apache/texera/issues/5598

   ### What happened?
   
   `Tuple.as_dict()` in the Python engine performs a full recursive `deepcopy` 
of the tuple's field data on **every call**:
   
   ```python
   # amber/src/main/python/core/models/tuple.py
   def as_dict(self) -> "OrderedDict[str, Field]":
       for i in self.get_field_names():
           self.__getitem__(i)
       return deepcopy(self._field_data)   # clones every value, recursively
   ```
   
   `as_dict()` sits on a hot path: it backs `as_series()` (used per-tuple in 
the batch operator path, `operator.py:227`) and `as_key_value_pairs()` (used by 
`__hash__`, `__str__`, `validate_schema`). Because it is a `deepcopy`, the cost 
is proportional to the **total byte size** of all field values, not the number 
of fields. A tuple carrying a large field (for example a multi-MB 
binary/`largebinary` value) duplicates that entire payload on every read.
   
   The deepcopy is defensive: it isolates the returned dict so a caller cannot 
corrupt the tuple's internal state. But that protection is not actually needed 
given how the code is structured:
   
   - `as_dict()` has **zero callers outside the `Tuple` class**; its only two 
users (`as_series`, `as_key_value_pairs`) immediately build a *new* container 
(a pandas `Series` / a fresh list), so the raw dict is never retained and 
mutated.
   - The tuple's own mutators (`__setitem__`, `cast_to_schema`) only ever 
**reassign** dict slots; they never mutate a field value in place. So a tuple 
and a shallow-copied dict can safely share value references.
   - The single production caller of `as_series()` (`operator.py:227`) `pop`s 
the tuple out of the batch and discards it immediately after conversion, so no 
live tuple aliases the converted data.
   - Fan-out / broadcast does **not** rely on this deepcopy: cross-branch 
isolation is enforced by Arrow serialization on the send path 
(`output_manager.py:316` via the read-only `get_serialized_field`), which never 
calls `as_dict`/`as_series`.
   
   **Expected:** reading a tuple as a dict should be cheap (proportional to 
field count), not duplicate the full payload of every field on each read.
   
   **Fix applied (for review):** replace the per-read `deepcopy` with a shallow 
copy, preserving the documented "returns an independent dict" contract while 
dropping the recursive value cloning; remove the now-unused `from copy import 
deepcopy` import; document why a shallow copy is correct.
   
   ```python
   return self._field_data.copy()   # shallow copy: new dict, shared value refs
   ```
   
   ### How to reproduce?
   
   1. Build a `Tuple` with a large binary field, for example a 
`bytes`/`largebinary` value of several MB.
   2. Call `tuple_.as_dict()` (or `as_series()`) in a loop, for example 10,000 
times, mirroring a per-tuple operator read.
   3. Observe CPU time and peak memory: each call recursively clones the 
multi-MB value. Time/allocations scale with field byte size times call count.
   4. With the shallow-copy fix, per-call cost drops to O(number of fields) and 
is independent of field payload size.
   
   Affected path: `amber/src/main/python/core/models/tuple.py` -> `as_dict()` 
(line ~244), reached via `as_series()` in `core/models/operator.py:227`.
   
   
   
   ### Version/Branch
   
   1.3.0-incubating-SNAPSHOT (main)
   
   ### Commit Hash (Optional)
   
   _No response_
   
   ### What browsers are you seeing the problem on?
   
   _No response_
   
   ### Relevant log output
   
   ```shell
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to