kunwp1 opened a new issue, #3787:
URL: https://github.com/apache/texera/issues/3787
`Problem`
Texera processes data in relational tables, where values are currently
inlined within tuples. Consider the case where a user uploads a file and wants
to process it through multiple operators in Texera. The straightforward way to
represent the file is as a binary attribute (i.e., a cell in a tuple).
Sally’s use case:
Sally uploads an 8 GB dataset and wants to process it with an R UDF operator
in Texera. One way to do so is to use a file scan operator that embeds the file in a
tuple as a binary attribute and passes it downstream (e.g., scan → R UDF).
However, this fails for at least four reasons:
1. Java byte[] constraint - JVM arrays are indexed by int, so a single array
is limited to ~2 GB. An 8 GB payload cannot be stored in a single tuple field (see the sketch after this list).
2. Operator messaging - Texera uses Akka for inter-operator communication,
which ultimately serializes data into byte arrays or equivalent formats (Java
serialization, Protobuf, Kryo). These inherit the same <2 GB ceiling.
3. Parquet cell size constraint (via Apache Iceberg) - Parquet requires each
cell value to fit within a single byte[], which is limited to <2 GB in the Java
implementation.
4. Arrow limitation - Apache Arrow also enforces a 2 GB maximum per array
buffer.
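
To make the byte[] and buffer constraints concrete, here is a minimal, hypothetical Scala sketch (not Texera code) of the arithmetic: JVM arrays are Int-indexed, so an 8 GB payload has to be chunked or kept outside the tuple.

```scala
object BigPayloadLimit {
  def main(args: Array[String]): Unit = {
    val eightGiB: Long = 8L * 1024 * 1024 * 1024 // 8,589,934,592 bytes

    // Array sizes are Int-indexed, so a single array tops out near Int.MaxValue (~2 GiB):
    // new Array[Byte](eightGiB)      // does not compile: Long where Int is expected
    // new Array[Byte](Int.MaxValue)  // compiles, but usually fails with OutOfMemoryError

    // Keeping the payload in-tuple would therefore require chunking; e.g., with a
    // hypothetical 64 MiB chunk size, an 8 GiB file becomes 128 separate pieces.
    val chunkSize: Long = 64L * 1024 * 1024
    val numChunks = (eightGiB + chunkSize - 1) / chunkSize
    println(s"8 GiB payload = $numChunks chunks of $chunkSize bytes (max array ~${Int.MaxValue} bytes)")
  }
}
```

The same Int-sized buffer ceiling shows up again in Akka serialization, Parquet cell values, and Arrow buffers, which is part of what motivates externalizing the object instead of inlining it.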
`Design`
- We’ll externalize large objects using pointers. The main motivation is the
R UDF operator: it relies on Arrow for serialization/deserialization, and Arrow
can only handle objects smaller than 2 GB.
- Assumptions: no fault tolerance for now, read-only usage, and each pointer
represents a single cell.
- We’ll introduce a BigObjectManager module to manage the object lifecycle.
The details still need to be refined. The tentative APIs are (a rough sketch follows this list):
- create(bigObjectFile) -> ptr
- open(ptr) -> bigObjectFileStream
- delete(ptr)
- The lifecycle of a big object follows the workflow execution lifecycle:
  - Created when an operator outputs a value larger than 2 GB (or above a
threshold).
- Deleted when the workflow execution is cleaned up.
- We don't need reference counting in the current design if we bundle the
lifecycle of a big object with the workflow execution.
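
As a rough, hypothetical sketch (all names, the pointer layout, and the filesystem-backed storage are assumptions for illustration, not existing Texera code), the tentative create/open/delete APIs and the execution-scoped lifecycle could look like this:

```scala
import java.io.{File, FileInputStream, InputStream}
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.UUID
import scala.collection.mutable

// Hypothetical pointer to an externalized big object; a tuple cell would store only this.
final case class BigObjectPointer(executionId: String, objectId: String) {
  def uri: String = s"bigobject://$executionId/$objectId"
}

// Sketch of the tentative BigObjectManager APIs (create / open / delete) with the
// lifecycle bound to a workflow execution: everything created under an executionId
// is dropped when that execution is cleaned up, so no reference counting is needed.
class BigObjectManager(rootDir: Path) {
  private val objectsByExecution = mutable.Map.empty[String, mutable.Set[BigObjectPointer]]

  // create(bigObjectFile) -> ptr : copy the file into managed storage, return a pointer.
  def create(executionId: String, bigObjectFile: File): BigObjectPointer = {
    val ptr  = BigObjectPointer(executionId, UUID.randomUUID().toString)
    val dest = rootDir.resolve(executionId).resolve(ptr.objectId)
    Files.createDirectories(dest.getParent)
    Files.copy(bigObjectFile.toPath, dest, StandardCopyOption.REPLACE_EXISTING)
    objectsByExecution.getOrElseUpdate(executionId, mutable.Set.empty) += ptr
    ptr
  }

  // open(ptr) -> stream : read-only access for downstream operators (e.g., the R UDF).
  def open(ptr: BigObjectPointer): InputStream =
    new FileInputStream(rootDir.resolve(ptr.executionId).resolve(ptr.objectId).toFile)

  // delete(ptr) : remove a single object.
  def delete(ptr: BigObjectPointer): Unit = {
    Files.deleteIfExists(rootDir.resolve(ptr.executionId).resolve(ptr.objectId))
    objectsByExecution.get(ptr.executionId).foreach(_ -= ptr)
  }

  // Called when the workflow execution is cleaned up: delete every object it created.
  def cleanupExecution(executionId: String): Unit =
    objectsByExecution.remove(executionId).foreach(_.foreach(delete))
}
```

Tying every object to an executionId and cleaning up in one call (cleanupExecution) is what makes per-object reference counting unnecessary in this first cut; it could be added later if big objects ever need to outlive a single execution.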