kunwp1 opened a new issue, #3787:
URL: https://github.com/apache/texera/issues/3787

   `Problem`
   
   Texera processes data in relational tables, where values are currently 
inlined within tuples. Consider the case where a user uploads a file and wants 
to process it through multiple operators in Texera. The straightforward way to 
represent the file is as a binary attribute (i.e., a cell in a tuple).
   
   Sally’s use case:
   Sally uploads an 8 GB dataset and wants to process it with an R UDF operator in Texera. One way to do so is to use a file scan operator to embed the file in a tuple as a binary attribute and pass it downstream (e.g., scan → R UDF). However, this fails for at least four reasons:
   
   1. Java byte[] constraint - JVM arrays are indexed by int, so a single array 
is limited to ~2 GB. An 8 GB payload cannot be stored in a single tuple field.
   2. Operator messaging - Texera uses Akka for inter-operator communication, 
which ultimately serializes data into byte arrays or equivalent formats (Java 
serialization, Protobuf, Kryo). These inherit the same <2 GB ceiling.
   3. Parquet cell size constraint (via Apache Iceberg) - Parquet requires each 
cell value to fit within a single byte[], which is limited to <2 GB in the Java 
implementation.
   4. Arrow limitation - Apache Arrow also enforces a 2 GB maximum per array 
buffer.
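   All four limits above trace back to the same root cause: lengths indexed by a signed 32-bit int. A back-of-envelope sketch (the class name and the 8 GiB figure are just illustrative) shows why a single Java byte[] cannot hold Sally's upload, and how many int-indexed chunks it would take:

```java
public class ByteArrayLimit {
    public static void main(String[] args) {
        // JVM arrays are indexed by a signed 32-bit int, so the
        // theoretical upper bound on a byte[] is Integer.MAX_VALUE
        // bytes (~2 GiB); real JVMs cap it slightly lower.
        long maxArrayBytes = Integer.MAX_VALUE;        // 2_147_483_647
        long payloadBytes = 8L * 1024 * 1024 * 1024;   // 8 GiB upload

        // Minimum number of int-indexed byte[] chunks needed.
        long chunks = (payloadBytes + maxArrayBytes - 1) / maxArrayBytes;

        System.out.println(payloadBytes > maxArrayBytes); // true
        System.out.println(chunks);                       // 5
    }
}
```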
   
   `Design`
   
   - We’ll externalize large objects by replacing them with pointers. The main motivation is the R UDF operator, which relies on Arrow for serialization/deserialization, and Arrow can only handle objects smaller than 2 GB.
   - Assumptions: no fault tolerance for now, read-only usage, and each pointer 
represents a single cell.
   - We’ll introduce a BigObjectManager module to manage the object lifecycle. 
The details need to be refined. The tentative APIs are:
     - create(bigObjectFile) -> ptr
     - open(ptr) -> bigObjectFileStream
     - delete(ptr)
   - The lifecycle of a big object follows the workflow execution lifecycle:
     - Created when an operator outputs a value larger than 2 GB (or above a configurable threshold).
     - Deleted when the workflow execution is cleaned up.
   - We don't need reference counting in the current design if we bundle the lifecycle of a big object with the workflow execution.
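   The tentative APIs and lifecycle above could look roughly like the sketch below. Everything here (the names `BigObjectPointer`, `FileBigObjectManager`, the per-execution directory layout) is a hypothetical illustration of the design notes, not existing Texera code:

```java
import java.io.*;
import java.nio.file.*;
import java.util.UUID;

// Hypothetical pointer to an externalized big object: an opaque ID that
// the manager resolves to a storage location. Each pointer is one cell.
record BigObjectPointer(String id) implements Serializable {}

// Sketch of the tentative BigObjectManager API from the design notes.
interface BigObjectManager {
    BigObjectPointer create(InputStream bigObjectFile) throws IOException; // -> ptr
    InputStream open(BigObjectPointer ptr) throws IOException;             // -> stream
    void delete(BigObjectPointer ptr) throws IOException;
    // Bundled with the workflow execution lifecycle: called on cleanup.
    void deleteAll() throws IOException;
}

// File-backed illustration: one file per big object under a per-execution
// directory, so deleteAll() at workflow cleanup removes everything at once.
class FileBigObjectManager implements BigObjectManager {
    private final Path root;

    FileBigObjectManager(Path executionDir) throws IOException {
        this.root = Files.createDirectories(executionDir);
    }

    public BigObjectPointer create(InputStream in) throws IOException {
        String id = UUID.randomUUID().toString();
        Files.copy(in, root.resolve(id)); // spill the payload to disk
        return new BigObjectPointer(id);
    }

    public InputStream open(BigObjectPointer ptr) throws IOException {
        // Read-only streaming access; callers never materialize the
        // whole object in memory, sidestepping the 2 GB byte[] ceiling.
        return Files.newInputStream(root.resolve(ptr.id()));
    }

    public void delete(BigObjectPointer ptr) throws IOException {
        Files.deleteIfExists(root.resolve(ptr.id()));
    }

    public void deleteAll() throws IOException {
        try (var files = Files.list(root)) {
            for (Path p : (Iterable<Path>) files::iterator) Files.deleteIfExists(p);
        }
        Files.deleteIfExists(root);
    }
}
```

   Under this sketch, only the small `BigObjectPointer` travels through Akka and gets written to Parquet/Arrow cells, while the payload itself stays on disk.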
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
