[jira] [Created] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

Mahmut Bulut (Jira) Tue, 30 Jun 2020 03:29:31 -0700

Mahmut Bulut created ARROW-9275:
-----------------------------------

             Summary: [Rust] – Async Sans IO: R/W into/to Arrow Arrays
                 Key: ARROW-9275
                 URL: https://issues.apache.org/jira/browse/ARROW-9275
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
            Reporter: Mahmut Bulut
            Assignee: Mahmut Bulut



This issue can be considered an epic-level issue that spans other Arrow 
projects.

*Drill down*

Currently, traits like `ParquetReader` only allow a synchronous interface, 
which uses a BufReader with a constant 8KB buffer. Over the network, this 
becomes a problem. It can easily be solved with differential buffers. In 
addition to this shortcoming, an executor engine is needed to schedule from 
async trait methods to sync trait methods; it should sit somewhere in between 
to make requests to external IO asynchronous. On-disk IO is acceptable with the 
approach we currently have, since no reliable evented IO exists for on-disk IO 
on major platforms.
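
To make the fixed-buffer point concrete, here is a minimal sketch using only the standard library. The helper name `buffered_with` is hypothetical and is not part of the parquet crate. It only shows that a `BufReader`'s capacity is chosen once, at construction, so a reader over a network source would have to pick its size up front:

```rust
use std::io::{BufReader, Read};

/// Hypothetical helper (not an existing Arrow/Parquet API): wrap any reader
/// in a `BufReader` whose capacity is chosen by the caller instead of the
/// standard library default.
pub fn buffered_with<R: Read>(inner: R, capacity: usize) -> BufReader<R> {
    // The capacity is fixed here for the lifetime of the BufReader; it does
    // not grow or shrink in response to how the underlying source behaves.
    BufReader::with_capacity(capacity, inner)
}
```

For example, `buffered_with(source, 64 * 1024)` gives a 64KB buffer for a remote source. This is only an illustration of the current limitation, not the differential buffering proposed here.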

All of this considered, abstractions that offer asynchronous IO without 
depending on any particular executor need to be exposed.

 

*Design Suggestions & Considerations*

The design should apply and consider the following:
 * Sans IO (for more information about the Sans IO approach, please see 
[https://sans-io.readthedocs.io/]).
 * Not including any executor-specific data at all.
 * Tests should work with any executor with little to no modification.
 * Buffers should be adjusted accordingly, using differential buffers to 
optimize network trips.
 * Sync IO shouldn't be touched, at any cost. If we try to unify the sync IO 
traits or write overlapping implementations, that will make our life harder in 
the future. Sans IO should be compartmentalized.
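
As an illustration of the Sans IO idea in the list above, here is a minimal sketch of a decoder that never performs IO itself. Everything in it is an assumption made for the example: the names `FrameDecoder` and `Event`, and the toy wire layout (a 4-byte little-endian length followed by the payload), are hypothetical and are not taken from the Arrow or Parquet formats or crates.

```rust
/// What the decoder asks the caller to do next.
#[derive(Debug, PartialEq)]
pub enum Event {
    /// At least this many more bytes are needed before progress can be made.
    NeedData(usize),
    /// One complete frame payload, with the length prefix stripped.
    Frame(Vec<u8>),
}

/// Hypothetical sans-IO decoder for frames laid out as
/// `[u32 little-endian length][payload]`.
/// It holds no socket, file handle, or executor handle; it only sees bytes.
#[derive(Default)]
pub struct FrameDecoder {
    buf: Vec<u8>,
}

impl FrameDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Hand over bytes obtained by any means: a sync read, an async read
    /// on any executor, or a fixture in a test.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Try to make progress using only the bytes buffered so far.
    pub fn poll(&mut self) -> Event {
        const HEADER: usize = 4;
        if self.buf.len() < HEADER {
            return Event::NeedData(HEADER - self.buf.len());
        }
        let len = u32::from_le_bytes([
            self.buf[0],
            self.buf[1],
            self.buf[2],
            self.buf[3],
        ]) as usize;
        let total = HEADER + len;
        if self.buf.len() < total {
            // Tell the caller how much is still missing, so it can size
            // its next read accordingly.
            return Event::NeedData(total - self.buf.len());
        }
        let payload = self.buf[HEADER..total].to_vec();
        // Drop the consumed frame; bytes of the next frame stay buffered.
        self.buf.drain(..total);
        Event::Frame(payload)
    }
}
```

Because the decoder only consumes byte slices, the same type can be driven from a blocking `Read` loop or from an async read on any executor, and its tests need no executor at all. The `NeedData` hint is one possible way to let the caller adjust its read size; it is not meant as a definition of differential buffers.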

 

*Notes*

If the Sans IO approach is not taken, the project will:
 * use an extreme amount of dependencies.
 * not be compatible with other Rust code at all.
 * break currently working code that uses array ingestion.
 * make integration tests harder.
 * be really hard to adapt (in user projects) to completion-based APIs once 
they stabilize in the future.

Note that this suggestion is not about the in-flight format or any in-flight 
related information at the moment. It is purely about making on-disk, remote IO 
(provider backends like AWS etc.) async.

 

*Open points*

A couple of open points:
 * Identifying the traits that are going to be made async.
 * Designing the internal routines.
 * Choosing the package name to expose.
 * Gathering the traits into the designated packages for all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
