[GitHub] [arrow-datafusion] jorgecarleitao opened a new pull request #68: Experimenting with arrow2

GitBox Sun, 25 Apr 2021 07:23:42 -0700


jorgecarleitao opened a new pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68



   I have been experimenting with using `arrow2` and `parquet2` as backends for 
DataFusion. This is a work in progress and does not compile yet, but I would 
like to give some visibility to this work.
   
   So far, I have been able to keep all functionality, with a guaranteed 
increase in security and potentially some gain in performance.
   
   Goals:
   * compile
   * tests pass
   * re-write readers and writers to leverage parallelism of both `parquet2` 
and the CSV reader in arrow2
   * ???
   * profit
   
   -------------
   
   Some notes:
   
   * I removed `CastOptions` because casting does not need to be fallible; we 
can make any non-castable value null and recover the set of failed casts from 
the differences in validity between the original array and the cast array, if 
we so wish.
   
   * Most kernels in `arrow2` return a `Box<dyn Array>` [for some 
reasons](https://github.com/jorgecarleitao/arrow2/tree/main/src/compute#design);
 we use `result.into()` to convert it to an `Arc`. This is very cheap and the 
best that arrow has to offer without the unstable channel.
   
   * I removed `SchemaRef` and `ArrayRef` from `arrow2` because they are only 
relevant in the context of DataFusion, and replaced them by `type SchemaRef = 
Arc<Schema>` on DataFusion. We could also revert this on `arrow2`.
   
   * There are some changes in `min` and `max` for floats. Essentially, 
`arrow2` guarantees that the comparison operator used in `sort` for floats is 
the same as the one used in `min/max`, which required this small change because 
`Ord` for floats is still part of unstable Rust.
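
To illustrate the note on `CastOptions`, here is a minimal sketch of an infallible cast whose failures are recovered afterwards from the validity difference. It deliberately uses plain `Option` slices instead of `arrow2` arrays, and the names `cast_utf8_to_i32` and `failed_casts` are hypothetical, not part of either crate.

```rust
/// Casts each string to `i32`. A value that cannot be parsed becomes `None`
/// (null) instead of making the whole cast fail.
fn cast_utf8_to_i32(values: &[Option<&str>]) -> Vec<Option<i32>> {
    values
        .iter()
        .map(|&v| v.and_then(|s| s.parse::<i32>().ok()))
        .collect()
}

/// Returns the indices that were valid before the cast and are null after it,
/// i.e. the slots where the cast failed.
fn failed_casts<T, U>(original: &[Option<T>], casted: &[Option<U>]) -> Vec<usize> {
    assert_eq!(original.len(), casted.len());
    original
        .iter()
        .zip(casted.iter())
        .enumerate()
        .filter(|(_, (before, after))| before.is_some() && after.is_none())
        .map(|(i, _)| i)
        .collect()
}
```

Slots that were already null in the input stay null and are not reported as failures, so the caller can still tell "was null" apart from "could not be cast".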
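
The `result.into()` conversion mentioned in the note on kernels relies on the standard library's `From<Box<T>> for Arc<T>`, which also works for unsized `T` such as trait objects. The sketch below uses a hypothetical stand-in `Array` trait and a made-up `take_first` kernel rather than the real `arrow2` types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the `Array` trait; not the real `arrow2` trait.
trait Array {
    fn len(&self) -> usize;
}

struct Int32Array(Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// A made-up kernel in the style described above: it returns an owned,
/// boxed trait object.
fn take_first(values: &[i32], n: usize) -> Box<dyn Array> {
    Box::new(Int32Array(values.iter().copied().take(n).collect()))
}

/// Converts the boxed result into a shared, reference-counted array using
/// the standard `From<Box<T>> for Arc<T>` implementation.
fn shared_take_first(values: &[i32], n: usize) -> Arc<dyn Array> {
    take_first(values, n).into()
}
```

The conversion moves the array struct into a new reference-counted allocation; buffers owned by the struct are moved along with it and are not cloned.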
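
As a sketch of the last note, the snippet below routes `sort`, `min` and `max` through a single comparison function so that they always agree. It uses `f64::total_cmp`, which was stabilized in Rust 1.62, after this PR was opened; whether `arrow2` uses the same total order (in particular, where NaN sorts) is an assumption, and the function names are hypothetical.

```rust
use std::cmp::Ordering;

/// The single comparison shared by sorting and min/max.
/// `f64::total_cmp` implements the IEEE 754 total order, in which a positive
/// NaN compares greater than every other value.
fn cmp_f64(a: &f64, b: &f64) -> Ordering {
    a.total_cmp(b)
}

/// Sorts in place using the shared comparison.
fn sort_f64(values: &mut [f64]) {
    values.sort_by(cmp_f64);
}

/// Minimum under the shared comparison; `None` for an empty slice.
fn min_f64(values: &[f64]) -> Option<f64> {
    values.iter().copied().min_by(cmp_f64)
}

/// Maximum under the shared comparison; `None` for an empty slice.
fn max_f64(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(cmp_f64)
}
```

Because all three functions share `cmp_f64`, the minimum is always the first element of the sorted slice and the maximum is always the last, including when the input contains NaN.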


