jorgecarleitao opened a new pull request #68: URL: https://github.com/apache/arrow-datafusion/pull/68
I have been experimenting with using arrow2 & parquet2 as backends for DataFusion. This is WIP and does not compile yet, but I would like to give some visibility to this work. So far, I was able to keep all functionality, with a guaranteed increase in security and potentially some performance.

Goals:

* compile
* tests pass
* re-write readers and writers to leverage parallelism of both `parquet2` and the CSV reader in arrow2
* ???
* profit

-------------

Some notes:

* I removed `CastOptions` because casting does not need to be fallible; we can make any non-castable value null and recover the set of failed casts from the differences in validity between the original array and the casted array, if we so wish.
* most kernels in Arrow2 return a `Box<dyn Array>` [for some reasons](https://github.com/jorgecarleitao/arrow2/tree/main/src/compute#design); we use `result.into()` to convert to `Arc`. This is very cheap and the best that arrow has to offer without the unstable channel.
* I removed `SchemaRef` and `ArrayRef` from `arrow2` because they are only relevant in the context of DataFusion, and replaced them by `type SchemaRef = Arc<Schema>` on DataFusion. We could also revert this on `arrow2`.
* There are some changes in `min` and `max` for floats. Essentially, `arrow2` guarantees that the comparison operator used in `sort` for floats is the same as the one used in `min/max`, which required this small change because `Ord` for floats is still part of unstable Rust.
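
To illustrate the note on `CastOptions`, here is a minimal sketch of an infallible cast whose failures can be recovered afterwards from validity differences. It uses plain `Option` slices from the standard library instead of arrow2 arrays and validity bitmaps; `cast_utf8_to_i32` and `failed_casts` are hypothetical names for illustration only, not arrow2 or DataFusion APIs.

```rust
/// Hypothetical infallible cast: any string that does not parse as `i32`
/// becomes null (`None`) instead of producing an error.
fn cast_utf8_to_i32(values: &[Option<&str>]) -> Vec<Option<i32>> {
    values
        .iter()
        .map(|v| (*v).and_then(|s| s.trim().parse::<i32>().ok()))
        .collect()
}

/// Indices that were valid before the cast and null after it.
/// These are exactly the slots where the cast failed; slots that were
/// already null in the input are not reported.
fn failed_casts<T, U>(original: &[Option<T>], casted: &[Option<U>]) -> Vec<usize> {
    original
        .iter()
        .zip(casted.iter())
        .enumerate()
        .filter(|(_, (o, c))| o.is_some() && c.is_none())
        .map(|(i, _)| i)
        .collect()
}
```

A caller that wants strict semantics can run the cast and then treat a non-empty result from `failed_casts` as an error, while callers that accept nulls pay nothing extra.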

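The `result.into()` conversion mentioned in the notes relies on the standard library's `From<Box<T>> for Arc<T>`, which also covers unsized `T` such as trait objects. A minimal sketch follows, assuming a stand-in `Array` trait and a hypothetical `add_one` kernel; neither is arrow2's actual definition.

```rust
use std::sync::Arc;

/// Stand-in for an array trait; hypothetical, not arrow2's actual `Array`.
trait Array {
    fn len(&self) -> usize;
}

/// Hypothetical concrete array backed by a `Vec<i32>`.
struct Int32Array(Vec<i32>);

impl Array for Int32Array {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// Alias defined on the consumer side, mirroring `type SchemaRef = Arc<Schema>`.
type ArrayRef = Arc<dyn Array>;

/// Hypothetical kernel that returns a boxed trait object.
fn add_one(values: &[i32]) -> Box<dyn Array> {
    Box::new(Int32Array(values.iter().map(|v| v + 1).collect()))
}

/// Converts the boxed result into a shared reference.
/// `into()` moves the array struct into a new `Arc` allocation;
/// the heap data owned by the inner `Vec` is not copied.
fn add_one_shared(values: &[i32]) -> ArrayRef {
    add_one(values).into()
}
```

Cloning the resulting `ArrayRef` only bumps a reference count, so the array can be shared across plans or threads after this single conversion.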