Chang She created ARROW-18163:
---------------------------------
Summary: [python] registering new data formats
Key: ARROW-18163
URL: https://issues.apache.org/jira/browse/ARROW-18163
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Affects Versions: 9.0.0
Reporter: Chang She
Context: we're creating a new data format for computer vision
(https://github.com/eto-ai/lance) with a C++ core.
We've implemented the integration so you can read Lance datasets into pyarrow
like:
```python
import lance
import pyarrow.dataset as ds
ds.dataset(uri, format=lance.LanceFileFormat())
```
Would it be possible to create a file format registry? Something like:
```python
ds.register_file_format(
    ext='lance',
    format=lance.LanceFileFormat(),
    dataset=lance.FileSystemDataset,
)
```
which would make `ds.dataset('/my/dataset.lance')` work directly.
The optional third argument would help expose format-specific
optimizations. For example, Lance has much better random access performance, so
pushing limit/offset parameters down to Lance allows for much faster paging,
especially over deeply nested data and/or image blobs.
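To make the idea concrete, here is a rough, pure-Python sketch of how such a registry could behave. It is not existing pyarrow API: `register_file_format`, `lookup_file_format`, `FormatEntry`, and the module-level `_FILE_FORMATS` dict are all hypothetical names used only for illustration, and the sketch assumes formats are keyed by file extension alone.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FormatEntry:
    """One registered format (hypothetical container, not pyarrow API)."""
    format: Any                     # e.g. a FileFormat instance
    dataset: Optional[type] = None  # optional format-specific dataset class


# Hypothetical registry keyed by lowercase extension without the leading dot.
_FILE_FORMATS: Dict[str, FormatEntry] = {}


def register_file_format(ext, format, dataset=None):
    """Associate a file extension with a format and an optional dataset class."""
    key = ext.lower().lstrip(".")
    if not key:
        raise ValueError("ext must be a non-empty file extension")
    _FILE_FORMATS[key] = FormatEntry(format=format, dataset=dataset)


def lookup_file_format(uri):
    """Return the entry registered for the extension of `uri`, or None."""
    # Strip a trailing slash so directory-style dataset paths still match.
    ext = os.path.splitext(uri.rstrip("/"))[1]
    return _FILE_FORMATS.get(ext.lower().lstrip("."))
```

With something like this in place, `ds.dataset(uri)` could call `lookup_file_format(uri)` when no `format` is given, and construct the registered `dataset` class (if any) instead of the default one. How that lookup would be wired into pyarrow itself is left open here.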
Thanks!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)