Chang She created ARROW-18163:
---------------------------------

             Summary: [python] registering new data formats
                 Key: ARROW-18163
                 URL: https://issues.apache.org/jira/browse/ARROW-18163
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 9.0.0
            Reporter: Chang She


Context: we're creating a new data format for computer vision 
(https://github.com/eto-ai/lance) with a C++ core.

We've implemented the integration so you can read Lance datasets into pyarrow 
like:

```python
import lance
import pyarrow.dataset as ds

uri = '/my/dataset.lance'  # path or URI of a Lance dataset
dataset = ds.dataset(uri, format=lance.LanceFileFormat())
```

Would it be possible to create a file format registry? Something like:

```python
ds.register_file_format(
    ext='lance', 
    format=lance.LanceFileFormat(),
    dataset=lance.FileSystemDataset
)
```

which would let `ds.dataset('/my/dataset.lance')` pick up the Lance format 
from the file extension and execute successfully.
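
To make the request more concrete, here is a minimal pure-Python sketch of how 
an extension-keyed registry could behave. Nothing here is existing pyarrow 
API: `register_file_format` is the proposed function, while `FormatEntry`, 
`lookup_file_format`, and the `Fake*` stand-in classes are hypothetical names 
used only so the example runs without lance installed.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FormatEntry:
    """Hypothetical registry record for one file extension."""
    format: Any                      # e.g. lance.LanceFileFormat()
    dataset: Optional[type] = None   # optional format-specific dataset class


# Module-level registry keyed by lowercase extension without the dot.
_REGISTRY: Dict[str, FormatEntry] = {}


def register_file_format(ext, format, dataset=None):
    """Proposed API (not in pyarrow): map an extension to a format."""
    _REGISTRY[ext.lower().lstrip(".")] = FormatEntry(format, dataset)


def lookup_file_format(path):
    """Return the entry registered for the path's extension, or None."""
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    return _REGISTRY.get(ext)


# Stand-ins for lance.LanceFileFormat and lance.FileSystemDataset.
class FakeLanceFormat:
    pass


class FakeLanceDataset:
    pass


register_file_format(
    ext="lance",
    format=FakeLanceFormat(),
    dataset=FakeLanceDataset,
)

entry = lookup_file_format("/my/dataset.lance")
unknown = lookup_file_format("/my/dataset.unknown")
```

With something like this in place, `ds.dataset(path)` could consult the 
registry whenever no explicit `format=` is given and fall back to its current 
behavior when the lookup returns nothing.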


The optional third argument would help expose format-specific optimizations. 
For example, Lance has much better random access performance, so pushing 
limit/offset parameters down to Lance allows for much faster paging, 
especially over deeply nested data and/or image blobs.
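
As a toy illustration of why a format-specific dataset class matters for 
paging, the sketch below compares a reader that must scan up to the requested 
offset with one that can jump straight to it. Both classes and the `page` 
method are hypothetical and operate on an in-memory list; they are not the 
Lance or pyarrow implementations, only a model of the row counts involved.

```python
class ScanningDataset:
    """Generic fallback: reaches a page by reading every row before it."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0  # how many rows were touched to serve requests

    def page(self, limit, offset):
        out = []
        for i, row in enumerate(self._rows):
            self.rows_read += 1
            if i >= offset:
                out.append(row)
            if len(out) == limit:
                break
        return out


class RandomAccessDataset:
    """Format-aware reader: limit/offset are pushed down to the storage."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0

    def page(self, limit, offset):
        out = self._rows[offset:offset + limit]
        self.rows_read += len(out)
        return out


rows = list(range(1000))
generic = ScanningDataset(rows)
fast = RandomAccessDataset(rows)

generic_page = generic.page(limit=10, offset=900)
fast_page = fast.page(limit=10, offset=900)

print(generic_page == fast_page)           # True
print(generic.rows_read, fast.rows_read)   # 910 10
```

Both readers return the same page, but the scanning one touches every row up 
to the end of the page, which is the cost the optional `dataset` argument is 
meant to avoid.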

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
