[
https://issues.apache.org/jira/browse/ARROW-18163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Miles Granger updated ARROW-18163:
----------------------------------
Description:
Context: we're creating a new data format for computer vision
(https://github.com/eto-ai/lance) with a C++ core.
We've implemented the integration so you can read Lance datasets into pyarrow
like:
{code:python}
import lance
import pyarrow.dataset as ds
ds.dataset(uri, format=lance.LanceFileFormat())
{code}
Would it be possible to create a file format registry? Something like:
{code:python}
ds.register_file_format(
ext='lance',
format=lance.LanceFileFormat(),
dataset=lance.FileSystemDataset
)
{code}
which would allow `ds.dataset('/my/dataset.lance')` to execute successfully.
The optional third argument would help expose format-specific
optimizations. E.g., Lance has much better random access performance, so pushing
limit/offset parameters down to Lance allows for much faster paging, especially
over deeply nested data and/or image blobs.
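To make the request concrete, here is a rough sketch of how such a registry might behave. It is plain Python and does not touch pyarrow: `register_file_format`, `lookup_file_format`, `FakeLanceFormat` and `FakeLanceDataset` are all hypothetical names made up for illustration, and none of them exist in pyarrow today.

```python
import os

# Hypothetical registry: maps a lowercase extension (no dot) to a
# (format object, optional dataset class) pair. Not a pyarrow API.
_FORMAT_REGISTRY = {}


def register_file_format(ext, format, dataset=None):
    """Associate a file extension with a format and an optional dataset class."""
    _FORMAT_REGISTRY[ext.lower().lstrip(".")] = (format, dataset)


def lookup_file_format(path):
    """Return the (format, dataset class) registered for the path's extension, or None."""
    # Strip a trailing slash so directory-style datasets resolve as well.
    ext = os.path.splitext(path.rstrip("/"))[1].lstrip(".").lower()
    return _FORMAT_REGISTRY.get(ext)


# Stand-ins for lance.LanceFileFormat and lance.FileSystemDataset.
class FakeLanceFormat:
    pass


class FakeLanceDataset:
    pass


register_file_format("lance", FakeLanceFormat(), dataset=FakeLanceDataset)
fmt, dataset_cls = lookup_file_format("/my/dataset.lance")
```

With something along these lines, `ds.dataset()` could consult the registry when no `format` is given, and use the registered dataset class (if any) so that format-specific options such as limit/offset reach the format's own implementation.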
Thanks!
was:
Context: we're creating a new data format for computer vision
(https://github.com/eto-ai/lance) with a C++ core.
We've implemented the integration so you can read Lance datasets into pyarrow
like:
```python
import lance
import pyarrow.dataset as ds
ds.dataset(uri, format=lance.LanceFileFormat())
```
Would it be possible to create a file format registry? Something like:
```python
ds.register_file_format(
ext='lance',
format=lance.LanceFileFormat(),
dataset=lance.FileSystemDataset
)
```
which would allow `ds.dataset('/my/dataset.lance')` to execute successfully.
The optional third argument would help expose format-specific
optimizations. E.g., Lance has much better random access performance, so pushing
limit/offset parameters down to Lance allows for much faster paging, especially
over deeply nested data and/or image blobs.
Thanks!
> [Python] registering new data formats
> -------------------------------------
>
> Key: ARROW-18163
> URL: https://issues.apache.org/jira/browse/ARROW-18163
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Chang She
> Priority: Major