[
https://issues.apache.org/jira/browse/ARROW-18163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Miles Granger updated ARROW-18163:
----------------------------------
Summary: [Python] registering new data formats (was: [python] registering
new data formats)
> [Python] registering new data formats
> -------------------------------------
>
> Key: ARROW-18163
> URL: https://issues.apache.org/jira/browse/ARROW-18163
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 9.0.0
> Reporter: Chang She
> Priority: Major
>
> Context: we're creating a new data format for computer vision
> (https://github.com/eto-ai/lance) with a C++ core.
> We've implemented the integration so you can read Lance datasets into pyarrow
> like:
> ```python
> import lance
> import pyarrow.dataset as ds
> # `uri` is the path or URI of an existing Lance dataset
> ds.dataset(uri, format=lance.LanceFileFormat())
> ```
> Would it be possible to create a file format registry? For example:
> ```python
> ds.register_file_format(
>     ext='lance',
>     format=lance.LanceFileFormat(),
>     dataset=lance.FileSystemDataset
> )
> ```
> This would let `ds.dataset('/my/dataset.lance')` work without passing
> `format` explicitly.
> The optional third argument would help expose format-specific
> optimizations. For example, Lance has much better random access performance,
> so pushing limit/offset parameters down to Lance allows for much faster
> paging, especially over deeply nested data and/or image blobs.
> Thanks!
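To make the request above concrete, here is a minimal, pure-Python sketch of what an extension-keyed registry could look like. Everything in it is an assumption for illustration: `FormatEntry`, `_REGISTRY`, and `lookup_file_format` are hypothetical names, `register_file_format` only mirrors the signature proposed in the description, and none of it reflects an existing pyarrow API.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class FormatEntry:
    """What the hypothetical registry remembers for one file extension."""
    format: Any                                  # e.g. a FileFormat instance
    dataset: Optional[Callable[..., Any]] = None  # optional dataset class


# Hypothetical module-level table mapping extension -> entry.
_REGISTRY: Dict[str, FormatEntry] = {}


def register_file_format(ext, format, dataset=None):
    """Associate an extension with a format and an optional dataset class."""
    _REGISTRY[ext.lower().lstrip(".")] = FormatEntry(format, dataset)


def lookup_file_format(path):
    """Return the entry registered for the path's extension, or None."""
    ext = os.path.splitext(path.rstrip("/"))[1].lstrip(".").lower()
    return _REGISTRY.get(ext)
```

Under these assumptions, a `ds.dataset(path)` call with no explicit `format` could consult `lookup_file_format(path)` first and, when the entry carries a `dataset` class, construct that class instead of the generic one so that format-specific options such as limit/offset can be passed through.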
--
This message was sent by Atlassian Jira
(v8.20.10#820010)