Chang She created ARROW-18163:
---------------------------------

             Summary: [python] registering new data formats
                 Key: ARROW-18163
                 URL: https://issues.apache.org/jira/browse/ARROW-18163
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 9.0.0
            Reporter: Chang She

Context: we're creating a new data format for computer vision (https://github.com/eto-ai/lance) with a C++ core. We've implemented the integration so you can read Lance datasets into pyarrow like:

```python
import lance
import pyarrow.dataset as ds

ds.dataset(uri, format=lance.LanceFileFormat())
```

Would it be possible to create a file format registry? Something like:

```python
ds.register_file_format(
    ext='lance',
    format=lance.LanceFileFormat(),
    dataset=lance.FileSystemDataset
)
```

which would enable `ds.dataset('/my/dataset.lance')` to execute successfully.

The optional third argument would help expose format-specific optimizations. E.g., Lance has much better random access performance, so pushing limit/offset parameters down to Lance allows for much faster paging, especially over deeply nested data and/or image blobs.

Thanks!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
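
To make the request concrete, here is a minimal pure-Python sketch of how an extension-keyed registry could behave. Nothing here is existing pyarrow API: `register_file_format` mirrors the name proposed above, while `lookup_file_format`, `open_dataset`, and the `fallback` callable (standing in for `pyarrow.dataset.dataset`) are hypothetical names chosen for illustration. How the optional `dataset` factory would actually be invoked is also an assumption; the sketch simply calls it with the URI.

```python
import os
from typing import Any, Callable, Dict, NamedTuple, Optional


class _Entry(NamedTuple):
    file_format: Any                     # e.g. lance.LanceFileFormat()
    dataset_factory: Optional[Callable]  # e.g. lance.FileSystemDataset


# Hypothetical module-level registry, keyed by lowercase extension without the dot.
_REGISTRY: Dict[str, _Entry] = {}


def register_file_format(ext, format, dataset=None):
    """Associate a file extension with a format object (hypothetical API).

    The `format` keyword shadows the builtin on purpose, to match the
    call shape proposed in the issue.
    """
    _REGISTRY[ext.lower().lstrip(".")] = _Entry(format, dataset)


def lookup_file_format(uri):
    """Return the registered entry for the URI's extension, or None."""
    ext = os.path.splitext(uri.rstrip("/"))[1].lstrip(".").lower()
    return _REGISTRY.get(ext)


def open_dataset(uri, fallback):
    """Dispatch on the URI's extension.

    `fallback` stands in for pyarrow.dataset.dataset and is assumed to
    accept (uri, format=...).
    """
    entry = lookup_file_format(uri)
    if entry is None:
        # Unknown extension: keep the default behaviour.
        return fallback(uri)
    if entry.dataset_factory is not None:
        # Assumption: a format-specific dataset class is built from the URI.
        return entry.dataset_factory(uri)
    return fallback(uri, format=entry.file_format)
```

With this shape, `register_file_format('lance', lance.LanceFileFormat(), lance.FileSystemDataset)` would let a path ending in `.lance` resolve to the Lance-specific dataset class, while unregistered extensions fall through unchanged. The sketch only inspects the trailing extension of the path and does not handle URIs with query strings or compound extensions.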