mrbrahman opened a new issue, #40958: URL: https://github.com/apache/arrow/issues/40958
### Describe the enhancement requested

Hi,

One of the design principles of Parquet, from its GitHub page, is '[Separating metadata and column data](https://github.com/apache/parquet-format/tree/master?tab=readme-ov-file#separating-metadata-and-column-data)':

> # Separating metadata and column data.
>
> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.

In order to achieve 'columns in different files', we need to:

1. Ensure each file has the same number of row groups
2. Ensure each corresponding row group of each file has the same number of rows
3. Grab the metadata from each file, '**zip/attach them vertically**', and write out the new metadata file
4. Feed this metadata while reading the table

The Arrow APIs provide nearly everything needed to achieve this, except for the bolded portion in point 3 above. This ticket requests the addition of a new API to 'join' the metadata from two files. For example:

~~~python
import pyarrow.parquet as pq

m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')

m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
# needs to ensure same number of row groups, and same number of rows within each row group
m = m1.zip(m2)

# m will now have col1, col2, col3, col4, col5, each pointing to the appropriate data file
m.write_metadata('_metadata')
~~~

### Component(s)

C++, Python