[I] API to 'zip' or (vertically) 'attach' parquet metadata [arrow]

via GitHub Tue, 02 Apr 2024 13:35:23 -0700


mrbrahman opened a new issue, #40958:
URL: https://github.com/apache/arrow/issues/40958

### Describe the enhancement requested

Hi,

One of the design principles of Parquet, as stated on its GitHub page, is
'[Separating metadata and column data](https://github.com/apache/parquet-format/tree/master?tab=readme-ov-file#separating-metadata-and-column-data)':

> # Separating metadata and column data.
>
> The format is explicitly designed to separate the metadata from the data.
> This allows splitting columns into multiple files, as well as having a single
> metadata file reference multiple parquet files.

To store the columns of one table in different files, we need to:

1. Ensure each file has the same number of row groups
2. Ensure each corresponding row group in each file has the same rows
3. Grab the metadata from each file, '**zip/attach them vertically**', and write out the new metadata file
4. Feed this metadata file to the reader when reading the table

The Arrow APIs provide nearly everything to achieve this, except for the
bolded portion in point 3 above.
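
As a rough sketch of the checks in points 1 and 2, something like the following could compare the row-group layout of two files. The helper name `row_groups_align` is made up for illustration. It only assumes objects that expose `num_row_groups` and `row_group(i).num_rows`, as `pyarrow.parquet.FileMetaData` does. Note that it can only compare row counts; whether the rows themselves correspond cannot be told from the metadata alone.

```python
def row_groups_align(m1, m2):
    """Return True if both metadata objects have the same row-group layout.

    m1 and m2 are expected to behave like pyarrow.parquet.FileMetaData:
    a num_row_groups attribute and a row_group(i) method whose result
    has a num_rows attribute. (Hypothetical helper, not part of pyarrow.)
    """
    # Point 1: same number of row groups in each file.
    if m1.num_row_groups != m2.num_row_groups:
        return False
    # Point 2 (partially): same number of rows in each corresponding row group.
    return all(
        m1.row_group(i).num_rows == m2.row_group(i).num_rows
        for i in range(m1.num_row_groups)
    )


# Possible usage with pyarrow (file names are placeholders):
#   import pyarrow.parquet as pq
#   ok = row_groups_align(pq.read_metadata('file1.parquet'),
#                         pq.read_metadata('file2.parquet'))
```

A 'zip' API would presumably need to run an equivalent check internally before combining the column metadata.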

This ticket is requesting the addition of a new API to be able to 'join'
metadata from 2 files.

For example:

~~~python
import pyarrow.parquet as pq

# Say file1.parquet has columns: col1, col2, col3
m1 = pq.read_metadata('file1.parquet')
m1.set_file_path('file1.parquet')

# Say file2.parquet has columns: col4, col5
m2 = pq.read_metadata('file2.parquet')
m2.set_file_path('file2.parquet')

# Requesting this new 'zip' API. It needs to ensure both files have the same
# number of row groups, and the same number of rows within each row group.
m = m1.zip(m2)

# m will now have col1, col2, col3, col4, col5, each pointing to the
# appropriate data file
m.write_metadata_file('_metadata')
~~~

### Component(s)

C++, Python

