mrbrahman opened a new issue, #40958:
URL: https://github.com/apache/arrow/issues/40958

   ### Describe the enhancement requested
   
   Hi,
   
   One of the design principles listed in the Parquet format's GitHub README is '[Separating metadata and column data](https://github.com/apache/parquet-format/tree/master?tab=readme-ov-file#separating-metadata-and-column-data)':
   
   > # Separating metadata and column data.
   > 
   > The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
   
   To store the columns of one table in different files, we need to:
   
   1. Ensure each file has the same number of row groups
   2. Ensure each corresponding row group in every file contains the same rows
   3. Grab the metadata from each file, '**zip/attach them vertically**', and write out the new metadata file
   4. Supply this combined metadata when reading the table
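
Steps 1 and 2 can be partly checked from the metadata alone. The sketch below is one possible way to do that, not an existing Arrow helper: `row_group_layout` and `check_same_layout` are hypothetical names, and the functions only assume objects shaped like `pyarrow.parquet.FileMetaData` (a `num_row_groups` attribute and a `row_group(i)` method returning something with `num_rows`). Note that it can only compare row counts; it cannot prove that the rows themselves line up.

```python
def row_group_layout(meta):
    """Return the row count of every row group, in file order.

    `meta` is anything shaped like pyarrow.parquet.FileMetaData: it needs a
    `num_row_groups` attribute and a `row_group(i)` method whose result has
    a `num_rows` attribute.
    """
    return [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]


def check_same_layout(meta_a, meta_b):
    """Raise ValueError unless both files have identical row-group layouts."""
    layout_a = row_group_layout(meta_a)
    layout_b = row_group_layout(meta_b)
    # Step 1: same number of row groups.
    if len(layout_a) != len(layout_b):
        raise ValueError(
            f"row-group count differs: {len(layout_a)} vs {len(layout_b)}"
        )
    # Step 2 (counts only): same number of rows in each corresponding group.
    for i, (rows_a, rows_b) in enumerate(zip(layout_a, layout_b)):
        if rows_a != rows_b:
            raise ValueError(f"row group {i}: {rows_a} rows vs {rows_b} rows")
    return layout_a
```

With pyarrow installed, this could be called as `check_same_layout(pq.read_metadata('file1.parquet'), pq.read_metadata('file2.parquet'))`.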
   
   The Arrow APIs provide nearly everything needed to achieve this, except for the bolded portion of point 3 above.
   
   This ticket requests a new API to 'join' the metadata from two files.
   
   For example:
   
   ~~~python
   import pyarrow.parquet as pq

   m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
   m1.set_file_path('file1.parquet')
   
   m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
   m2.set_file_path('file2.parquet')
   
   # Requested new 'zip' API (does not exist yet). It needs to ensure both files
   # have the same number of row groups, and the same number of rows within
   # each row group.
   m = m1.zip(m2)
   
   # m will now have col1, col2, col3, col4, col5, each pointing to the
   # appropriate data file
   m.write_metadata_file('_metadata')
   ~~~
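
To make the intended semantics of the requested 'zip' more concrete, here is a toy model in plain Python. It does not use pyarrow and is not a proposed implementation: `ColumnChunk`, `RowGroup` and `zip_metadata` are hypothetical names invented for illustration, and the model keeps only the pieces of metadata that matter here (rows per row group, plus a file path per column chunk).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    name: str       # column name, e.g. "col1"
    file_path: str  # data file that holds this chunk


@dataclass(frozen=True)
class RowGroup:
    num_rows: int
    columns: tuple  # tuple of ColumnChunk


def zip_metadata(left, right):
    """Attach the columns of `right` next to those of `left`, group by group.

    `left` and `right` are lists of RowGroup. The result has one RowGroup per
    input pair, holding the column chunks of both sides.
    """
    if len(left) != len(right):
        raise ValueError("files have a different number of row groups")
    merged = []
    for i, (a, b) in enumerate(zip(left, right)):
        if a.num_rows != b.num_rows:
            raise ValueError(f"row group {i} has a different number of rows")
        clash = {c.name for c in a.columns} & {c.name for c in b.columns}
        if clash:
            raise ValueError(f"duplicate column names: {sorted(clash)}")
        merged.append(RowGroup(a.num_rows, a.columns + b.columns))
    return merged


# Two files with one row group of 100 rows each, split by column.
f1 = [RowGroup(100, (ColumnChunk("col1", "file1.parquet"),
                     ColumnChunk("col2", "file1.parquet")))]
f2 = [RowGroup(100, (ColumnChunk("col4", "file2.parquet"),))]
combined = zip_metadata(f1, f2)
```

In this model, `combined[0]` lists `col1`, `col2` and `col4`, and each chunk still records which data file it lives in, which is the property the combined `_metadata` file would need.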
   
   
   ### Component(s)
   
   C++, Python


