swyatt7 commented on issue #34403:
URL: https://github.com/apache/arrow/issues/34403#issuecomment-1453038670

   Hello and thanks for the prompt reply.
   
   I implemented what you suggested and was able to get it to work :). I did run into a few tweaks that need to be made and that aren't well documented: you have to set each partitioned file's metadata file path with `.set_file_path()` before appending it to the `metadata_collector`.
   
   ```
   import os

   import pyarrow.dataset as ps
   import pyarrow.parquet as pq

   root_dir = '/path/to/partitioned/dataset'
   dataset = ps.dataset(root_dir, partitioning='hive', format='parquet')
   metadata_collector = []

   for f in dataset.files:
       md = pq.read_metadata(f)
       # Store the path relative to the dataset root before collecting it.
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)

   _meta_data_path = os.path.join(root_dir, '_metadata')
   _common_metadata_path = os.path.join(root_dir, '_common_metadata')

   pq.write_metadata(dataset.schema, _meta_data_path,
                     metadata_collector=metadata_collector)
   pq.write_metadata(dataset.schema, _common_metadata_path)
   ```
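
   For anyone who finds this later, here's a minimal sketch of how the written `_metadata` file can then be used to open the dataset without listing the individual files again, assuming `pyarrow.dataset.parquet_dataset` is available in your pyarrow version (the `ps` alias and paths are the same as above):

   ```
   # Sketch: open the dataset directly from the `_metadata` file so the
   # row-group information is read from a single footer instead of every file.
   dataset_from_meta = ps.parquet_dataset(_meta_data_path, partitioning='hive')
   table = dataset_from_meta.to_table()
   print(table.num_rows)
   ```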
   
   Another issue I ran into: if a partition contains an empty parquet file (with only the header information), `append_row_groups` throws an error saying that the schemas don't match, even though they do match; one of the files is simply empty. Once I got rid of those files, the metadata was written correctly.
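
   In case it helps anyone else, a possible workaround is to skip those files while collecting the metadata; a rough sketch, assuming the empty files report `num_rows == 0` in their footers:

   ```
   for f in dataset.files:
       md = pq.read_metadata(f)
       if md.num_rows == 0:
           # Placeholder file with only schema information; including it makes
           # append_row_groups complain about mismatched schemas.
           continue
       md.set_file_path(f.split(f'{root_dir}/')[1])
       metadata_collector.append(md)
   ```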
   
   Again, thanks for the help. I'll mark this as closed.

