I submitted a PR with a possible solution: it permits setting the file_path field of every column chunk to a particular value after a file has already been written. That way the mutated FileMetaData can be written to a separate _metadata file, while the footer of the data file itself stays as it was.
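As a rough illustration of the idea behind the PR linked below, here is a toy model in plain Python. It is not the Arrow API: ColumnChunkMeta, RowGroupMeta, FooterMeta, with_file_path and combine_footers are hypothetical names made up for this sketch, and the file names are invented. The toy also returns copies instead of mutating in place, just to make it obvious that the per-file footers keep file_path unset.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Hypothetical stand-in for Parquet ColumnChunk metadata."""
    column: str
    # None means "the column data lives in the same file as this footer".
    file_path: Optional[str] = None


@dataclass(frozen=True)
class RowGroupMeta:
    columns: Tuple[ColumnChunkMeta, ...]


@dataclass(frozen=True)
class FooterMeta:
    """Hypothetical stand-in for a file-level FileMetaData object."""
    row_groups: Tuple[RowGroupMeta, ...]

    def with_file_path(self, path: str) -> "FooterMeta":
        # Return a copy in which every column chunk points at `path`.
        return FooterMeta(tuple(
            RowGroupMeta(tuple(replace(c, file_path=path) for c in rg.columns))
            for rg in self.row_groups
        ))


def combine_footers(footers: Dict[str, FooterMeta]) -> FooterMeta:
    """Build _metadata-style metadata from per-file footers.

    Keys are paths relative to the dataset root (an assumption of this sketch).
    """
    combined = []
    for relative_path, footer in footers.items():
        combined.extend(footer.with_file_path(relative_path).row_groups)
    return FooterMeta(tuple(combined))


# Two data files, each with one row group and file_path left unset.
footer_a = FooterMeta((RowGroupMeta((ColumnChunkMeta("x"), ColumnChunkMeta("y"))),))
footer_b = FooterMeta((RowGroupMeta((ColumnChunkMeta("x"), ColumnChunkMeta("y"))),))

dataset_meta = combine_footers({
    "part.0.parquet": footer_a,
    "part.1.parquet": footer_b,
})
```

In this model only dataset_meta carries paths; footer_a and footer_b are unchanged, so renaming or moving a data file cannot leave its own footer pointing at a stale location.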
https://github.com/apache/arrow/pull/4386

On Fri, May 24, 2019 at 12:22 PM Ryan Blue <[email protected]> wrote:
>
> Hi Richard,
>
> I think the original intent of file_path was to allow storing columns in
> other data files. But Parquet has never really had support for this so the
> column is ignored as far as I know. I see no problem with setting it to the
> same file, but it is probably more reliable not to so that your metadata
> isn't incorrect when you rename or move the file.
>
> rb
>
> On Wed, May 22, 2019 at 8:33 PM Richard Zamora <[email protected]> wrote:
>
> > I’d like to solicit some feedback on the use of the `file_path` attribute
> > for ColumnChunk metadata in Parquet. How exactly is this attribute used in
> > practice for both single-file and distributed datasets?
> >
> > More specifically: Is it bad form to set the `file_path` value in footer
> > metadata when the data is stored in the same file? Should the value only
> > be set in the `_metadata` file, or in cases where the actual column-chunk
> > data is stored in a different location? My intuition is that the answer to
> > both of these questions is “yes,” but any feedback/details from people
> > with strong parquet experience is very welcome :)
> >
> > Note that the context for these questions is an ongoing discussion about
> > the necessary metadata API in `arrow.parquet` (e.g.
> > https://github.com/apache/arrow/pull/4361 and
> > https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16845670#comment-16845670
> > )
> >
> > Thanks for your help!
> > -Rick
>
> --
> Ryan Blue
> Software Engineer
> Netflix
