Hi Richard,

I think the original intent of `file_path` was to allow storing columns in other data files. Parquet has never really had support for this, though, so as far as I know the field is ignored. I see no problem with setting it to the same file, but it is probably more reliable to leave it unset, so that your metadata doesn't become incorrect when you rename or move the file.
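To make the rename concern concrete, here is a rough sketch of how a reader that did honor `file_path` might resolve it. It assumes the semantics I understand the format definition to describe: an unset value means "same file as the metadata", and a set value is relative to the file holding the metadata. The helper `resolve_chunk_location` and the file names are hypothetical, not part of any Parquet or Arrow API.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_chunk_location(metadata_file: str, file_path: Optional[str]) -> str:
    """Return the file assumed to hold a column chunk's data.

    Hypothetical helper: an unset file_path means the data lives in the
    same file as the metadata; otherwise the path is taken relative to
    the directory of the file that holds the metadata.
    """
    if not file_path:
        return metadata_file
    return str(PurePosixPath(metadata_file).parent / file_path)


# Footer of a data file that was written as data/part-0.parquet and later
# renamed to data/renamed.parquet.
renamed = "data/renamed.parquet"

# Unset file_path: resolution follows the file wherever it goes.
print(resolve_chunk_location(renamed, None))

# file_path baked in at write time: it still names the old file.
print(resolve_chunk_location(renamed, "part-0.parquet"))

# A separate _metadata file is the case where setting file_path is needed.
print(resolve_chunk_location("data/_metadata", "part-0.parquet"))
```

With the value unset, the footer stays correct after the rename; with it set, the footer points at a file that no longer exists.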
rb

On Wed, May 22, 2019 at 8:33 PM Richard Zamora <[email protected]> wrote:

> I’d like to solicit some feedback on the use of the `file_path` attribute
> for ColumnChunk metadata in Parquet. How exactly is this attribute used in
> practice for both single-file and distributed datasets?
>
> More specifically: Is it bad form to set the `file_path` value in footer
> metadata when the data is stored in the same file? Should the value only
> be set in the `_metadata` file, or in cases where the actual column-chunk
> data is stored in a different location? My intuition is that the answer to
> both of these questions is “yes,” but any feedback/details from people
> with strong parquet experience is very welcome :)
>
> Note that the context for these questions is an ongoing discussion about
> the necessary metadata API in `arrow.parquet` (e.g.
> https://github.com/apache/arrow/pull/4361 and
> https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16845670#comment-16845670
> )
>
> Thanks for your help!
> -Rick

--
Ryan Blue
Software Engineer
Netflix
