[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845947#comment-16845947 ]
Martin Durant commented on ARROW-5349:
--------------------------------------
> in which this would be wrong if it is inside the file itself
Agreed, the path would be wrong. Even in the simpler case above, you could say
it was wrong based on the thrift template, and that could make sense, since a
set path arguably implies opening a new file.
> [Python/C++] Provide a way to specify the file path in parquet
> ColumnChunkMetaData
> ----------------------------------------------------------------------------------
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now
> possible to collect the file metadata while writing different files (how to
> write that metadata was not yet addressed; see the original issue ARROW-1983).
> However, the {{file_path}} information in the ColumnChunkMetaData object is
> currently not set. This is, I think, expected and correct for the metadata
> included within a single file, but to use the metadata in the combined
> dataset `_metadata` file, a file path needs to be set.
> So if you want to use this metadata for a partitioned dataset, there needs to
> be a way to specify this file path.
> Ideas I am currently considering: either we could specify a file path to be
> used when writing, or we could expose the `set_file_path` method on the Python
> side so you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]
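
The second idea in the description above (update the collected metadata with a file path, then combine it) can be sketched roughly as follows. This is a toy model in plain Python, not the pyarrow API: `ColumnChunk`, `FileMeta`, `with_file_path` and `combine` are hypothetical stand-ins invented for illustration, and the example paths are made up. It assumes the stored path should be relative to the dataset root where `_metadata` would live.

```python
from dataclasses import dataclass, replace
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical stand-in for a Parquet ColumnChunkMetaData entry."""
    column: str
    file_path: Optional[str] = None  # unset inside a single data file


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for the metadata collected from one written file."""
    row_groups: Tuple[Tuple[ColumnChunk, ...], ...]


def with_file_path(meta: FileMeta, path: str) -> FileMeta:
    """Return a copy of `meta` whose column chunks all point at `path`."""
    return FileMeta(
        row_groups=tuple(
            tuple(replace(chunk, file_path=path) for chunk in group)
            for group in meta.row_groups
        )
    )


def combine(root: str, collected: List[Tuple[str, FileMeta]]) -> FileMeta:
    """Merge per-file metadata into one `_metadata`-style object.

    Each entry in `collected` is (path of the data file, its metadata).
    Paths are stored relative to `root`, the dataset directory (an assumption
    of this sketch).
    """
    merged = []
    for data_path, meta in collected:
        rel = PurePosixPath(data_path).relative_to(root).as_posix()
        merged.extend(with_file_path(meta, rel).row_groups)
    return FileMeta(row_groups=tuple(merged))
```

Returning an updated copy, rather than mutating in place, keeps the metadata written into each individual file free of any path, which matches the point above that an unset path is the expected state inside a single file.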