[ 
https://issues.apache.org/jira/browse/PARQUET-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339290#comment-17339290
 ] 

Weston Pace commented on PARQUET-1798:
--------------------------------------

Is there something wrong with the generation algorithm?  Or is it more that 
there is no way to override the generation algorithm to use existing field IDs?

For example, it would be pretty straightforward to use the existing algorithm 
when going from Arrow -> Parquet if no field_id exists in the Arrow metadata. 
Alternatively, the field_id algorithm could be updated so that a column name 
(or column path for nested columns) always generates the same ID (unless there 
are multiple columns with the same name).
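
To make the second option concrete, here is a rough sketch of a path-based ID. 
It is not what parquet-cpp does today; field_id_from_path is a hypothetical 
helper, and hashing the path with SHA-256 and truncating to 31 bits is only an 
assumption chosen for illustration. Two columns with the same path get the same 
ID, and distinct paths can in principle collide after truncation.

```python
import hashlib


def field_id_from_path(path):
    """Hypothetical helper: map a column path to a stable, non-negative ID.

    `path` is a sequence of column names from the root to the leaf,
    e.g. ("address", "zip") for a nested column.
    """
    # Join with NUL so ("a.b",) and ("a", "b") do not produce the same key.
    key = "\0".join(path).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Keep 31 bits so the result fits in a signed 32-bit integer.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# The same path always yields the same ID, regardless of column order
# in the schema or of which process computes it.
top_level_id = field_id_from_path(["id"])
nested_id = field_id_from_path(["address", "zip"])
```

Unlike a depth-first counter, this does not change when columns are added or 
reordered, at the cost of needing some policy for duplicates and collisions.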

Does the following make sense (independent of which generation algorithm we 
use)?

Parquet -> Arrow

 * If field_id is set, use that

 * If field_id is not set, use the generation algorithm

Arrow -> Parquet

 * If field_id is set, use that

 * If field_id is not set, use the generation algorithm

That should be round-trippable, since all reads and writes after the first 
reuse the existing field_ids.
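
A minimal sketch of that rule, using plain Python objects rather than the real 
Arrow or Parquet schema classes. Field and resolve_field_ids are hypothetical 
names, and the "next unused integer" generator is just a stand-in for whichever 
generation algorithm ends up being used. The same function models both 
directions, since the proposed rule is identical for each.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    field_id: Optional[int] = None


def resolve_field_ids(fields: List[Field]) -> List[Field]:
    """Keep every field_id that is set; generate one only where it is missing."""
    used = {f.field_id for f in fields if f.field_id is not None}
    next_id = 0
    resolved = []
    for f in fields:
        if f.field_id is not None:
            # field_id is set: use that.
            resolved.append(f)
            continue
        # field_id is not set: stand-in generator picks the next unused integer.
        while next_id in used:
            next_id += 1
        used.add(next_id)
        resolved.append(replace(f, field_id=next_id))
    return resolved


# First conversion fills in the missing ID; later conversions change nothing.
schema = [Field("a"), Field("b", field_id=7), Field("c", field_id=0)]
first_pass = resolve_field_ids(schema)
second_pass = resolve_field_ids(first_pass)
```

Here second_pass equals first_pass, which is the round-trip property: once 
every field carries an ID, the generator is never consulted again.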

> [C++] Review logic around automatic assignment of field_id's
> ------------------------------------------------------------
>
>                 Key: PARQUET-1798
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1798
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: cpp-4.0.0
>
>
> At schema deserialization (from Thrift) time, we are assigning a default 
> field_id to the Schema node based on a depth-first ordering of nodes. This 
> means that a round trip (load, then write) will cause field_id's to be 
> written that weren't there before. I'm not sure this is the desired behavior.
> We should examine this in more detail and possibly change it. See also 
> discussion in ARROW-7080 https://github.com/apache/arrow/pull/6408



