mapleFU commented on PR #38581:
URL: https://github.com/apache/arrow/pull/38581#issuecomment-1793494891

   Sorry for the late reply; I was playing games :-)
   
   The schema is an important part of using Parquet, because **Parquet only stores 
leaf nodes**. You can also refer to the code here: 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h#L142-L153
   
   ```
     /// \brief Read column as a whole into a chunked array.
     ///
     /// The index i refers the index of the top level schema field, which may
     /// be nested or flat - e.g.
     ///
     /// 0 foo.bar
     ///   foo.bar.baz
     ///   foo.qux
     /// 1 foo2
     /// 2 foo3
     ///
     /// i=0 will read the entire foo struct, i=1 the foo2 primitive column etc
   ```
   
   So, in the case above, the "real" Parquet schema would be:
   
   ```
   foo.bar.baz
   foo.qux
   foo2
   foo3
   ```
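   
   To make the flattening concrete, here is a minimal sketch that walks a nested schema and keeps only the leaf paths. `Node`, `CollectLeafPaths`, `ExampleSchema` and `LeafPaths` are hypothetical names made up for illustration; they are not part of the Parquet or Arrow API.
   
   ```cpp
   #include <string>
   #include <vector>

   // Hypothetical schema node for illustration only (not an Arrow/Parquet type).
   struct Node {
     std::string name;
     std::vector<Node> children;  // empty means this node is a leaf column
   };

   // Append the dotted path of every leaf under `node`, in depth-first order.
   void CollectLeafPaths(const Node& node, const std::string& prefix,
                         std::vector<std::string>* out) {
     const std::string path = prefix.empty() ? node.name : prefix + "." + node.name;
     if (node.children.empty()) {
       out->push_back(path);
       return;
     }
     for (const Node& child : node.children) {
       CollectLeafPaths(child, path, out);
     }
   }

   // The example from the comment above: foo{bar{baz}, qux}, foo2, foo3.
   std::vector<Node> ExampleSchema() {
     return {
         Node{"foo", {Node{"bar", {Node{"baz", {}}}}, Node{"qux", {}}}},
         Node{"foo2", {}},
         Node{"foo3", {}},
     };
   }

   // Flatten all top-level fields into the list of leaf column paths.
   std::vector<std::string> LeafPaths(const std::vector<Node>& fields) {
     std::vector<std::string> out;
     for (const Node& field : fields) {
       CollectLeafPaths(field, "", &out);
     }
     return out;
   }
   ```
   
   Calling `LeafPaths(ExampleSchema())` gives the four leaf paths listed above, in the same order.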
   
   You can also read the comments in 
https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h . They 
would help a lot.
   
   `ArrowColumnWriter` needs to map the Arrow-level writer (such as the writer for `foo`) 
to the underlying leaf writers (the writers for `foo.bar.baz` and `foo.qux`). In this 
case, the `foo` writer writes 2 leaf nodes, so the next 
`ArrowColumnWriter` should accept `2` here, i.e. the index of the next leaf column.
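   
   The bookkeeping can be sketched as a running leaf offset: each top-level field starts at the current offset, then advances it by the number of leaves it owns. This is only a sketch under assumptions; `Field`, `CountLeaves`, `StartingLeafIndices` and `ExampleFields` are hypothetical helpers, not the actual `ArrowColumnWriter` interface.
   
   ```cpp
   #include <string>
   #include <vector>

   // Hypothetical schema field for illustration only (not an Arrow/Parquet type).
   struct Field {
     std::string name;
     std::vector<Field> children;  // empty means this field is a leaf column
   };

   // Number of leaf columns stored for `field` (a primitive field counts as 1).
   int CountLeaves(const Field& field) {
     if (field.children.empty()) return 1;
     int count = 0;
     for (const Field& child : field.children) {
       count += CountLeaves(child);
     }
     return count;
   }

   // For each top-level field, the index of the first leaf column it writes.
   std::vector<int> StartingLeafIndices(const std::vector<Field>& top_level) {
     std::vector<int> starts;
     int next_leaf = 0;
     for (const Field& field : top_level) {
       starts.push_back(next_leaf);
       next_leaf += CountLeaves(field);  // skip past the leaves this field owns
     }
     return starts;
   }

   // The example schema from the comment: foo{bar{baz}, qux}, foo2, foo3.
   std::vector<Field> ExampleFields() {
     return {
         Field{"foo", {Field{"bar", {Field{"baz", {}}}}, Field{"qux", {}}}},
         Field{"foo2", {}},
         Field{"foo3", {}},
     };
   }
   ```
   
   For the example schema, `StartingLeafIndices(ExampleFields())` yields `0`, `2`, `3`: `foo` covers leaves 0 and 1, so the writer for `foo2` starts at leaf `2`.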

