mapleFU commented on PR #38581: URL: https://github.com/apache/arrow/pull/38581#issuecomment-1793494891
Sorry for the late reply, I was playing games :-)

The schema is an important part of using Parquet, because **Parquet only stores leaf nodes**. You can also refer to the code here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h#L142-L153

```
/// \brief Read column as a whole into a chunked array.
///
/// The index i refers the index of the top level schema field, which may
/// be nested or flat - e.g.
///
/// 0 foo.bar
///   foo.bar.baz
///   foo.qux
/// 1 foo2
/// 2 foo3
///
/// i=0 will read the entire foo struct, i=1 the foo2 primitive column etc
```

So, in the case above, the "real" Parquet schema would be:

```
foo.bar.baz
foo.qux
foo2
foo3
```

You can also read the other comments in https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h . They would help a lot.

`ArrowColumnWriter` needs to map the Arrow writer (like the writer for `foo`) to the underlying leaf writers (the writers for `foo.bar.baz` and `foo.qux`). In this case, the `foo` writer will write 2 leaf nodes, so the next `ArrowColumnWriter` should accept `2` here, since leaf indices 0 and 1 already belong to `foo`.
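To make the leaf counting concrete, here is a minimal sketch of how the leaf index for each top-level field could be derived from the example schema. It does not use the Arrow or Parquet APIs: `Node`, `CountLeaves`, `FirstLeafIndex`, and `ExampleSchema` are hypothetical names made up for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node; NOT an Arrow or Parquet type.
// A node without children is a leaf, i.e. a column Parquet actually stores.
struct Node {
  std::string name;
  std::vector<Node> children;
};

// Number of leaf columns underneath (or equal to) `node`.
std::size_t CountLeaves(const Node& node) {
  if (node.children.empty()) return 1;
  std::size_t total = 0;
  for (const Node& child : node.children) total += CountLeaves(child);
  return total;
}

// Index of the first leaf column belonging to top-level field `i`:
// the sum of the leaf counts of all top-level fields before it.
std::size_t FirstLeafIndex(const std::vector<Node>& top_level, std::size_t i) {
  assert(i < top_level.size());
  std::size_t offset = 0;
  for (std::size_t k = 0; k < i; ++k) offset += CountLeaves(top_level[k]);
  return offset;
}

// The schema from the reader.h comment: foo{bar{baz}, qux}, foo2, foo3.
std::vector<Node> ExampleSchema() {
  Node baz{"baz", {}};
  Node bar{"bar", {baz}};
  Node qux{"qux", {}};
  Node foo{"foo", {bar, qux}};
  return {foo, Node{"foo2", {}}, Node{"foo3", {}}};
}
```

With this schema, `foo` covers two leaves (`foo.bar.baz` and `foo.qux`), so `FirstLeafIndex` gives 0 for `foo`, 2 for `foo2`, and 3 for `foo3`. That `2` is the value the writer after `foo` has to start from.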
