Re: [PR] Document Arrow <--> Parquet schema conversion better [arrow-rs]

via GitHub Wed, 07 May 2025 07:57:19 -0700


alamb commented on code in PR #7479:
URL: https://github.com/apache/arrow-rs/pull/7479#discussion_r2077838752



##########
parquet/src/arrow/mod.rs:
##########
@@ -15,13 +15,41 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! API for reading/writing
-//! Arrow [RecordBatch](arrow_array::RecordBatch)es and
-//! [Array](arrow_array::Array)s to/from Parquet Files.
+//! API for reading/writing Arrow [`RecordBatch`]es and [`Array`]s to/from
+//! Parquet Files.
 //!
-//! See the [crate-level documentation](crate) for more details.
+//! See the [crate-level documentation](crate) for more details on other APIs
 //!
-//! # Example of writing Arrow record batch to Parquet file
+//! # Schema Conversion
+//!
+//! These APIs ensure that data in Arrow [`RecordBatch`]es written to Parquet are
+//! read back as [`RecordBatch`]es with the exact same types and values.
+//!
+//! Parquet and Arrow have different type systems, and there is not
+//! always a one to one mapping between the systems. For example, data
+//! stored as a Parquet [`BYTE_ARRAY`] can be read as either an Arrow
+//! [`BinaryViewArray`] or [`BinaryArray`].
+//!
+//! To recover the original Arrow types, the writers in this module add
+//! metadata in the [`ARROW_SCHEMA_META_KEY`] key to record the original Arrow

Review Comment:
   I was reminded on https://github.com/apache/arrow-rs/pull/5626 that this
   metadata uses the same format as arrow-cpp, which is an important caveat.
   I will add this to the doc.
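
   To make the mechanism in the quoted docs concrete, here is a minimal toy sketch using only the standard library, not the parquet crate's actual implementation. The `ArrowType` enum and the `write_metadata` and `infer_type` helpers are hypothetical names invented for illustration, and storing a plain type name as the hint is an assumption made for brevity: the real writers store an encoded Arrow schema under this key. Only the key string itself matches the crate's `ARROW_SCHEMA_META_KEY`.

```rust
use std::collections::HashMap;

/// Same string as the parquet crate's `ARROW_SCHEMA_META_KEY` constant.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical stand-in for the two Arrow types that can both be
/// stored as a Parquet BYTE_ARRAY column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrowType {
    Binary,
    BinaryView,
}

impl ArrowType {
    fn name(self) -> &'static str {
        match self {
            ArrowType::Binary => "Binary",
            ArrowType::BinaryView => "BinaryView",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "Binary" => Some(ArrowType::Binary),
            "BinaryView" => Some(ArrowType::BinaryView),
            _ => None,
        }
    }
}

/// Toy writer side: the physical type is BYTE_ARRAY either way, so the
/// original Arrow type is recorded in the file's key-value metadata.
/// (Assumption: a plain type name; the real value is an encoded schema.)
fn write_metadata(original: ArrowType) -> HashMap<String, String> {
    let mut key_value = HashMap::new();
    key_value.insert(
        ARROW_SCHEMA_META_KEY.to_string(),
        original.name().to_string(),
    );
    key_value
}

/// Toy reader side: prefer the embedded hint, and fall back to a default
/// mapping (assumed here to be `Binary`) when the hint is absent or
/// unrecognized, for example in a file produced by a different writer.
fn infer_type(key_value: &HashMap<String, String>) -> ArrowType {
    key_value
        .get(ARROW_SCHEMA_META_KEY)
        .and_then(|name| ArrowType::from_name(name))
        .unwrap_or(ArrowType::Binary)
}
```

   In this sketch, `infer_type(&write_metadata(ArrowType::BinaryView))` recovers `BinaryView`, while a file with no hint falls back to `Binary`. That fallback is the ambiguity the embedded schema is meant to resolve.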


