[GitHub] [arrow-rs] alamb commented on a change in pull request #797: Add Parquet writer example to docs

GitBox Sun, 26 Sep 2021 03:32:05 -0700


alamb commented on a change in pull request #797:
URL: https://github.com/apache/arrow-rs/pull/797#discussion_r715711686




##########
File path: parquet/src/arrow/mod.rs
##########
@@ -79,8 +52,57 @@
 //!     writer.write(&batch).expect("Writing batch");
 //! }
 //! writer.close().unwrap();
+//! ```
+
+//! `WriterProperties` can be used to set several configuration options
+//! ```rust, no_run
+//! use parquet::basic::{Compression, Encoding};
+//! use parquet::file::properties::{WriterProperties, WriterVersion};
+//! // File compression
+//! let props = WriterProperties::builder()
+//!     .set_compression(Compression::SNAPPY)
+//!     .build();
+//! // Max row group size
+//! let props = WriterProperties::builder()
+//!     .set_max_row_group_size(100)
+//!     .build();
+//! // File encoding
+//! let props = WriterProperties::builder()
+//!     .set_encoding(Encoding::RLE)
+//!     .build();
+//! // Parquet Version
+//! let props = WriterProperties::builder()
+//!     .set_writer_version(WriterVersion::PARQUET_1_0)
+//!     .build();
+//! ```
+//!
+//! # Example of reading parquet file into arrow record batch
+//!
+//! ```rust, no_run
+//! use arrow::record_batch::RecordBatchReader;
+//! use parquet::file::reader::SerializedFileReader;
+//! use parquet::arrow::{ParquetFileArrowReader, ArrowReader};
+//! use std::sync::Arc;
+//! use std::fs::File;
 //!
+//! let file = File::open("data.parquet").unwrap();
+//! let file_reader = SerializedFileReader::new(file).unwrap();

Review comment:
       > Perhaps I am missing something. My understanding is that part of the 
code in the reader example is meant to demonstrate reading an on disk parquet 
file - hence the need to use SerializedFileReader. Is this understanding 
correct?
   
   I am probably confused. I was imagining that the example for 
~`SerializedFileReader`~ `ArrowReader` would demonstrate reading a parquet 
file and that a (separate) example for ~`SerializedFileWriter`~ `ArrowWriter` 
would demonstrate writing a `RecordBatch` (created somehow) to a file (as I 
think that is the common use case).
   
   > Assuming that's the case, can you just confirm that `try_from_iter` is the 
preferred approach to creating a record batch over `try_new`?
   
   I don't think one is preferred over the other. I find the code to create 
`RecordBatch`es with `try_from_iter` is slightly shorter, but they both do the 
same thing, so I think either is fine.




