Re: [PR] GH-39560: [C++][Parquet] Add integration test for BYTE_STREAM_SPLIT [arrow]

via GitHub Thu, 11 Jan 2024 10:19:14 -0800


pitrou commented on code in PR #39570:
URL: https://github.com/apache/arrow/pull/39570#discussion_r1449237601



##########
cpp/src/parquet/reader_test.cc:
##########
@@ -120,11 +120,27 @@ std::string concatenated_gzip_members() {
   return data_file("concatenated_gzip_members.parquet");
 }
 
+std::string byte_stream_split() { return data_file("byte_stream_split.zstd.parquet"); }
+
+template <typename DType, typename ValueType = typename DType::c_type>
+std::vector<ValueType> ReadColumnValues(ParquetFileReader* file_reader, int row_group,
+                                        int column, int64_t expected_values_read) {
+  auto column_reader = checked_pointer_cast<TypedColumnReader<DType>>(
+      file_reader->RowGroup(row_group)->Column(column));
+  std::vector<ValueType> values(expected_values_read);
+  int64_t values_read;
+  auto levels_read = column_reader->ReadBatch(expected_values_read, nullptr, nullptr,

Review Comment:
   I see. Well, it may have definition levels but it certainly has zero nulls, given how it was generated :-)
   
   https://github.com/apache/parquet-testing/blob/master/data/README.md#byte-stream-split
   
   If I generate the data directly, I get the same values too:
   ```python
   >>> import numpy as np
   ...: import pyarrow as pa
   ...: np.random.seed(0)
   ...: table = pa.Table.from_pydict({
   ...:   'f32': np.random.normal(size=300).astype(np.float32),
   ...:   'f64': np.random.normal(size=300).astype(np.float64),
   ...: })
   >>> table
   pyarrow.Table
   f32: float
   f64: double
   ----
   f32: [[1.7640524,0.4001572,0.978738,2.2408931,1.867558,...,1.1368914,0.09772497,0.5829537,-0.39944902,0.37005588]]
   f64: [[-1.3065268517353166,1.658130679618188,-0.11816404512856976,-0.6801782039968504,0.6663830820319143,...,0.37923553353558676,-0.4700328827008748,-0.21673147057553863,-0.9301565025243212,-0.17858909208732915]]
   ```
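
To illustrate the point about definition levels versus nulls, here is a minimal sketch in plain Python. It is not the Parquet reader's implementation. It assumes a flat optional column whose maximum definition level is 1, so a slot is null exactly when its level is below that maximum. The helper names `count_non_null` and `assemble` are hypothetical and exist only for this example.

```python
from typing import List, Optional, Sequence


def count_non_null(def_levels: Sequence[int], max_def_level: int) -> int:
    """Count slots whose definition level reaches the maximum (non-null slots)."""
    return sum(1 for level in def_levels if level == max_def_level)


def assemble(
    def_levels: Sequence[int], values: Sequence[float], max_def_level: int
) -> List[Optional[float]]:
    """Interleave the densely stored values with None for each null slot."""
    remaining = iter(values)
    return [
        next(remaining) if level == max_def_level else None
        for level in def_levels
    ]


# Assumed case 1: an optional column that happens to contain no nulls.
# Definition levels are present, but every one of them is at the maximum.
levels_no_nulls = [1, 1, 1, 1]
values_no_nulls = [1.5, 2.5, 3.5, 4.5]
non_null_count = count_non_null(levels_no_nulls, max_def_level=1)
dense_column = assemble(levels_no_nulls, values_no_nulls, max_def_level=1)

# Assumed case 2: the same column shape with one null in the second slot.
levels_with_null = [1, 0, 1]
values_with_null = [1.5, 3.5]
sparse_column = assemble(levels_with_null, values_with_null, max_def_level=1)
```

In the first case the number of levels equals the number of values, which is why a reader can request the full batch of values even when definition levels exist. Only the second case would produce fewer values than levels.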



