This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/master by this push:
new f2de2cd3c8 No longer write Parquet column metadata after column chunks
*and* in the footer (#6117)
f2de2cd3c8 is described below
commit f2de2cd3c8a8496e1f96e58b50786227bfe915f7
Author: Ed Seidl <[email protected]>
AuthorDate: Fri Aug 2 11:05:01 2024 -0700
No longer write Parquet column metadata after column chunks *and* in the
footer (#6117)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`
Signed-off-by: Bugen Zhao <[email protected]>
* fix example tests
Signed-off-by: Bugen Zhao <[email protected]>
---------
Signed-off-by: Bugen Zhao <[email protected]>
* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily
accidentally copies data (#6043)
* deprecate auto copy, ask explicit reference
* update comments
* make cargo doc happy
* Make display of interval types more pretty (#6006)
* improve dispaly for interval.
* update test in pretty, and fix display problem.
* tmp
* fix tests in arrow-cast.
* fix tests in pretty.
* fix style.
* Update snafu (#5930)
* Update Parquet thrift generated structures (#6045)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* Revert "Revert "Write Bloom filters between row groups instead of the end
(#…" (#5933)
This reverts commit 22e0b4432c9838f2536284015271d3de9a165135.
* Revert "Update snafu (#5930)" (#6069)
This reverts commit 756b1fb26d1702f36f446faf9bb40a4869c3e840.
* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)
* Update pyo3 requirement from 0.21.1 to 0.22.1
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit
the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.1)
---
updated-dependencies:
- dependency-name: pyo3
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <[email protected]>
* refactor: remove deprecated `FromPyArrow::from_pyarrow`
"GIL Refs" are being phased out.
* chore: update `pyo3` in integration tests
---------
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot]
<49699333+dependabot[bot]@users.noreply.github.com>
* remove repeated codes to make the codes more concise. (#6080)
* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)
* update to latest thrift (as of 11 Jul 2024) from parquet-format
* pass None for optional size statistics
* escape HTML tags
* don't need to escape brackets in arrays
* add support for unencoded_byte_array_data_bytes
* add comments
* change sig of ColumnMetrics::update_variable_length_bytes()
* rename ParquetOffsetIndex to OffsetSizeIndex
* rename some functions
* suggestion from review
Co-authored-by: Andrew Lamb <[email protected]>
* add Default trait to ColumnMetrics as suggested in review
* rename OffsetSizeIndex to OffsetIndexMetaData
---------
Co-authored-by: Andrew Lamb <[email protected]>
* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit
the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](https://github.com/pyo3/pyo3/compare/v0.21.1...v0.22.2)
---
updated-dependencies:
- dependency-name: pyo3
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot]
<49699333+dependabot[bot]@users.noreply.github.com>
* Deprecate read_page_locations() and simplify offset index in
`ParquetMetaData` (#6095)
* deprecate read_page_locations
* add to_thrift() to OffsetIndexMetaData
* no longer write inline column metadata
* Update parquet/src/column/writer/mod.rs
Co-authored-by: Ed Seidl <[email protected]>
* suggestion from review
Co-authored-by: Andrew Lamb <[email protected]>
* add some more documentation
* remove write_metadata from PageWriter
---------
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Bugen Zhao <[email protected]>
Co-authored-by: Xiangpeng Hao <[email protected]>
Co-authored-by: kamille <[email protected]>
Co-authored-by: Jesse <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Marco Neumann <[email protected]>
Co-authored-by: dependabot[bot]
<49699333+dependabot[bot]@users.noreply.github.com>
---
parquet/src/arrow/arrow_writer/mod.rs | 7 +------
parquet/src/column/page.rs | 8 +-------
parquet/src/column/writer/mod.rs | 18 +++---------------
parquet/src/file/metadata/mod.rs | 16 ++++++++++++++--
parquet/src/file/writer.rs | 12 ------------
5 files changed, 19 insertions(+), 42 deletions(-)
diff --git a/parquet/src/arrow/arrow_writer/mod.rs
b/parquet/src/arrow/arrow_writer/mod.rs
index 8f7b514ccf..db5d4502cf 100644
--- a/parquet/src/arrow/arrow_writer/mod.rs
+++ b/parquet/src/arrow/arrow_writer/mod.rs
@@ -43,7 +43,7 @@ use crate::column::writer::{
};
use crate::data_type::{ByteArray, FixedLenByteArray};
use crate::errors::{ParquetError, Result};
-use crate::file::metadata::{ColumnChunkMetaData, KeyValue, RowGroupMetaData};
+use crate::file::metadata::{KeyValue, RowGroupMetaData};
use crate::file::properties::{WriterProperties, WriterPropertiesPtr};
use crate::file::reader::{ChunkReader, Length};
use crate::file::writer::{SerializedFileWriter, SerializedRowGroupWriter};
@@ -489,11 +489,6 @@ impl PageWriter for ArrowPageWriter {
Ok(spec)
}
- fn write_metadata(&mut self, _metadata: &ColumnChunkMetaData) ->
Result<()> {
- // Skip writing metadata as won't be copied anyway
- Ok(())
- }
-
fn close(&mut self) -> Result<()> {
Ok(())
}
diff --git a/parquet/src/column/page.rs b/parquet/src/column/page.rs
index 947a633f48..585d1951c1 100644
--- a/parquet/src/column/page.rs
+++ b/parquet/src/column/page.rs
@@ -21,7 +21,7 @@ use bytes::Bytes;
use crate::basic::{Encoding, PageType};
use crate::errors::{ParquetError, Result};
-use crate::file::{metadata::ColumnChunkMetaData, statistics::Statistics};
+use crate::file::statistics::Statistics;
use crate::format::PageHeader;
/// Parquet Page definition.
@@ -350,12 +350,6 @@ pub trait PageWriter: Send {
/// either data page or dictionary page.
fn write_page(&mut self, page: CompressedPage) -> Result<PageWriteSpec>;
- /// Writes column chunk metadata into the output stream/sink.
- ///
- /// This method is called once before page writer is closed, normally when
writes are
- /// finalised in column writer.
- fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()>;
-
/// Closes resources and flushes underlying sink.
/// Page writer should not be used after this method is called.
fn close(&mut self) -> Result<()>;
diff --git a/parquet/src/column/writer/mod.rs b/parquet/src/column/writer/mod.rs
index 54d8fd3cc1..519a219f94 100644
--- a/parquet/src/column/writer/mod.rs
+++ b/parquet/src/column/writer/mod.rs
@@ -579,7 +579,7 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
self.write_dictionary_page()?;
}
self.flush_data_pages()?;
- let metadata = self.write_column_metadata()?;
+ let metadata = self.build_column_metadata()?;
self.page_writer.close()?;
let boundary_order = match (
@@ -1041,8 +1041,8 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a,
E> {
Ok(())
}
- /// Assembles and writes column chunk metadata.
- fn write_column_metadata(&mut self) -> Result<ColumnChunkMetaData> {
+ /// Assembles column chunk metadata.
+ fn build_column_metadata(&mut self) -> Result<ColumnChunkMetaData> {
let total_compressed_size = self.column_metrics.total_compressed_size
as i64;
let total_uncompressed_size =
self.column_metrics.total_uncompressed_size as i64;
let num_values = self.column_metrics.total_num_values as i64;
@@ -1050,15 +1050,9 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a,
E> {
// If data page offset is not set, then no pages have been written
let data_page_offset =
self.column_metrics.data_page_offset.unwrap_or(0) as i64;
- let file_offset = match dict_page_offset {
- Some(dict_offset) => dict_offset + total_compressed_size,
- None => data_page_offset + total_compressed_size,
- };
-
let mut builder = ColumnChunkMetaData::builder(self.descr.clone())
.set_compression(self.codec)
.set_encodings(self.encodings.iter().cloned().collect())
- .set_file_offset(file_offset)
.set_total_compressed_size(total_compressed_size)
.set_total_uncompressed_size(total_uncompressed_size)
.set_num_values(num_values)
@@ -1138,8 +1132,6 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a,
E> {
}
let metadata = builder.build()?;
- self.page_writer.write_metadata(&metadata)?;
-
Ok(metadata)
}
@@ -3589,10 +3581,6 @@ mod tests {
Ok(res)
}
- fn write_metadata(&mut self, _metadata: &ColumnChunkMetaData) ->
Result<()> {
- Ok(())
- }
-
fn close(&mut self) -> Result<()> {
Ok(())
}
diff --git a/parquet/src/file/metadata/mod.rs b/parquet/src/file/metadata/mod.rs
index d99cd95103..86c673bbdb 100644
--- a/parquet/src/file/metadata/mod.rs
+++ b/parquet/src/file/metadata/mod.rs
@@ -682,7 +682,12 @@ impl ColumnChunkMetaData {
self.file_path.as_deref()
}
- /// Byte offset in `file_path()`.
+ /// Byte offset of `ColumnMetaData` in `file_path()`.
+ ///
+ /// Note that the meaning of this field has been inconsistent between
implementations
+ /// so its use has since been deprecated in the Parquet specification.
Modern implementations
+ /// will set this to `0` to indicate that the `ColumnMetaData` is solely
contained in the
+ /// `ColumnChunk` struct.
pub fn file_offset(&self) -> i64 {
self.file_offset
}
@@ -1040,6 +1045,14 @@ impl ColumnChunkMetaDataBuilder {
}
/// Sets file offset in bytes.
+ ///
+ /// This field was meant to provide an alternate to storing
`ColumnMetadata` directly in
+ /// the `ColumnChunkMetadata`. However, most Parquet readers assume the
`ColumnMetadata`
+ /// is stored inline and ignore this field.
+ #[deprecated(
+ since = "53.0.0",
+ note = "The Parquet specification requires this field to be 0"
+ )]
pub fn set_file_offset(mut self, value: i64) -> Self {
self.0.file_offset = value;
self
@@ -1453,7 +1466,6 @@ mod tests {
let col_metadata = ColumnChunkMetaData::builder(column_descr.clone())
.set_encodings(vec![Encoding::PLAIN, Encoding::RLE])
.set_file_path("file_path".to_owned())
- .set_file_offset(100)
.set_num_values(1000)
.set_compression(Compression::SNAPPY)
.set_total_compressed_size(2000)
diff --git a/parquet/src/file/writer.rs b/parquet/src/file/writer.rs
index 89aaf028d1..bcf8449b44 100644
--- a/parquet/src/file/writer.rs
+++ b/parquet/src/file/writer.rs
@@ -649,13 +649,10 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W>
{
));
}
- let file_offset = self.buf.bytes_written() as i64;
-
let map_offset = |x| x - src_offset + write_offset as i64;
let mut builder =
ColumnChunkMetaData::builder(metadata.column_descr_ptr())
.set_compression(metadata.compression())
.set_encodings(metadata.encodings().clone())
- .set_file_offset(file_offset)
.set_total_compressed_size(metadata.compressed_size())
.set_total_uncompressed_size(metadata.uncompressed_size())
.set_num_values(metadata.num_values())
@@ -680,7 +677,6 @@ impl<'a, W: Write + Send> SerializedRowGroupWriter<'a, W> {
}
}
- SerializedPageWriter::new(self.buf).write_metadata(&metadata)?;
let (_, on_close) = self.get_on_close();
on_close(close)
}
@@ -808,14 +804,6 @@ impl<'a, W: Write + Send> PageWriter for
SerializedPageWriter<'a, W> {
Ok(spec)
}
- fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()>
{
- let mut protocol = TCompactOutputProtocol::new(&mut self.sink);
- metadata
- .to_column_metadata_thrift()
- .write_to_out_protocol(&mut protocol)?;
- Ok(())
- }
-
fn close(&mut self) -> Result<()> {
self.sink.flush()?;
Ok(())