pacman82 opened a new issue, #3017:
URL: https://github.com/apache/arrow-rs/issues/3017
**Describe the bug**

This concerns the output written by the `parquet` crate. Declaring a column
to contain a microsecond timestamp using a `LogicalType` causes the written
file to **not** have a converted type, at least according to `parquet-tools`.
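
For reference, this is the legacy mapping I would expect to end up in the metadata. A minimal sketch, assuming the `From<Option<LogicalType>>` conversion in `parquet::basic` covers this case:

```rust
use parquet::{
    basic::{ConvertedType, LogicalType},
    format::{MicroSeconds, TimeUnit},
};

fn main() {
    let logical_type = LogicalType::Timestamp {
        is_adjusted_to_u_t_c: false,
        unit: TimeUnit::MICROS(MicroSeconds {}),
    };
    // A microsecond timestamp should map to the legacy TIMESTAMP_MICROS
    // converted type, so that readers which predate logical types (such as
    // the one Azure apparently uses) can still interpret the column.
    assert_eq!(
        ConvertedType::from(Some(logical_type)),
        ConvertedType::TIMESTAMP_MICROS
    );
}
```
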
**To Reproduce**

1. Write a file `tmp.par` with a single column of type Timestamp with
microsecond unit, using a logical type.
```rust
use std::sync::Arc;

use parquet::{
    basic::{LogicalType, Repetition, Type},
    data_type::Int64Type,
    file::{properties::WriterProperties, writer::SerializedFileWriter},
    format::{MicroSeconds, TimeUnit},
    schema::types,
};

fn main() {
    let mut data = Vec::with_capacity(1024);

    let logical_type = LogicalType::Timestamp {
        is_adjusted_to_u_t_c: false,
        unit: TimeUnit::MICROS(MicroSeconds {}),
    };
    let field = Arc::new(
        types::Type::primitive_type_builder("col1", Type::INT64)
            .with_logical_type(Some(logical_type))
            .with_repetition(Repetition::REQUIRED)
            .build()
            .unwrap(),
    );
    let schema = Arc::new(
        types::Type::group_type_builder("schema")
            .with_fields(&mut vec![field])
            .build()
            .unwrap(),
    );

    // Write data
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(&mut data, schema, props).unwrap();
    let mut row_group_writer = writer.next_row_group().unwrap();
    let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
    column_writer
        .typed::<Int64Type>()
        .write_batch(&[1, 2, 3, 4], None, None)
        .unwrap();
    column_writer.close().unwrap();
    row_group_writer.close().unwrap();
    writer.close().unwrap();

    // Write file for inspection with parquet-tools
    std::fs::write("tmp.par", data).unwrap();
}
```
2. Install `parquet-tools` in a virtual environment and inspect the file:
```shell
pip install parquet-tools==0.2.11
parquet-tools inspect tmp.par
```
The resulting output indicates no converted type:
```
############ file meta data ############
created_by: parquet-rs version 26.0.0
num_columns: 1
num_rows: 4
num_row_groups: 1
format_version: 1.0
serialized_size: 143
############ Columns ############
col1
############ Column(col1) ############
name: col1
path: col1
max_definition_level: 0
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: UNCOMPRESSED (space_saved: 0%)
```
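
For comparison, here is a sketch that reads the file back with the `parquet` crate itself and prints both annotations as the crate sees them. This is only a sanity check, since the reader may re-derive the converted type from the logical type rather than take it from the Thrift footer:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    let file = File::open("tmp.par").unwrap();
    let reader = SerializedFileReader::new(file).unwrap();
    let schema = reader.metadata().file_metadata().schema_descr();

    // Inspect the annotations of the first (and only) column.
    let basic_info = schema.column(0).self_type().get_basic_info();
    println!("logical type:   {:?}", basic_info.logical_type());
    println!("converted type: {:?}", basic_info.converted_type());
}
```
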
**Expected behavior**

I would have expected the converted type to show up in the metadata emitted
by `parquet-tools`.
**Additional context**

Triggered by upstream `odbc2parquet` issue
<https://github.com/pacman82/odbc2parquet/issues/284>. Azure does not seem to
be able to handle the output since the migration to `LogicalType`.

I previously misdiagnosed this as the converted type not being set correctly
in the schema information; that does in fact happen. See: #2984.
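
In case it helps anyone who hits the same problem in the meantime, a possible workaround sketch (untested against Azure) is to set the legacy annotation explicitly on the builder, in addition to the logical type:

```rust
use std::sync::Arc;

use parquet::{
    basic::{ConvertedType, LogicalType, Repetition, Type},
    format::{MicroSeconds, TimeUnit},
    schema::types,
};

fn main() {
    // Same field as in the reproduction, but with the legacy converted type
    // set explicitly alongside the logical type.
    let field = Arc::new(
        types::Type::primitive_type_builder("col1", Type::INT64)
            .with_logical_type(Some(LogicalType::Timestamp {
                is_adjusted_to_u_t_c: false,
                unit: TimeUnit::MICROS(MicroSeconds {}),
            }))
            .with_converted_type(ConvertedType::TIMESTAMP_MICROS)
            .with_repetition(Repetition::REQUIRED)
            .build()
            .unwrap(),
    );
    assert_eq!(
        field.get_basic_info().converted_type(),
        ConvertedType::TIMESTAMP_MICROS
    );
}
```
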
Thanks, any help is appreciated!