alamb commented on code in PR #8651:
URL: https://github.com/apache/arrow-rs/pull/8651#discussion_r2463729212
##########
parquet/src/bin/parquet-concat.rs:
##########
@@ -66,47 +66,55 @@ impl Args {
let output = File::create(&self.output)?;
- let inputs = self
- .input
- .iter()
- .map(|x| {
- let reader = File::open(x)?;
- let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
- Ok((reader, metadata))
- })
- .collect::<Result<Vec<_>>>()?;
+ let schema = {
+ let inputs = self
+ .input
+ .iter()
+ .map(|x| {
+ let reader = File::open(x)?;
+ let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
Review Comment:
This is still parsing the metadata from all the files into memory before
checking the schema.
If you are looking to support the large-file use case better, it would
require fewer resources (memory) to read the schema from the first file and
then verify the schema of the remaining files one at a time, rather than
reading the metadata for all files before validating.
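The one-at-a-time validation described above can be sketched as follows. This is a hypothetical mock, not the actual `parquet-concat` code: `MockFile` and its `schema: String` field stand in for the real `(File, ParquetMetaData)` pairs and the parquet schema comparison, so that only one schema besides the first is ever held in memory at a time.

```rust
/// Hypothetical stand-in for an input parquet file; in the real tool the
/// schema would come from `ParquetMetaDataReader::new().parse_and_finish(..)`.
#[derive(Clone)]
struct MockFile {
    name: String,
    schema: String,
}

/// First pass: take the schema from the first file, then check each
/// remaining file against it one at a time (streaming, not collected).
fn validate_schemas(files: &[MockFile]) -> Result<String, String> {
    let mut iter = files.iter();
    let first = iter.next().ok_or_else(|| "no input files".to_string())?;
    for f in iter {
        if f.schema != first.schema {
            return Err(format!("schema mismatch in {}", f.name));
        }
    }
    Ok(first.schema.clone())
}

fn main() {
    let files = vec![
        MockFile { name: "a.parquet".into(), schema: "s".into() },
        MockFile { name: "b.parquet".into(), schema: "s".into() },
    ];
    // All schemas match, so the shared schema is returned.
    assert_eq!(validate_schemas(&files), Ok("s".to_string()));
}
```

With this shape, the second pass can reopen each file and copy its row groups to the output, so the full metadata of every input never needs to be resident at once.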
##########
parquet/src/bin/parquet-concat.rs:
##########
@@ -66,47 +66,55 @@ impl Args {
let output = File::create(&self.output)?;
- let inputs = self
- .input
- .iter()
- .map(|x| {
- let reader = File::open(x)?;
- let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
- Ok((reader, metadata))
- })
- .collect::<Result<Vec<_>>>()?;
+ let schema = {
Review Comment:
Since there are no tests, I think a comment here explaining the rationale
for not keeping the files open is probably good.
```suggestion
// Check schemas in a first pass to make sure they all match,
// and then do the work in a second pass after the validation
let schema = {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]