alamb commented on code in PR #8651:
URL: https://github.com/apache/arrow-rs/pull/8651#discussion_r2463729212
##########
parquet/src/bin/parquet-concat.rs:
##########
@@ -66,47 +66,55 @@ impl Args {
let output = File::create(&self.output)?;
- let inputs = self
- .input
- .iter()
- .map(|x| {
- let reader = File::open(x)?;
- let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
- Ok((reader, metadata))
- })
- .collect::<Result<Vec<_>>>()?;
+ let schema = {
+ let inputs = self
+ .input
+ .iter()
+ .map(|x| {
+ let reader = File::open(x)?;
+ let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
Review Comment:
This is still parsing the metadata from all the files into memory before
checking the schema.
If you are looking to support the large-file use case better, it would
require fewer resources (memory) to read the schema from the first file and
then verify the schema of the remaining files one at a time, rather than
reading the metadata for all files before validating.
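The one-at-a-time validation described above can be sketched as follows. This is a hypothetical mock, not the actual `parquet-concat` code: `MockFile` and its `schema: String` field stand in for the real `(File, ParquetMetaData)` pairs and the parquet schema comparison, so that only one schema besides the first is ever held in memory at a time.

```rust
/// Hypothetical stand-in for an input parquet file; in the real tool the
/// schema would come from `ParquetMetaDataReader::new().parse_and_finish(..)`.
#[derive(Clone)]
struct MockFile {
    name: String,
    schema: String,
}

/// First pass: take the schema from the first file, then check each
/// remaining file against it one at a time (streaming, not collected).
fn validate_schemas(files: &[MockFile]) -> Result<String, String> {
    let mut iter = files.iter();
    let first = iter.next().ok_or_else(|| "no input files".to_string())?;
    for f in iter {
        if f.schema != first.schema {
            return Err(format!("schema mismatch in {}", f.name));
        }
    }
    Ok(first.schema.clone())
}

fn main() {
    let files = vec![
        MockFile { name: "a.parquet".into(), schema: "s".into() },
        MockFile { name: "b.parquet".into(), schema: "s".into() },
    ];
    // All schemas match, so the shared schema is returned.
    assert_eq!(validate_schemas(&files), Ok("s".to_string()));
}
```

With this shape, the second pass can reopen each file and copy its row groups to the output, so the full metadata of every input never needs to be resident at once.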
##########
parquet/src/bin/parquet-concat.rs:
##########
@@ -66,47 +66,55 @@ impl Args {
let output = File::create(&self.output)?;
- let inputs = self
- .input
- .iter()
- .map(|x| {
- let reader = File::open(x)?;
- let metadata = ParquetMetaDataReader::new().parse_and_finish(&reader)?;
- Ok((reader, metadata))
- })
- .collect::<Result<Vec<_>>>()?;
+ let schema = {
Review Comment:
Since there are no tests, I think a comment here explaining the rationale
for not keeping the files open is probably good.
```suggestion
// Check schemas in a first pass to make sure they all match,
// and then do the work in a second pass after the validation
let schema = {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]