[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697669#comment-17697669 ]
ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128833713


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }

+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+                HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                    this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   That sounds reasonable. Would you like to create a PR with your proposed change?
   @vectorijk


> ParquetRewriter supports more than one input file
> -------------------------------------------------
>
>                 Key: PARQUET-2228
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2228
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this
> task is to support multiple input files and the rewriter merges them into a
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
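
For readers following the diff hunk in this message, the schema check it performs (the first input fixes the expected schema, and any later input that differs is rejected) can be sketched without a parquet-mr dependency. This is only an illustrative sketch, not the PR's code: SchemaCheckSketch, commonSchema and SchemaMismatchException are hypothetical names, and schemas are modeled as plain strings rather than MessageType objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the equality check in openInputFiles; not parquet-mr code.
final class SchemaCheckSketch {

  /** Thrown when an input's schema differs from the first input's schema. */
  static final class SchemaMismatchException extends RuntimeException {
    SchemaMismatchException(String message) {
      super(message);
    }
  }

  /**
   * Returns the schema shared by all inputs. Each entry maps a file name to
   * its schema text (an assumption made for this sketch). The first entry
   * sets the expected schema; later entries must be equal to it.
   */
  static String commonSchema(Map<String, String> schemaByFile) {
    if (schemaByFile == null || schemaByFile.isEmpty()) {
      throw new IllegalArgumentException("No input files");
    }
    String expected = null;
    for (Map.Entry<String, String> entry : schemaByFile.entrySet()) {
      String current = Objects.requireNonNull(
          entry.getValue(), "Missing schema for file: " + entry.getKey());
      if (expected == null) {
        // First input file: remember its schema as the reference.
        expected = current;
      } else if (!expected.equals(current)) {
        // Strict equality, mirroring the "for simplicity" comment in the diff.
        throw new SchemaMismatchException(
            "Input files have different schemas, current file: " + entry.getKey());
      }
    }
    return expected;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the first file sets the reference.
    Map<String, String> inputs = new LinkedHashMap<>();
    inputs.put("a.parquet", "message m { required int32 id; }");
    inputs.put("b.parquet", "message m { required int32 id; }");
    System.out.println(commonSchema(inputs));
  }
}
```

In this sketch a mismatch names the offending file in the exception message, as the diff does; whether an I/O failure while opening a file should surface as IllegalArgumentException is the point under review in the thread and is not modeled here.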