Re: [PR] NIFI-14596 Added logic to the ExcelHeaderSchemaStrategy to rename duplicate column names thereby ensuring no data loss or skewing of data. [nifi]

via GitHub Fri, 06 Jun 2025 09:37:44 -0700


exceptionfactory commented on code in PR #9975:
URL: https://github.com/apache/nifi/pull/9975#discussion_r2132505128



##########
nifi-extension-bundles/nifi-poi-bundle/nifi-poi-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##########
@@ -126,8 +130,26 @@ private List<String> getFieldNames(int firstRowIndex, Row row) throws SchemaNotF
                 fieldNames.add(fieldName);
             }
         }
+        final List<String> renamedDuplicateFieldNames = renameDuplicateFieldNames(fieldNames);
 
-        return fieldNames;
+        return renamedDuplicateFieldNames;
+    }
+
+    private List<String> renameDuplicateFieldNames(List<String> fieldNames) {
+        final Map<String, Integer> fieldNameCounts = new HashMap<>();
+        final List<String> renamedDuplicateFieldNames = new ArrayList<>();
+
+        for (String fieldName : fieldNames) {
+            if (fieldNameCounts.containsKey(fieldName)) {
+                int count = fieldNameCounts.get(fieldName) + 1;
+                fieldNameCounts.put(fieldName, count);
+                renamedDuplicateFieldNames.add(fieldName + "_" + count);

Review Comment:
   ```suggestion
                   final int count = fieldNameCounts.get(fieldName);
                   renamedDuplicateFieldNames.add("%s_%d".formatted(fieldName, count));
                   fieldNameCounts.put(fieldName, count + 1);
   ```
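
To make the effect of the suggested ordering concrete, here is a minimal, self-contained sketch of the renaming loop with the suggestion applied. The `else` branch is not visible in the hunk above, so seeding the counter at 1 for the first occurrence is an assumption, and the class name `DuplicateFieldNameSketch` is hypothetical. Under that assumption the first duplicate receives the suffix `_1`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone class for illustration; not part of the NiFi code base.
public class DuplicateFieldNameSketch {

    static List<String> renameDuplicateFieldNames(final List<String> fieldNames) {
        final Map<String, Integer> fieldNameCounts = new HashMap<>();
        final List<String> renamedDuplicateFieldNames = new ArrayList<>();

        for (final String fieldName : fieldNames) {
            if (fieldNameCounts.containsKey(fieldName)) {
                // Use the stored counter as the suffix, then advance it.
                final int count = fieldNameCounts.get(fieldName);
                renamedDuplicateFieldNames.add("%s_%d".formatted(fieldName, count));
                fieldNameCounts.put(fieldName, count + 1);
            } else {
                // Assumption: the first occurrence keeps its name and seeds the counter at 1.
                fieldNameCounts.put(fieldName, 1);
                renamedDuplicateFieldNames.add(fieldName);
            }
        }
        return renamedDuplicateFieldNames;
    }

    public static void main(final String[] args) {
        // prints [Name, Name_1]
        System.out.println(renameDuplicateFieldNames(List.of("Name", "Name")));
        // prints [Frequency, Intervals, Frequency_1, Name, Frequency_2, Intervals_1]
        System.out.println(renameDuplicateFieldNames(
                List.of("Frequency", "Intervals", "Frequency", "Name", "Frequency", "Intervals")));
    }
}
```

With the original ordering in the diff (increment first, then append), the same seed would produce `_2` for the first duplicate, which matches the `Frequency_2` example in the property description of the PR.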



##########
nifi-extension-bundles/nifi-poi-bundle/nifi-poi-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##########
@@ -126,8 +130,26 @@ private List<String> getFieldNames(int firstRowIndex, Row row) throws SchemaNotF
                 fieldNames.add(fieldName);
             }
         }
+        final List<String> renamedDuplicateFieldNames = renameDuplicateFieldNames(fieldNames);
 
-        return fieldNames;
+        return renamedDuplicateFieldNames;
+    }
+
+    private List<String> renameDuplicateFieldNames(List<String> fieldNames) {

Review Comment:
   ```suggestion
       private List<String> renameDuplicateFieldNames(final List<String> fieldNames) {
   ```



##########
nifi-extension-bundles/nifi-poi-bundle/nifi-poi-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java:
##########
@@ -47,8 +48,11 @@ public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy {
     static final int NUM_ROWS_TO_DETERMINE_TYPES = 10; // NOTE: This number is arbitrary.
     static final AllowableValue USE_STARTING_ROW = new AllowableValue("Use Starting Row", "Use Starting Row",
             "The configured first row of the Excel file is a header line that contains the names of the columns. The schema will be derived by using the "
-                    + "column names in the header of the first sheet and the following " + NUM_ROWS_TO_DETERMINE_TYPES + " rows to determine the type(s) of each column " +
-                      "while the configured header rows of subsequent sheets are skipped.");
+                    + "column names in the header of the first sheet and the following " + NUM_ROWS_TO_DETERMINE_TYPES + " rows to determine the type(s) of each column "
+                    + "while the configured header rows of subsequent sheets are skipped. "
+                    + "NOTE: If there are duplicate column names then each subsequent duplicate column name is given a one up number. "
+                    + "For example, column names \"Frequency\", \"Intervals\", \"Frequency\" \"Name\", \"Frequency\", \"Intervals\" will be "
+                    + "changed to \"Frequency\", \"Intervals\", \"Frequency_2\" \"Name\", \"Frequency_3\", \"Intervals_2\".");

Review Comment:
   Recommend simplifying the example to focus on a single column for clarity:
   ```suggestion
                       + "For example, column names \"Name\", \"Name\" will be "
                       + "changed to \"Name\", \"Name_1\".");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
