[jira] [Created] (NIFI-14596) When Excel Reader is configured with ExcelHeaderSchemaStrategy and there are duplicate column names data is skewed and lost

Daniel Stieglitz (Jira) Fri, 23 May 2025 09:33:10 -0700

Daniel Stieglitz created NIFI-14596:
---------------------------------------


             Summary: When Excel Reader is configured with 
ExcelHeaderSchemaStrategy and there are duplicate column names data is skewed 
and lost
                 Key: NIFI-14596
                 URL: https://issues.apache.org/jira/browse/NIFI-14596
             Project: Apache NiFi
          Issue Type: Bug
            Reporter: Daniel Stieglitz


 I am seeing a bug with the ExcelHeaderSchemaStrategy class when there are 
duplicate column names: only the first column with a given name is picked up. 
For example, I have an Excel sheet with the following column names and first 
row of data:
 
{code:java}
OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | Frequency | Intervals | Item Type | Frequency | Rel. Freq | % Freq.
9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | | Pencil | 13 | 0.302325581395349 | 30.2%{code}
 
The resulting schema is 
{code:java}
{"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[
{"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},
{"name":"Region","type":["string","null"]},
{"name":"Rep","type":["string","null"]},
{"name":"Item","type":["string","null"]},
{"name":"Units","type":["long","null"]},
{"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},
{"name":"Total","type":["double","null"]},
{"name":"column_7","type":["string","null"]},
{"name":"Bins","type":["long","null"]},
{"name":"Frequency","type":["long","null"]},
{"name":"Intervals","type":["string","null"]},
{"name":"column_11","type":["string","null"]},
{"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},
{"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},
{"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}
]}{code}
 
Note that only one field is named "Frequency": the second Frequency column is 
missing from the schema. This causes data to be shifted and dropped, as seen 
below. The field Rel. Freq (renamed to Rel__Freq) holds 13.0, the value of the 
second Frequency column, instead of its own value of 0.302325581395349, which 
appears under __Freq_ instead. In addition, the last value, for % Freq., is 
not present.
{code:java}
MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk, Units=2, 
Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6, 
Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0, 
__Freq_=0.3023255813953488}]{code}
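
To illustrate the symptom, here is a minimal sketch that reproduces the same kind of shift outside of NiFi. It is not the NiFi implementation: the class DuplicateHeaderDemo and the helpers uniqueNames and toRecord are hypothetical names, and the assumption that duplicate names collapse to their first occurrence while cells are assigned by position is a guess based only on the output above. The header and row are a shortened tail of the sheet with the blank columns left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Illustration only: all names here are hypothetical and this is not NiFi code.
class DuplicateHeaderDemo {

    // Assumed behavior: duplicate header names collapse to their first occurrence,
    // so the resulting field list is shorter than the header row.
    static List<String> uniqueNames(List<String> header) {
        return new ArrayList<>(new LinkedHashSet<>(header));
    }

    // Pairs field i with cell i. Cells beyond the last field are silently dropped.
    static Map<String, String> toRecord(List<String> fields, List<String> cells) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            record.put(fields.get(i), i < cells.size() ? cells.get(i) : null);
        }
        return record;
    }

    public static void main(String[] args) {
        // Shortened tail of the sheet above (blank columns omitted).
        List<String> header = List.of(
                "Frequency", "Intervals", "Item Type", "Frequency", "Rel. Freq", "% Freq.");
        List<String> cells = List.of(
                "6", "0-9", "Pencil", "13", "0.302325581395349", "30.2%");

        // Six cells are matched against five field names.
        System.out.println(toRecord(uniqueNames(header), cells));
    }
}
```

With six cells matched against five unique names, Rel. Freq receives 13, % Freq. receives 0.302325581395349, and 30.2% has no field left to land in, which mirrors the record shown above.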



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
