[jira] [Assigned] (NIFI-14596) When Excel Reader is configured with ExcelHeaderSchemaStrategy and there are duplicate column names data is skewed and lost

Daniel Stieglitz (Jira) Fri, 23 May 2025 09:35:26 -0700


     [ 
https://issues.apache.org/jira/browse/NIFI-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Stieglitz reassigned NIFI-14596:
---------------------------------------

    Assignee: Daniel Stieglitz

> When Excel Reader is configured with ExcelHeaderSchemaStrategy and there are 
> duplicate column names data is skewed and lost
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14596
>                 URL: https://issues.apache.org/jira/browse/NIFI-14596
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Daniel Stieglitz
>            Assignee: Daniel Stieglitz
>            Priority: Major
>
> I am seeing a bug in the ExcelHeaderSchemaStrategy class: when there are
> duplicate column names, only the first column is picked up. For example, I
> have an Excel sheet with the following column names and first row of data:
>  
> {code:java}
> OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | 
> Frequency | Intervals | Item Type | Frequency | Rel. Freq | % Freq.
> 9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | | 
> Pencil | 13 | 0.302325581395349 | 30.2%{code}
>  
> The resulting schema is:
> {code:java}
> {"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[
> {"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},
> {"name":"Region","type":["string","null"]},
> {"name":"Rep","type":["string","null"]},
> {"name":"Item","type":["string","null"]},
> {"name":"Units","type":["long","null"]},
> {"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},
> {"name":"Total","type":["double","null"]},
> {"name":"column_7","type":["string","null"]},
> {"name":"Bins","type":["long","null"]},
> {"name":"Frequency","type":["long","null"]},
> {"name":"Intervals","type":["string","null"]},
> {"name":"column_11","type":["string","null"]},
> {"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},
> {"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},
> {"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}
> ]}{code}
>  
> Note that only one field is named "Frequency". This causes data to be
> dropped and skewed, as seen below: the field Rel. Freq (renamed to
> Rel__Freq) holds the value of the second Frequency column (13) instead of
> its own value of 0.302325581395349, which is shifted into % Freq. (renamed
> to __Freq_). In addition, the last value, 30.2% for % Freq., is not present.
> {code:java}
> MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk, 
> Units=2, Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6, 
> Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0, 
> __Freq_=0.3023255813953488}]{code}
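
The name collision described above could in principle be avoided by making header names unique before they become schema field names, so that each column keeps its own position. The following is a minimal Java sketch of that idea only. It is not NiFi's implementation and not necessarily the fix adopted for NIFI-14596; the class name HeaderNames, the method makeUnique, and the "_2", "_3" suffix scheme are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of NiFi: illustrates de-duplicating header names.
class HeaderNames {

    /**
     * Returns one unique name per column, in the original column order.
     * The first occurrence of a name is kept as-is; later occurrences get a
     * numeric suffix (assumed scheme: name_2, name_3, ...).
     */
    public static List<String> makeUnique(List<String> headers) {
        Set<String> used = new HashSet<>();
        List<String> unique = new ArrayList<>(headers.size());
        for (String header : headers) {
            String candidate = header;
            int suffix = 2;
            // Set.add returns false when the name is already taken,
            // so keep trying suffixes until an unused name is found.
            while (!used.add(candidate)) {
                candidate = header + "_" + suffix++;
            }
            unique.add(candidate);
        }
        return unique;
    }

    public static void main(String[] args) {
        // Tail of the header row from the report, including the duplicate "Frequency".
        List<String> headers = List.of(
                "Bins", "Frequency", "Intervals", "Item Type",
                "Frequency", "Rel. Freq", "% Freq.");
        // prints [Bins, Frequency, Intervals, Item Type, Frequency_2, Rel. Freq, % Freq.]
        System.out.println(makeUnique(headers));
    }
}
```

With seven distinct names for seven columns, a name-keyed record no longer collapses the two Frequency columns into one entry, so the values that follow stay aligned with their own fields.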



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
