[ https://issues.apache.org/jira/browse/NIFI-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Stieglitz reassigned NIFI-14596:
---------------------------------------
Assignee: Daniel Stieglitz
> When Excel Reader is configured with ExcelHeaderSchemaStrategy and there are
> duplicate column names data is skewed and lost
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: NIFI-14596
> URL: https://issues.apache.org/jira/browse/NIFI-14596
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Daniel Stieglitz
> Assignee: Daniel Stieglitz
> Priority: Major
>
> I am seeing a bug in the ExcelHeaderSchemaStrategy class: when a sheet has
> duplicate column names, only the first of the duplicate columns is picked up.
> For example, I have an Excel sheet with the following column names and first
> row of data:
>
> {code:java}
> OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | Frequency | Intervals | Item Type | Frequency | Rel. Freq | % Freq.
> 9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | | Pencil | 13 | 0.302325581395349 | 30.2%{code}
>
> The resulting schema is
> {code:java}
> {"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[
> {"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},
> {"name":"Region","type":["string","null"]},
> {"name":"Rep","type":["string","null"]},
> {"name":"Item","type":["string","null"]},
> {"name":"Units","type":["long","null"]},
> {"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},
> {"name":"Total","type":["double","null"]},
> {"name":"column_7","type":["string","null"]},
> {"name":"Bins","type":["long","null"]},
> {"name":"Frequency","type":["long","null"]},
> {"name":"Intervals","type":["string","null"]},
> {"name":"column_11","type":["string","null"]},
> {"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},
> {"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},
> {"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}]}{code}
>
> Note that the schema contains only one field named "Frequency". As a result,
> data is shifted and dropped, as seen below: the field Rel. Freq (renamed to
> Rel__Freq) holds 13, the value of the second Frequency column, instead of its
> own value of 0.302325581395349. Likewise, __Freq_ holds the Rel. Freq value,
> and the last value, 30.2% for % Freq., is not present.
> {code:java}
> MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk,
> Units=2, Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6,
> Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0,
> __Freq_=0.3023255813953488}]{code}
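
The shifted values described above can be reproduced outside NiFi with a small
standalone sketch. This is not the ExcelHeaderSchemaStrategy code; it only
illustrates one way this class of bug can arise, assuming fields are keyed by
name and cell values are then assigned by position. The class and method names
(DuplicateHeaderSketch, readNaive, uniqueNames) are hypothetical, the header
layout with two unnamed columns is inferred from the schema in the report, and
the "_2" suffix is one possible way to disambiguate duplicates, not
necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateHeaderSketch {

    // Assumed failure mode: fields are keyed by name, so a repeated header
    // collapses into a single field and the field list ends up shorter
    // than the row it is supposed to describe.
    static Map<String, String> readNaive(List<String> headers, List<String> cells) {
        Map<String, String> record = new LinkedHashMap<>();
        for (String header : headers) {
            record.put(header, null); // a duplicate name adds no new key
        }
        // Cells are assigned by position against the shortened field list,
        // so every value after the duplicate shifts left by one field.
        List<String> fieldNames = new ArrayList<>(record.keySet());
        for (int i = 0; i < fieldNames.size() && i < cells.size(); i++) {
            record.put(fieldNames.get(i), cells.get(i));
        }
        return record;
    }

    // Hypothetical remedy: append an occurrence counter to repeated names
    // so that every column keeps its own field.
    static List<String> uniqueNames(List<String> headers) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String header : headers) {
            int count = seen.merge(header, 1, Integer::sum);
            unique.add(count == 1 ? header : header + "_" + count);
        }
        return unique;
    }

    public static void main(String[] args) {
        // Sixteen columns: the fifteen schema fields plus the second "Frequency".
        List<String> headers = Arrays.asList(
                "OrderDate", "Region", "Rep", "Item", "Units", "Unit_Cost", "Total",
                "column_7", "Bins", "Frequency", "Intervals", "column_11",
                "Item_Type", "Frequency", "Rel__Freq", "__Freq_");
        List<String> cells = Arrays.asList(
                "2014-09-01", "Central", "Smith", "Desk", "2", "125.0", "250.0",
                null, "9", "6", "0-9", null,
                "Pencil", "13", "0.302325581395349", "30.2%");

        // Rel__Freq receives "13" and "30.2%" is dropped, as in the report.
        System.out.println(readNaive(headers, cells));

        // With unique names, all sixteen values land in their own fields.
        System.out.println(readNaive(uniqueNames(headers), cells));
    }
}
```

In the first call the field list has 15 names for 16 cells, which matches the
reported symptom: Rel__Freq takes the second Frequency value and the % Freq.
value has nowhere to go. A counter suffix can itself collide with a real
header such as "Frequency_2", so a production fix would need to check for
that case.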