Daniel Stieglitz created NIFI-14596:
---------------------------------------
Summary: When Excel Reader is configured with
ExcelHeaderSchemaStrategy and there are duplicate column names data is skewed
and lost
Key: NIFI-14596
URL: https://issues.apache.org/jira/browse/NIFI-14596
Project: Apache NiFi
Issue Type: Bug
Reporter: Daniel Stieglitz
I am seeing a bug in the ExcelHeaderSchemaStrategy class when a sheet has
duplicate column names: only the first column with a given name is picked up.
For example, I have an Excel sheet with the following column names and first
row of data:
{code:java}
OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | Frequency
| Intervals | Item Type | Frequency | Rel. Freq | % Freq.
9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | |
Pencil | 13 | 0.302325581395349 | 30.2%{code}
The resulting schema is
{code:java}
{"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[{"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},{"name":"Region","type":["string","null"]},{"name":"Rep","type":["string","null"]},{"name":"Item","type":["string","null"]},{"name":"Units","type":["long","null"]},{"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},{"name":"Total","type":["double","null"]},{"name":"column_7","type":["string","null"]},{"name":"Bins","type":["long","null"]},{"name":"Frequency","type":["long","null"]},{"name":"Intervals","type":["string","null"]},{"name":"column_11","type":["string","null"]},{"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},{"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},{"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}]}{code}
Note that the schema contains only one field named "Frequency". This causes
data to be skewed and dropped, as seen in the record below: the field Rel. Freq
(renamed to Rel__Freq) holds the value of the second Frequency column (13)
rather than its own value of 0.302325581395349, which ends up in __Freq_
instead. In addition, the last value, for % Freq. (30.2%), is not present.
{code:java}
MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk, Units=2,
Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6,
Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0,
__Freq_=0.3023255813953488}]{code}
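The shift seen above is what one would expect if duplicate header names are
collapsed into a single field while cell values are still assigned by position.
The following is a minimal, self-contained sketch of that effect on a reduced
set of columns, together with one possible way to keep a field per column by
suffixing repeated names. It is not NiFi code: the class DuplicateHeaderSketch,
its methods, and the "_2" suffix convention are all hypothetical and used only
for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Illustration only; none of these names come from NiFi.
class DuplicateHeaderSketch {

    // Drops repeated names while keeping first-seen order, which mimics a
    // schema that can hold only one field per name.
    static List<String> collapse(List<String> header) {
        return new ArrayList<>(new LinkedHashSet<>(header));
    }

    // Hypothetical remedy: keep one name per column and suffix repeats
    // ("Frequency", "Frequency_2", ...). It does not guard against a header
    // that already contains a suffixed name such as "Frequency_2".
    static List<String> makeUnique(List<String> header) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String name : header) {
            int count = seen.merge(name, 1, Integer::sum);
            unique.add(count == 1 ? name : name + "_" + count);
        }
        return unique;
    }

    // Pairs field names with cell values strictly by position; values beyond
    // the last field name are silently dropped.
    static Map<String, String> zip(List<String> names, List<String> row) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            record.put(names.get(i), i < row.size() ? row.get(i) : null);
        }
        return record;
    }

    public static void main(String[] args) {
        List<String> header =
                List.of("Frequency", "Intervals", "Frequency", "Rel. Freq", "% Freq.");
        List<String> row =
                List.of("6", "0-9", "13", "0.302325581395349", "30.2%");

        // Collapsed names: "Rel. Freq" receives 13 and "30.2%" is lost.
        System.out.println(zip(collapse(header), row));

        // Unique names: every value stays under its own column.
        System.out.println(zip(makeUnique(header), row));
    }
}
```

With the collapsed names, "Rel. Freq" is paired with 13 and the trailing 30.2%
has no field left to land in, which matches the record reported above. With the
suffixed names, each of the five values keeps its own field.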