[ https://issues.apache.org/jira/browse/NIFI-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Stieglitz updated NIFI-14596:
------------------------------------
    Description: 
I am seeing a bug in the ExcelHeaderSchemaStrategy class when there are
duplicate column names: only the first column with a given name is picked up.
For example, I have an Excel sheet with the following column names and first
row of data:
 
{code:java}
OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | Frequency | Intervals | Item Type | Frequency | Rel. Freq | % Freq.
9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | | Pencil | 13 | 0.302325581395349 | 30.2%{code}
 
The resulting schema is:
{code:java}
{"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[{"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},{"name":"Region","type":["string","null"]},{"name":"Rep","type":["string","null"]},{"name":"Item","type":["string","null"]},{"name":"Units","type":["long","null"]},{"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},{"name":"Total","type":["double","null"]},{"name":"column_7","type":["string","null"]},{"name":"Bins","type":["long","null"]},{"name":"Frequency","type":["long","null"]},{"name":"Intervals","type":["string","null"]},{"name":"column_11","type":["string","null"]},{"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},{"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},{"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}]}{code}
 
Note that the schema contains only one field named "Frequency". This causes
data to be dropped and skewed, as seen below in the first MapRecord printed to
the command line: the field Rel. Freq (renamed to Rel__Freq) holds the value of
the second Frequency column rather than its own value of 0.302325581395349.
In addition, the last value, for % Freq., is not present.
{code:java}
MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk, Units=2, 
Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6, 
Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0, 
__Freq_=0.3023255813953488}]{code}
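
To make the failure mode concrete, here is a minimal Java sketch. It is not
NiFi's actual code. It assumes field values are stored in a map keyed by header
name, which is a simplification, and it uses a shortened, illustrative row
rather than the full sheet. The class DuplicateHeaderSketch and the helpers
uniquify and toRecord are hypothetical names introduced for this example. The
sketch shows that a name-keyed structure cannot hold two "Frequency" entries,
and one possible way to make duplicate header names unique by appending a
numeric suffix.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateHeaderSketch {

    // Hypothetical helper, not part of NiFi: the first occurrence of a header
    // keeps its name; later occurrences get "_2", "_3", ... appended, skipping
    // any candidate that is already in use.
    public static List<String> uniquify(List<String> headers) {
        Set<String> used = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String header : headers) {
            String candidate = header;
            int suffix = 1;
            while (!used.add(candidate)) {
                suffix++;
                candidate = header + "_" + suffix;
            }
            unique.add(candidate);
        }
        return unique;
    }

    // Pairs names with values by position in a name-keyed map. With duplicate
    // names, a later put overwrites the earlier value, so an entry is lost.
    public static Map<String, String> toRecord(List<String> names, List<String> values) {
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            record.put(names.get(i), values.get(i));
        }
        return record;
    }

    public static void main(String[] args) {
        // Shortened, illustrative row; not the full sheet from this issue.
        List<String> headers = List.of("Bins", "Frequency", "Intervals", "Frequency");
        List<String> values = List.of("9", "6", "0-9", "13");

        // Four values collapse into three entries:
        // {Bins=9, Frequency=13, Intervals=0-9}
        System.out.println(toRecord(headers, values));

        // With unique names all four values survive:
        // {Bins=9, Frequency=6, Intervals=0-9, Frequency_2=13}
        System.out.println(toRecord(uniquify(headers), values));
    }
}
```

Whether a fix should rename duplicates like this, fall back to positional names
such as column_N, or reject the sheet is a design decision this sketch does not
settle.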

  was:
 I am seeing a bug with the ExcelHeaderSchemaStrategy class when there are 
duplicate column names as only the first column is picked up. For example I 
have an Excel sheet with the following column names and the first row of data
 
{code:java}
OrderDate | Region | Rep | Item | Units | Unit Cost | Total | Bins | Frequency | Intervals | Item Type | Frequency | Rel. Freq | % Freq.
9/1/14 | Central | Smith | Desk | 2 | 125.00 | 250.00 | | 9 | 6 | | 0-9 | | Pencil | 13 | 0.302325581395349 | 30.2%{code}
 
The resulting schema is 
{code:java}
{"type":"record","name":"nifiRecord","namespace":"org.apache.nifi","fields":[{"name":"OrderDate","type":[{"type":"int","logicalType":"date"},"null"]},{"name":"Region","type":["string","null"]},{"name":"Rep","type":["string","null"]},{"name":"Item","type":["string","null"]},{"name":"Units","type":["long","null"]},{"name":"Unit_Cost","type":["double","null"],"aliases":["Unit Cost"]},{"name":"Total","type":["double","null"]},{"name":"column_7","type":["string","null"]},{"name":"Bins","type":["long","null"]},{"name":"Frequency","type":["long","null"]},{"name":"Intervals","type":["string","null"]},{"name":"column_11","type":["string","null"]},{"name":"Item_Type","type":["string","null"],"aliases":["Item Type"]},{"name":"Rel__Freq","type":["double","null"],"aliases":["Rel. Freq"]},{"name":"__Freq_","type":["double","null"],"aliases":["% Freq."]}]}{code}
 
Note how there is only one field name with "Frequency". This actually causes 
data to be dropped and skewed
as seen below printing MapRecord out to the command line as the field Rel. Freq 
(which was renamed to Rel__Freq) has the original value which the second 
Frequency column had and not its value of 0.302325581395349.
In addition, the last value of % Freq. is not present.
{code:java}
MapRecord[{OrderDate=2014-09-01, Region=Central, Rep=Smith, Item=Desk, Units=2, 
Unit_Cost=125.0, Total=250.0, column_7=null, Bins=9, Frequency=6, 
Intervals=0-9, column_11=null, Item_Type=Pencil, Rel__Freq=13.0, 
__Freq_=0.3023255813953488}]{code}


> When Excel Reader is configured with ExcelHeaderSchemaStrategy and there are 
> duplicate column names, data is skewed and lost
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-14596
>                 URL: https://issues.apache.org/jira/browse/NIFI-14596
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Daniel Stieglitz
>            Assignee: Daniel Stieglitz
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
