Tamas Palfy created NIFI-6640:
---------------------------------

             Summary: UNION/CHOICE types not handled correctly
                 Key: NIFI-6640
                 URL: https://issues.apache.org/jira/browse/NIFI-6640
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
            Reporter: Tamas Palfy
            Assignee: Tamas Palfy


When reading the following CSV:
{code}
Id|Value
1|3
2|3.75
3|3.85
4|8
5|2.0
6|4.0
7|some_string
{code}
and passing it through a {{ConvertRecord}} processor, the following 
exception is thrown:
{code}
2019-09-06 18:25:48,936 ERROR [Timer-Driven Process Thread-2] o.a.n.processors.standard.ConvertRecord ConvertRecord[id=07635c71-016d-1000-3847-ff916164b32a] Failed to process StandardFlowFileRecord[uuid=4b4ab01a-b349-4f83-9b25-6a58d0b297c1,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1567786888281-1, container=default, section=1], offset=326669, length=56],offset=0,name=4b4ab01a-b349-4f83-9b25-6a58d0b297c1,size=56]; will route to failure: org.apache.nifi.processor.exception.ProcessException: Could not parse incoming data
org.apache.nifi.processor.exception.ProcessException: Could not parse incoming data
        at org.apache.nifi.processors.standard.AbstractRecordProcessor$1.process(AbstractRecordProcessor.java:170)
        at org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2925)
        at org.apache.nifi.processors.standard.AbstractRecordProcessor.onTrigger(AbstractRecordProcessor.java:122)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:205)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.nifi.serialization.MalformedRecordException: Error while getting next record. Root cause: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [some_string] of type class java.lang.String for field Value to any of the following available Sub-Types for a Choice: [FLOAT, INT]
        at org.apache.nifi.csv.CSVRecordReader.nextRecord(CSVRecordReader.java:119)
        at org.apache.nifi.serialization.RecordReader.nextRecord(RecordReader.java:50)
        at org.apache.nifi.processors.standard.AbstractRecordProcessor$1.process(AbstractRecordProcessor.java:156)
        ... 13 common frames omitted
Caused by: org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot convert value [some_string] of type class java.lang.String for field Value to any of the following available Sub-Types for a Choice: [FLOAT, INT]
        at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:166)
        at org.apache.nifi.serialization.record.util.DataTypeUtils.convertType(DataTypeUtils.java:116)
        at org.apache.nifi.csv.AbstractCSVRecordReader.convert(AbstractCSVRecordReader.java:86)
        at org.apache.nifi.csv.CSVRecordReader.nextRecord(CSVRecordReader.java:105)
        ... 15 common frames omitted
{code}

The problem is that {{FieldTypeInference}} keeps both a list of 
{{possibleDataTypes}} and a {{singleDataType}}. As long as an added dataType 
is not in a "wider" relationship with the previous types, it is added to 
{{possibleDataTypes}}. Once a "wider" type is added, however, it is set as 
the {{singleDataType}}, while {{possibleDataTypes}} is left intact.

However, when the actual dataType is determined, {{possibleDataTypes}} is 
used whenever it is not null, and the {{singleDataType}} is _ignored_.

So in our example, a {{FieldTypeInference}} is created with (FLOAT, INT) as 
{{possibleDataTypes}} and STRING as {{singleDataType}}. FLOAT or INT is then 
chosen, and an attempt is made to convert "some_string" to a float or an 
integer, which fails.
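
The behaviour described above can be reproduced with a small stand-alone model. This is only a sketch and not the NiFi source: the class {{InferenceSketch}}, its method names, the three-value {{Type}} enum and the rule that only STRING is "wider" than the others are all assumptions made for illustration. The second resolve method shows one conceivable remedy (prefer the wider type once it has been seen), not necessarily the fix that should be applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the inference logic described in this issue.
class InferenceSketch {
    enum Type { INT, FLOAT, STRING }

    private final Set<Type> possibleDataTypes = new LinkedHashSet<>();
    private Type singleDataType; // only set when a "wider" type arrives

    // Assumption: STRING is wider than INT and FLOAT; INT and FLOAT are unrelated.
    static boolean isWider(Type candidate, Type other) {
        return candidate == Type.STRING && other != Type.STRING;
    }

    void addPossibleDataType(Type t) {
        boolean widerThanAll = !possibleDataTypes.isEmpty()
                && possibleDataTypes.stream().allMatch(p -> isWider(t, p));
        if (widerThanAll) {
            singleDataType = t;       // possibleDataTypes is left intact
        } else {
            possibleDataTypes.add(t);
        }
    }

    // As reported: the list wins whenever it has entries, singleDataType is ignored.
    List<Type> resolveAsReported() {
        if (!possibleDataTypes.isEmpty()) {
            return new ArrayList<>(possibleDataTypes);
        }
        return singleDataType == null
                ? Collections.<Type>emptyList()
                : Collections.singletonList(singleDataType);
    }

    // One possible remedy (an assumption): a wider type overrides the list.
    List<Type> resolvePreferringWider() {
        if (singleDataType != null) {
            return Collections.singletonList(singleDataType);
        }
        return new ArrayList<>(possibleDataTypes);
    }

    public static void main(String[] args) {
        InferenceSketch value = new InferenceSketch();
        value.addPossibleDataType(Type.INT);    // e.g. "3"
        value.addPossibleDataType(Type.FLOAT);  // e.g. "3.75"
        value.addPossibleDataType(Type.STRING); // "some_string"
        System.out.println(value.resolveAsReported());      // [INT, FLOAT]
        System.out.println(value.resolvePreferringWider()); // [STRING]
    }
}
```

With the reported resolution, STRING never reaches the schema, which matches the {{[FLOAT, INT]}} choice shown in the exception.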
----
There is also an issue with the handling of multiple datatypes when _writing_ 
data.
When multiple datatypes are possible, a so-called CHOICE datatype is assigned 
in the inferred schema. It holds the possible datatypes in a list.
However, most (if not all) of the time, when a concrete datatype is chosen for 
a given value while writing it (tested with the JSON and Avro writers), the 
first matching type in the list is selected. In the current implementation, 
every number type matches every number, so 3.75 may be written as an INT, 
resulting in data loss.

The problem is that the type list is not in any particular order _and_ the 
first matching type is chosen.
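
The order dependence can be illustrated with a minimal sketch. Again, this is not NiFi code: {{ChoiceSketch}}, {{looselyCompatible}}, {{firstMatch}} and {{write}} are hypothetical names, and the matching rule simply encodes the statement above that every number type matches every number.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "pick the first matching sub-type of a CHOICE".
class ChoiceSketch {
    enum Type { INT, FLOAT, STRING }

    // Assumption mirroring the report: any numeric type matches any number.
    static boolean looselyCompatible(Object value, Type type) {
        if (value instanceof Number) {
            return type == Type.INT || type == Type.FLOAT;
        }
        return type == Type.STRING;
    }

    // Returns the first sub-type that matches, or null if none does.
    static Type firstMatch(Object value, List<Type> choice) {
        for (Type type : choice) {
            if (looselyCompatible(value, type)) {
                return type;
            }
        }
        return null;
    }

    // Converts the value to the chosen type; INT truncates the fraction.
    static Object write(Object value, Type type) {
        switch (type) {
            case INT:
                return ((Number) value).intValue();
            case FLOAT:
                return ((Number) value).floatValue();
            default:
                return String.valueOf(value);
        }
    }

    public static void main(String[] args) {
        List<Type> intFirst = Arrays.asList(Type.INT, Type.FLOAT);
        List<Type> floatFirst = Arrays.asList(Type.FLOAT, Type.INT);
        System.out.println(write(3.75, firstMatch(3.75, intFirst)));   // 3
        System.out.println(write(3.75, firstMatch(3.75, floatFirst))); // 3.75
    }
}
```

The same value is written losslessly or truncated depending only on the order of the list, so either the list needs a defined order or the match needs to be stricter.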





--
This message was sent by Atlassian Jira
(v8.3.2#803003)
