markap14 commented on a change in pull request #4282:
URL: https://github.com/apache/nifi/pull/4282#discussion_r428138798
##########
File path:
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/queryrecord/FlowFileTable.java
##########
@@ -223,12 +225,69 @@ private RelDataType getRelDataType(final DataType
fieldType, final JavaTypeFacto
case BIGINT:
return typeFactory.createJavaType(BigInteger.class);
case CHOICE:
+ final ChoiceDataType choiceDataType = (ChoiceDataType)
fieldType;
+ DataType widestDataType =
choiceDataType.getPossibleSubTypes().get(0);
+ for (final DataType possibleType :
choiceDataType.getPossibleSubTypes()) {
+ if (possibleType == widestDataType) {
+ continue;
+ }
+ if
(possibleType.getFieldType().isWiderThan(widestDataType.getFieldType())) {
+ widestDataType = possibleType;
+ continue;
+ }
+ if
(widestDataType.getFieldType().isWiderThan(possibleType.getFieldType())) {
+ continue;
+ }
+
+ // Neither is wider than the other.
+ widestDataType = null;
+ break;
+ }
+
+ // If one of the CHOICE data types is the widest, use it.
+ if (widestDataType != null) {
+ return getRelDataType(widestDataType, typeFactory);
+ }
+
+ // None of the data types is strictly the widest. Check if all
data types are numeric.
+ // This would happen, for instance, if the data type is a
choice between float and integer.
+ // If that is the case, we can use a String type for the table
schema because all values will fit
+ // into a String. This will still allow for casting, etc. if
the query requires it.
+ boolean allNumeric = true;
+ for (final DataType possibleType :
choiceDataType.getPossibleSubTypes()) {
+ if (!isNumeric(possibleType)) {
+ allNumeric = false;
+ break;
+ }
+ }
+
+ if (allNumeric) {
Review comment:
@pcgrenier I do see your argument. But I think I disagree. Let's say
that we have the following CSV then:
```
name, other
markap14, 48
pcgrenier, 19
```
And the data's schema indicates that "other" is a CHOICE between STRING or
INT. Fundamentally, it comes down to one question: in this case, should we
allow the user to run a query like `SELECT SUM(other) as total FROM FLOWFILE`.
Your argument is yes, because I happen to know that for this particular
FlowFile it will succeed.
My argument, though, is that we should not allow it. Yes, it would succeed
for this FlowFile, but what will happen is that it will fail for other
FlowFiles. Other FlowFiles that have the same schema and other FlowFiles for
which the data is exactly correct according to the schema. This makes things
more confusing because some FlowFiles succeed and others fail. All the while,
the schema clearly indicates that the data may be a STRING, and you shouldn't
really be able to sum together STRING data. If you do sort out all the number
values, it makes sense to use a schema that reflects that. Otherwise, if we
allow summing together a CHOICE[int, string] we're not really honoring the
schema, in my point of view.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]