[GitHub] [nifi] markap14 commented on a change in pull request #4282: NIFI-7462: Update to allow FlowFile Table's schema to be more intelligent when using CHOICE types

GitBox Wed, 20 May 2020 09:16:23 -0700


markap14 commented on a change in pull request #4282:
URL: https://github.com/apache/nifi/pull/4282#discussion_r428138798




##########
File path: 
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/queryrecord/FlowFileTable.java
##########
@@ -223,12 +225,69 @@ private RelDataType getRelDataType(final DataType 
fieldType, final JavaTypeFacto
             case BIGINT:
                 return typeFactory.createJavaType(BigInteger.class);
             case CHOICE:
+                final ChoiceDataType choiceDataType = (ChoiceDataType) 
fieldType;
+                DataType widestDataType = 
choiceDataType.getPossibleSubTypes().get(0);
+                for (final DataType possibleType : 
choiceDataType.getPossibleSubTypes()) {
+                    if (possibleType == widestDataType) {
+                        continue;
+                    }
+                    if 
(possibleType.getFieldType().isWiderThan(widestDataType.getFieldType())) {
+                        widestDataType = possibleType;
+                        continue;
+                    }
+                    if 
(widestDataType.getFieldType().isWiderThan(possibleType.getFieldType())) {
+                        continue;
+                    }
+
+                    // Neither is wider than the other.
+                    widestDataType = null;
+                    break;
+                }
+
+                // If one of the CHOICE data types is the widest, use it.
+                if (widestDataType != null) {
+                    return getRelDataType(widestDataType, typeFactory);
+                }
+
+                // None of the data types is strictly the widest. Check if all 
data types are numeric.
+                // This would happen, for instance, if the data type is a 
choice between float and integer.
+                // If that is the case, we can use a String type for the table 
schema because all values will fit
+                // into a String. This will still allow for casting, etc. if 
the query requires it.
+                boolean allNumeric = true;
+                for (final DataType possibleType : 
choiceDataType.getPossibleSubTypes()) {
+                    if (!isNumeric(possibleType)) {
+                        allNumeric = false;
+                        break;
+                    }
+                }
+
+                if (allNumeric) {

Review comment:
       @pcgrenier I do see your argument. But I think I disagree. Let's say 
that we have the following CSV then:
   ```
   name, other
   markap14, 48
   pcgrenier, 19
   ```
   And the data's schema indicates that "other" is a CHOICE between STRING or 
INT. Fundamentally, it comes down to one question: in this case, should we 
allow the user to run a query like `SELECT SUM(other) as total FROM FLOWFILE`.
   
   Your argument is yes, because I happen to know that for this particular 
FlowFile it will succeed.
   
   My argument, though, is that we should not allow it. Yes, it would succeed 
for this FlowFile, but what will happen is that it will fail for other 
FlowFiles. Other FlowFiles that have the same schema and other FlowFiles for 
which the data is exactly correct according to the schema. This makes things 
more confusing because some FlowFiles succeed and others fail. All the while, 
the schema clearly indicates that the data may be a STRING, and you shouldn't 
really be able to sum together STRING data. If you do sort out all the number 
values, it makes sense to use a schema that reflects that. Otherwise, if we 
allow summing together a CHOICE[int, string] we're not really honoring the 
schema, in my point of view.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [nifi] markap14 commented on a change in pull request #4282: NIFI-7462: Update to allow FlowFile Table's schema to be more intelligent when using CHOICE types

Reply via email to