[jira] [Commented] (DRILL-3562) Query fails when using flatten on JSON data where some documents have an empty array

ASF GitHub Bot (JIRA) Thu, 05 Jan 2017 09:42:16 -0800

    [ https://issues.apache.org/jira/browse/DRILL-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801992#comment-15801992 ]


ASF GitHub Bot commented on DRILL-3562:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/713#discussion_r94814315
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java ---
    @@ -305,12 +306,23 @@ protected boolean setupNewSchema() throws SchemaChangeException {
     
         final NamedExpression flattenExpr = new NamedExpression(popConfig.getColumn(), new FieldReference(popConfig.getColumn()));
         final ValueVectorReadExpression vectorRead = (ValueVectorReadExpression)ExpressionTreeMaterializer.materialize(flattenExpr.getExpr(), incoming, collector, context.getFunctionRegistry(), true);
    -    final TransferPair tp = getFlattenFieldTransferPair(flattenExpr.getRef());
    -
    -    if (tp != null) {
    -      transfers.add(tp);
    -      container.add(tp.getTo());
    -      transferFieldIds.add(vectorRead.getFieldId().getFieldIds()[0]);
    +    final FieldReference fieldReference = flattenExpr.getRef();
    +    final TransferPair transferPair = getFlattenFieldTransferPair(fieldReference);
    +
    +    if (transferPair != null) {
    +      final ValueVector flattenVector = transferPair.getTo();
    +
    +      // checks that list has only default ValueVector and replaces resulting ValueVector to INT typed ValueVector
    +      if (exprs.size() == 0 && flattenVector.getField().getType().equals(Types.LATE_BIND_TYPE)) {
    +        final MaterializedField outputField = MaterializedField.create(fieldReference.getAsNamePart().getName(), Types.OPTIONAL_INT);
    +        final ValueVector vector = TypeHelper.getNewVector(outputField, oContext.getAllocator());
    --- End diff --
    
    The fix appears to be to transform an empty list into an empty list of 
integers. That is, Drill does not have the concept of "empty list", only "empty 
list of type X" and we are guessing the type to be integer.
    
    We've had issues elsewhere in the product where such guesses turn out to be 
wrong. Perhaps the next row/batch has a non-empty list, but of strings. Or 
worse, of objects (maps). Downstream operators cannot handle this.
    
    The result is that a query fails for no better reason than we caused it to 
fail by guessing the wrong type.
    
    Clearly, fixing the broader problem is beyond the scope of this fix. I am 
pointing out, however, that a consequence of the assumption made here is that 
some queries, somewhere later, will fail due to an artificial schema change.
    
    The correct solution is to introduce an "Unknown" type and mark this as a 
vector of type "Unknown". All we know is that it is a list; the member types 
are unknown. Then, in downstream operators, when we encounter a schema change, 
we know that an empty list of "Unknown" type is compatible with a list of any 
other type (say, maps).
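    
    To make the compatibility rule concrete, here is a minimal, standalone 
sketch of how a downstream operator might reconcile list element types if an 
"Unknown" type existed. Everything in it is hypothetical: `ElementType`, 
`merge`, and the class name are illustrative and are not part of Drill's API.

```java
import java.util.Optional;

public class UnknownListTypeSketch {

  /** Hypothetical element types. UNKNOWN means "empty list, element type not yet seen". */
  enum ElementType { UNKNOWN, INT, VARCHAR, MAP }

  /**
   * Reconciles the element type seen so far with the one in the next batch.
   * UNKNOWN yields to any concrete type; two different concrete types are a
   * genuine schema change, reported as an empty Optional.
   */
  static Optional<ElementType> merge(ElementType current, ElementType incoming) {
    if (current == ElementType.UNKNOWN) {
      return Optional.of(incoming);
    }
    if (incoming == ElementType.UNKNOWN || incoming == current) {
      return Optional.of(current);
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Empty list first, then a list of maps: no conflict.
    System.out.println(merge(ElementType.UNKNOWN, ElementType.MAP)); // Optional[MAP]
    // Empty list guessed as INT, then a list of maps: artificial conflict.
    System.out.println(merge(ElementType.INT, ElementType.MAP));     // Optional.empty
  }
}
```

    The second call mirrors the concern above: once the empty list has been 
guessed to be INT, a later batch of maps looks like a schema change even though 
the data never contained an integer.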


> Query fails when using flatten on JSON data where some documents have an 
> empty array
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-3562
>                 URL: https://issues.apache.org/jira/browse/DRILL-3562
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.1.0
>            Reporter: Philip Deegan
>            Assignee: Serhii Harnyk
>              Labels: ready-to-commit
>             Fix For: Future
>
>
> Drill query fails when using flatten when some records contain an empty array 
> {noformat}
> SELECT COUNT(*) FROM (SELECT FLATTEN(t.a.b.c) AS c FROM dfs.`flat.json` t) 
> flat WHERE flat.c.d.e = 'f' limit 1;
> {noformat}
> Succeeds on 
> { "a": { "b": { "c": [  { "d": {  "e": "f" } } ] } } }
> Fails on
> { "a": { "b": { "c": [] } } }
> Error
> {noformat}
> Error: SYSTEM ERROR: ClassCastException: Cannot cast 
> org.apache.drill.exec.vector.NullableIntVector to 
> org.apache.drill.exec.vector.complex.RepeatedValueVector
> {noformat}
> Is it possible to ignore the empty arrays, or do they need to be populated 
> with dummy data?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
