[
https://issues.apache.org/jira/browse/DRILL-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Victoria Markman updated DRILL-2677:
------------------------------------
Fix Version/s: (was: 1.0.0)
1.1.0
> Query does not go beyond 4096 lines in small JSON files
> -------------------------------------------------------
>
> Key: DRILL-2677
> URL: https://issues.apache.org/jira/browse/DRILL-2677
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - JSON
> Environment: drill 0.8 official build
> Reporter: Alexander Reshetov
> Assignee: Jason Altekruse
> Fix For: 1.1.0
>
> Attachments: dataset_4095_and_1.json, dataset_4096_and_1.json,
> dataset_sample.json.gz.part-aa, dataset_sample.json.gz.part-ab,
> dataset_sample.json.gz.part-ac, dataset_sample.json.gz.part-ad,
> dataset_sample.json.gz.part-ae, dataset_sample.json.gz.part-af
>
>
> Hello,
> I'm trying to execute the following query:
> {code}
> select * from (select source.pck, source.`timestamp`,
> flatten(source.HostUpdateTypeNW.Transfers) as entry from
> dfs.`/mnt/data/dataset_4095_and_1.json` as source) as parsed;
> {code}
> It works as expected and I get this result:
> {code}
> +------------+------------+------------+
> | pck | timestamp | entry |
> +------------+------------+------------+
> | 3547 | 1419807470286356 |
> {"TransferingPurpose":"8","TransferingImpact":"88","TransferingKind":"8","TransferingTime":"888888888","PackageOrigSenderID":"8","TransferingID":"88888","TransitCN":"888","PackageChkPnt":"8888","PackageFullSize":"8","TransferingSessionID":"8","SubpackagesCounter":"8"}
> |
> +------------+------------+------------+
> 1 row selected (0.188 seconds)
> {code}
> This file contains 4095 identical copies of one JSON record, plus a
> different JSON record at the end (see attached file dataset_4095_and_1.json).
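The attachments aren't reproduced inline, but a file with this 4095-plus-1 structure can be generated with a short script along these lines (the record layout is a guess reconstructed from the sample result row above, not the actual attachment contents):

```python
# Hypothetical generator for test files shaped like the attachments.
# The field values are placeholders based on the sample result row in
# this report; the real attachments may differ.
import json

# "First type" record: Transfers is null (not an array).
FIRST = {"pck": 3547, "timestamp": 1419807470286356,
         "HostUpdateTypeNW": {"Transfers": None}}
# "Second type" record: Transfers is the array that flatten() consumes.
LAST = {"pck": 3547, "timestamp": 1419807470286356,
        "HostUpdateTypeNW": {"Transfers": [
            {"TransferingPurpose": "8", "TransferingKind": "8",
             "TransferingID": "88888"}]}}

def make_dataset(path, n_repeats):
    """Write n_repeats copies of the first record type, followed by a
    single record of the second type."""
    with open(path, "w") as f:
        for _ in range(n_repeats):
            f.write(json.dumps(FIRST) + "\n")
        f.write(json.dumps(LAST) + "\n")

# 4095 repeats queries fine; 4096 repeats reproduces the exception below.
make_dataset("dataset_4095_and_1.json", 4095)
make_dataset("dataset_4096_and_1.json", 4096)
```

Pointing the queries above at the two generated files should show the success/failure split at the 4096-record boundary, assuming the guessed layout matches.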
> The problem is that when the first record repeats more than 4095 times, the
> query throws an exception. Here is the same query against a file with 4096
> records of the first type plus 1 record of the other (see attached file
> dataset_4096_and_1.json):
> {code}
> select * from (select source.pck, source.`timestamp`,
> flatten(source.HostUpdateTypeNW.Transfers) as entry from
> dfs.`/mnt/data/dataset_4096_and_1.json` as source) as parsed;
> Exception in thread "2ae108ff-b7ea-8f07-054e-84875815d856:frag:0:0" java.lang.RuntimeException: Error closing fragment context.
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:224)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:187)
>     at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector
>     at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.getFlattenFieldTransferPair(FlattenRecordBatch.java:274)
>     at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.setupNewSchema(FlattenRecordBatch.java:296)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:78)
>     at org.apache.drill.exec.physical.impl.flatten.FlattenRecordBatch.innerNext(FlattenRecordBatch.java:122)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>     at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:99)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:89)
>     at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:142)
>     at org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:118)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:68)
>     at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:96)
>     at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:58)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:163)
>     ... 4 more
> Query failed: RemoteRpcException: Failure while running fragment., org.apache.drill.exec.vector.NullableIntVector cannot be cast to org.apache.drill.exec.vector.RepeatedVector [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ] [ cb6c7914-438f-440a-9c74-fe39130feca9 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> {code}
> It appears that Drill stops inferring the schema after exactly 4096 records,
> which is why my query fails.
> I assume the same behavior causes another issue, the one I was originally
> investigating. It shows up on large files: perhaps Drill splits the file into
> smaller chunks, and one of them contains a similar run of records (4096 of
> the same type from Drill's point of view), which aborts the query with a
> different exception. The large file is attached as dataset_sample.json.gz.
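> A minimal sketch of the suspected schema flip, assuming the first record
> type carries a null Transfers field and that Drill infers types per
> 4096-record batch (the real attachments may differ):
> {code}
> records 1..4096 -- Transfers is null, so the first batch types it as a
> nullable scalar (NullableIntVector):
> {"pck": 3547, "HostUpdateTypeNW": {"Transfers": null}}
>
> record 4097 -- Transfers is an array, which flatten() expects to be a
> RepeatedVector, hence the ClassCastException:
> {"pck": 3547, "HostUpdateTypeNW": {"Transfers": [{"TransferingID": "88888"}]}}
> {code}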
> Here is the view (dataset_sample.view.drill) I use to query the large file:
> {code}
> {
>   "name" : "dataset_sample",
>   "sql" : "SELECT `Message`.`timestamp`, `flatten`(`Message`.`HostUpdateTypeCR`['Transfers']) AS `entries`\nFROM `dfs`.`/mnt/data/dataset_sample.json.gz` AS `Message`",
>   "fields" : [ {
>     "name" : "timestamp",
>     "type" : "ANY"
>   }, {
>     "name" : "transfers",
>     "type" : "ANY"
>   } ],
>   "workspaceSchemaPath" : [ "dfs", "mnt" ]
> }
> {code}
> And here is the query I'm trying to execute:
> {code}
> 0: jdbc:drill:zk=local> create table dataset_tbl as
> . . . . . . . . . . . > select dataset_sample.transfers.TransferingID as id,
> dataset_sample.transfers.TransferingKind as type from dataset_sample;
> Query failed: Query stopped., index: 9502, length: 1 (expected: range(0, 1024)) [ c5eac3ee-0266-4645-b6b5-2a1b58df4821 on testlab-broker:31010 ]
> Error: exception while executing query: Failure while executing query. (state=,code=0)
> 0: jdbc:drill:zk=local> Exception in thread "WorkManager-19" java.lang.IllegalStateException
>     at com.google.common.base.Preconditions.checkState(Preconditions.java:133)
>     at org.apache.drill.common.DeferredException.addException(DeferredException.java:47)
>     at org.apache.drill.common.DeferredException.addThrowable(DeferredException.java:61)
>     at org.apache.drill.exec.ops.FragmentContext.fail(FragmentContext.java:133)
>     at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:181)
>     at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Please let me know if I should split this into two separate issues, or if
> you need any additional info.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)