[ https://issues.apache.org/jira/browse/DRILL-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victoria Markman closed DRILL-3265.
-----------------------------------

> Query with window function and sort below that spills to disk runs out of 
> memory
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-3265
>                 URL: https://issues.apache.org/jira/browse/DRILL-3265
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.0.0
>            Reporter: Victoria Markman
>            Assignee: Deneche A. Hakim
>              Labels: window_function
>             Fix For: 1.1.0
>
>         Attachments: drill-3265.log
>
>
> Failure:
> {code}
> 0: jdbc:drill:schema=dfs> select
> . . . . . . . . . . . . >   sum(ss_quantity) over(partition by ss_store_sk 
> order by ss_sold_date_sk)
> . . . . . . . . . . . . > from store_sales;
> java.lang.RuntimeException: java.sql.SQLException: RESOURCE ERROR: One or 
> more nodes ran out of memory while executing the query.
> Fragment 1:13
> [Error Id: 72609220-e431-41fc-8505-8f9740c96153 on atsqa4-133.qa.lab:31010]
>         at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73)
>         at 
> sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:85)
>         at sqlline.TableOutputFormat.print(TableOutputFormat.java:116)
>         at sqlline.SqlLine.print(SqlLine.java:1583)
>         at sqlline.Commands.execute(Commands.java:852)
>         at sqlline.Commands.sql(Commands.java:751)
>         at sqlline.SqlLine.dispatch(SqlLine.java:738)
>         at sqlline.SqlLine.begin(SqlLine.java:612)
>         at sqlline.SqlLine.start(SqlLine.java:366)
>         at sqlline.SqlLine.main(SqlLine.java:259)
> {code}
> The failure looks legitimate from a customer's point of view (granted, we 
> know that we can run out of memory in some cases).
> However, there are a couple of problems with this scenario:
> 1. It looks like we are running out of memory during the disk-based sort. 
> Note the numbers below: the allocator refuses a 20,833-byte buffer on top 
> of 19,989,451 bytes already allocated, so the fragment appears to be 
> capped at roughly 20 MB.
> {code}
> 2015-06-09 16:55:45,693 [2a88e633-b714-49a1-c36a-509a9c817c77:frag:1:19] WARN 
>  o.a.d.e.p.i.xsort.ExternalSortBatch - Starting to merge. 3311 batch groups. 
> Current allocated memory: 17456644
> 2015-06-09 16:55:45,763 [2a88e633-b714-49a1-c36a-509a9c817c77:frag:1:13] WARN 
>  o.a.d.exec.memory.BufferAllocator - Unable to allocate buffer of size 20833 
> due to memory limit. Current allocation: 19989451
> java.lang.Exception: null
>         at 
> org.apache.drill.exec.memory.TopLevelAllocator$ChildAllocator.buffer(TopLevelAllocator.java:253)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.TopLevelAllocator$ChildAllocator.buffer(TopLevelAllocator.java:273)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.vector.UInt1Vector.allocateNew(UInt1Vector.java:171) 
> [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.vector.NullableIntVector.allocateNew(NullableIntVector.java:204)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.vector.AllocationHelper.allocateNew(AllocationHelper.java:56)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen139.allocateVectors(PriorityQueueCopierTemplate.java:123)
>  [na:na]
>         at 
> org.apache.drill.exec.test.generated.PriorityQueueCopierGen139.next(PriorityQueueCopierTemplate.java:66)
>  [na:na]
>         at 
> org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:224)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:146)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:105)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:95)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:92)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:146)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:105)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:95)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.window.WindowFrameRecordBatch.innerNext(WindowFrameRecordBatch.java:123)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:146)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:105)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:95)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:129)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:146)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:83) 
> [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:95)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:73) 
> [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:259)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:253)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at java.security.AccessController.doPrivileged(Native Method) 
> [na:1.7.0_71]
>         at javax.security.auth.Subject.doAs(Subject.java:415) [na:1.7.0_71]
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
>  [hadoop-common-2.5.1-mapr-1503.jar:na]
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:253)
>  [drill-java-exec-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.1.0-SNAPSHOT-rebuffed.jar:1.1.0-SNAPSHOT]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_71]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_71]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> 2015-06-09 16:55:45,765 [2a88e633-b714-49a1-c36a-509a9c817c77:frag:1:13] INFO 
>  o.a.d.c.exceptions.UserException - User Error Occurred
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more 
> nodes ran out of memory while executing the query.
> {code}
> 2. The same sort running by itself and spilling to disk finishes successfully.
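> For reference, the standalone query (its plan is shown further below) is:
> {code}
> select ss_quantity, ss_store_sk, ss_sold_date_sk
> from store_sales
> order by ss_store_sk, ss_sold_date_sk;
> {code}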
> 3. There is an indication in drillbit.log that a memory leak has occurred: 
> the accountor stats below show that neither the private nor the shared 
> pool returned to its initial 10,000,000-byte value when the fragment closed.
> {code}
> 2015-06-09 16:55:46,053 [2a88e633-b714-49a1-c36a-509a9c817c77:frag:1:11] 
> ERROR o.a.d.c.exceptions.UserException - SYSTEM ERROR: 
> java.lang.IllegalStateException: Failure while closing accountor.  Expected 
> private and shared pools to be set to initial values.  However, one or more 
> were not.  Stats are
>         zone    init    allocated       delta
>         private 10000000        10636   9989364
>         shared  10000000        3114634 6885366.
> Fragment 1:11
> [Error Id: 607d4f33-dda8-4cd9-8489-e34e914737e9 on atsqa4-133.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> java.lang.IllegalStateException: Failure while closing accountor.  Expected 
> private and shared pools to be set to initial values.  However, one or more 
> were not.  Stats are
>         zone    init    allocated       delta
>         private 10000000        10636   9989364
>         shared  10000000        3114634 6885366.
> Fragment 1:11
> {code}   
> 4. Occasionally, when re-executing the same query after it failed, I see 
> an error about failing to create a spill directory, even though there is 
> enough disk space in my spill directory, which is configured to be on the 
> local file system (remember, the sort by itself runs fine). A sketch of 
> the relevant spill configuration follows the error below.
> {code}
> 0: jdbc:drill:schema=dfs> select 
> . . . . . . . . . . . . >   sum(ss_quantity) over(partition by ss_store_sk 
> order by ss_sold_date_sk) 
> . . . . . . . . . . . . > from store_sales;
> java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: 
> java.io.IOException: Mkdirs failed to create 
> /tmp/drill/spill/2a88e74d-df47-146d-0cf9-8773303a7f37/major_fragment_1/minor_fragment_0/operator_4
>  (exists=false, cwd=file:///opt/mapr/drill/drill-1.1.0/bin)
> {code}
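> For reference, a minimal sketch of the spill-related entries in 
> drill-override.conf (the option names exist in Drill 1.x; the values here 
> are assumptions matching the paths in the error above):
> {code}
> drill.exec: {
>   # local-filesystem spill location; must be creatable by the drillbit user
>   sort.external.spill.directories: [ "/tmp/drill/spill" ],
>   sort.external.spill.fs: "file:///"
> }
> {code}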
> Additional information:
> Plan for the query that runs out of memory:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select
> . . . . . . . . . . . . >   sum(ss_quantity) over(partition by ss_store_sk 
> order by ss_sold_date_sk)
> . . . . . . . . . . . . > from store_sales;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(EXPR$0=[CASE(>($3, 0), CAST($4):ANY, null)])
> 01-02          Window(window#0=[window(partition {1} order by [2] range 
> between UNBOUNDED PRECEDING and CURRENT ROW aggs [COUNT($0), $SUM0($0)])])
> 01-03            SelectionVectorRemover
> 01-04              Sort(sort0=[$1], sort1=[$2], dir0=[ASC], dir1=[ASC])
> 01-05                Project(ss_quantity=[$0], ss_store_sk=[$1], 
> ss_sold_date_sk=[$2])
> 01-06                  HashToRandomExchange(dist0=[[$1]])
> 02-01                    UnorderedMuxExchange
> 03-01                      Project(ss_quantity=[$0], ss_store_sk=[$1], 
> ss_sold_date_sk=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1))])
> 03-02                        Project(ss_quantity=[$2], ss_store_sk=[$1], 
> ss_sold_date_sk=[$0])
> 03-03                          Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:///tpcds100/parquet/store_sales]], 
> selectionRoot=/tpcds100/parquet/store_sales, numFiles=1, 
> columns=[`ss_quantity`, `ss_store_sk`, `ss_sold_date_sk`]]])
> {code}
> The same sort executes successfully when the window function is not 
> present (note that below the Window operator the two plans are essentially 
> the same, except that the standalone query hash-distributes on both sort 
> keys):
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select ss_quantity, ss_store_sk, 
> ss_sold_date_sk from store_sales order by ss_store_sk, ss_sold_date_sk;
> +------+------+
> | text | json |
> +------+------+
> | 00-00    Screen
> 00-01      Project(ss_quantity=[$0], ss_store_sk=[$1], ss_sold_date_sk=[$2])
> 00-02        SingleMergeExchange(sort0=[1 ASC], sort1=[2 ASC])
> 01-01          SelectionVectorRemover
> 01-02            Sort(sort0=[$1], sort1=[$2], dir0=[ASC], dir1=[ASC])
> 01-03              Project(ss_quantity=[$0], ss_store_sk=[$1], 
> ss_sold_date_sk=[$2])
> 01-04                HashToRandomExchange(dist0=[[$1]], dist1=[[$2]])
> 02-01                  UnorderedMuxExchange
> 03-01                    Project(ss_quantity=[$0], ss_store_sk=[$1], 
> ss_sold_date_sk=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($2, 
> hash64AsDouble($1)))])
> 03-02                      Project(ss_quantity=[$2], ss_store_sk=[$1], 
> ss_sold_date_sk=[$0])
> 03-03                        Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:///tpcds100/parquet/store_sales]], 
> selectionRoot=/tpcds100/parquet/store_sales, numFiles=1, 
> columns=[`ss_quantity`, `ss_store_sk`, `ss_sold_date_sk`]]])
> {code}
> To reproduce:
> 1. Single drillbit
> 2. Configuration:
>         DRILL_MAX_DIRECT_MEMORY="8G"
>         DRILL_HEAP="4G" 
> 3. Dataset: tpcds100 parquet
> 4. Run query:
> {code}
> select 
>   sum(ss_quantity) over(partition by ss_store_sk order by ss_sold_date_sk) 
> from store_sales;
> {code}
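> A possible mitigation to try (an assumption on my part, not verified 
> against this dataset): raise the per-query memory limit via the standard 
> session option, so the sort gets more headroom than the ~20 MB limit seen 
> in the log above:
> {code}
> alter session set `planner.memory.max_query_memory_per_node` = 4294967296;
> {code}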



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
