[
https://issues.apache.org/jira/browse/DRILL-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073321#comment-17073321
]
Paul Rogers commented on DRILL-7675:
------------------------------------
The next step is to force more parallelism as in the ticket description:
{code:sql}
ALTER SESSION SET `planner.slice_target` = 25
{code}
The meaning of this option is not especially clear: it is the number of
(estimated) rows per operator at which the planner will parallelize that
operator up to the limit we set earlier. The default is 100K, which seems
overly large since Drill batches are on the order of several thousand rows.
(Perhaps another thing to fix.) Looking at the plan for this query, the planner
has guessed between 20K and 1M rows, depending on the operator.
The user reduced the number to 25, which pretty much forces even the smallest
operator to be parallelized.
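As a sketch, the rule described above might look like the following. This is a hypothetical illustration of the described behavior, not Drill's actual parallelizer code; the class, the {{widthFor}} helper, and its parameters are invented for illustration:
{code:java}
// Hypothetical sketch: derive an operator's parallelization width from
// planner.slice_target. Not Drill's actual implementation.
public class SliceTargetSketch {

  static int widthFor(long estimatedRows, long sliceTarget, int maxWidthPerNode) {
    // Roughly one slice per sliceTarget estimated rows,
    // capped by the per-node width limit.
    long slices = Math.max(1, estimatedRows / sliceTarget);
    return (int) Math.min(slices, maxWidthPerNode);
  }

  public static void main(String[] args) {
    // Default slice target of 100,000 rows: a 20,000-row operator stays serial.
    System.out.println(widthFor(20_000, 100_000, 16)); // 1
    // Dropping the target to 25 parallelizes even tiny operators to the cap.
    System.out.println(widthFor(20_000, 25, 16));      // 16
  }
}
{code}
Under this model, the user's setting of 25 makes nearly every operator hit the {{planner.width.max_per_node}} cap, which is consistent with what we see in the plans.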
Let's clear the max per node so we use the default, then change the slice
target to 1000. We now get 6 scan threads (on an 8-core CPU) and a run time of
3.2 seconds. So, with data on a local SSD, there is no benefit from the
additional parallelism. (Perhaps S3 sees a different result.) However, memory
use has increased: 520 MB * 6 = ~3 GB for the hash partition sender. (Remember,
we only have 2 MB of data.) So this is clearly another problem, and it suggests
we'll have trouble as we increase parallelism.
Dropping the slice target to 25 produces the same result as a target of 1000
rows.
Increasing max per node (parallelism) to 8 raises the run time to 4 seconds.
Memory use increases to 550 MB for each of the 8 partition senders.
At a parallelism of 10 we see the expected out-of-memory error:
{noformat}
RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
{noformat}
The Web UI does not show the stack trace. When run in the debugger:
{noformat}
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
null
Fragment: 3:1
...
Caused by: org.apache.drill.exec.exception.OutOfMemoryException:
  at org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
  at org.apache.drill.exec.test.generated.PartitionerGen64$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.java:380)
  at org.apache.drill.exec.test.generated.PartitionerGen64$OutgoingRecordBatch.initializeBatch(PartitionerTemplate.java:400)
  at org.apache.drill.exec.test.generated.PartitionerGen64.setup(PartitionerTemplate.java:126)
  at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createClassInstances(PartitionSenderRootExec.java:263)
  at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createPartitioner(PartitionSenderRootExec.java:218)
  at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:188)
  at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93)
...
{noformat}
This is indeed what we expected: the hash partition sender is using vastly too
much memory for this simple query.
Digging deeper, the partition sender allocates memory for:
* 10 sender threads
* one outgoing batch per receiver: 10 receivers when parallelism is 10
* 1023 records per batch
* ~34 fields per record (some in arrays, so multiply by ~5 at each level)
* 8 - 50 bytes per field
This gives approximately:
* ~1500 bytes/record
* ~1.5 MB/batch (1,476,608 actual value from debugger)
* 15 MB/sender
* 150 MB total
It is unclear, however, why each sender ends up allocating 677,561,600 bytes
(645 MB, per instrumented code).
Digging even deeper, the code completely ignores the requested batch size of
1023 records and instead allocates batches of 4096 records: 150 MB (for 1K
records) * 4 = 600 MB (for 4K records). Fixing this bug solves the problem.
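The arithmetic above can be sketched as a quick back-of-the-envelope model. The constants come from the debugger observations above; the class and method names are invented for illustration and are not Drill's allocation code:
{code:java}
// Hypothetical model of the partition sender's memory footprint.
// Constants are taken from the debugging notes above.
public class PartitionSenderMemoryEstimate {

  static long bytesPerSender(int rowsPerBatch, int receivers, int bytesPerRecord) {
    // One outgoing batch is buffered per receiver.
    long bytesPerBatch = (long) rowsPerBatch * bytesPerRecord;
    return bytesPerBatch * receivers;
  }

  public static void main(String[] args) {
    int bytesPerRecord = 1500; // ~34 fields, arrays multiply by ~5 per level
    int receivers = 10;        // parallelism of 10
    int senders = 10;

    long requested = bytesPerSender(1023, receivers, bytesPerRecord) * senders;
    long actual    = bytesPerSender(4096, receivers, bytesPerRecord) * senders;

    // ~150 MB total at the requested 1023-row batch size.
    System.out.println(requested / (1 << 20) + " MB"); // 146 MB
    // ~600 MB total at the 4096-row batches the code actually allocates.
    System.out.println(actual / (1 << 20) + " MB");    // 585 MB
  }
}
{code}
The factor-of-four gap between the two totals matches the difference between the 150 MB estimate and the ~600 MB observed per run.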
> Very slow performance and Memory exhaustion while querying on very small
> dataset of parquet files
> -------------------------------------------------------------------------------------------------
>
> Key: DRILL-7675
> URL: https://issues.apache.org/jira/browse/DRILL-7675
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization, Storage - Parquet
> Affects Versions: 1.18.0
> Environment: [^sample-dataset.zip]
> Reporter: Idan Sheinberg
> Assignee: Paul Rogers
> Priority: Critical
> Attachments: sample-dataset.zip
>
>
> Per our discussion in Slack/Dev-list Here are all details and sample data-set
> to recreate problematic query behavior:
> * We are using Drill 1.18.0-SNAPSHOT built on March 6
> * We are joining on two small Parquet datasets residing on S3 using the
> following query:
> {code:sql}
> SELECT
>   CASE
>     WHEN tbl1.`timestamp` IS NULL THEN tbl2.`timestamp`
>     ELSE tbl1.`timestamp`
>   END AS ts, *
> FROM `s3-store.state`.`/164` AS tbl1
> FULL OUTER JOIN `s3-store.result`.`/164` AS tbl2
> ON tbl1.`timestamp`*10 = tbl2.`timestamp`
> ORDER BY ts ASC
> LIMIT 500 OFFSET 0 ROWS
> {code}
> * We are running drill in a single node setup on a 16 core, 64GB ram
> machine. Drill heap size is set to 16GB, while max direct memory is set to
> 32GB.
> * As the dataset consists of really small files, Drill has been tweaked to
> parallelize on a small item count by tweaking the following variables:
> {code:java}
> planner.slice_target = 25
> planner.width.max_per_node = 16 (to match the core count){code}
> * Without the above parallelization, query speeds on Parquet files are super
> slow (tens of seconds)
> * While queries do work, we are seeing non-proportional direct memory/heap
> utilization (up to 20 GB of direct memory used, a minimum of 12 GB of heap
> required)
> * We're still encountering the occasional out-of-memory error (we're also
> seeing heap exhaustion, but I guess that's another indication of the same
> problem). Reducing the node parallelization width to, say, 8 reduces memory
> contention, though it still reaches 8 GB of direct memory
> {code:java}
> User Error Occurred: One or more nodes ran out of memory while executing the query. (null)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.null[Error Id: 67b61fc9-320f-47a1-8718-813843a10ecc ]
>   at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:657)
>   at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:338)
>   at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
>   at org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
>   at org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.java:380)
>   at org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.initializeBatch(PartitionerTemplate.java:400)
>   at org.apache.drill.exec.test.generated.PartitionerGen5.setup(PartitionerTemplate.java:126)
>   at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createClassInstances(PartitionSenderRootExec.java:263)
>   at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createPartitioner(PartitionSenderRootExec.java:218)
>   at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:188)
>   at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93)
>   at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:323)
>   at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:310)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:310)
> ... 4 common frames omitted{code}
> I've attached a (real!) sample dataset to match the query above. That same
> dataset recreates the aforementioned memory behavior.
> Help, please.
> Idan
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)