[
https://issues.apache.org/jira/browse/DRILL-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074024#comment-17074024
]
ASF GitHub Bot commented on DRILL-7675:
---------------------------------------
paul-rogers commented on pull request #2047: DRILL-7675: Workaround for
partition sender memory use
URL: https://github.com/apache/drill/pull/2047
# [DRILL-7675](https://issues.apache.org/jira/browse/DRILL-7675): Workaround
for partition sender memory use
## Description
DRILL-7675 describes a combination of factors that exposed a flaw in the
partition sender:
* Each of the n partition senders holds one buffer for each of the n receivers,
resulting in n^2 buffers total in the system; all on a single machine for a
one-node Drill.
* Every buffer holds 1024 rows.
* The size of each row depends on the row shape. In DRILL-7675, one table
has 250+ columns, some nested within repeated maps. Since each column needs a
vector of 1024 values (or 5 * 1024, or even 5 * 5 * 1024 for columns nested one
or two repeated-map levels deep), the total memory per buffer is large.
The result is that Drill attempts to allocate many GB of buffers, even though
the actual data set is only 2 MB in size.
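As a rough, hedged back-of-the-envelope sketch of why the allocation explodes
(the 8-byte average value width is an assumption; the slice, row, and column
counts come from the settings and table shape reported in DRILL-7675):
```java
// Back-of-the-envelope estimate only; this is not Drill code.
public class PartitionSenderMemoryEstimate {
  public static void main(String[] args) {
    int senders = 16;           // slices, from planner.width.max_per_node = 16
    int receivers = 16;         // each sender holds one buffer per receiver
    int rowsPerBuffer = 1024;   // current fixed send-buffer row count
    int columns = 250;          // wide table from DRILL-7675
    int bytesPerValue = 8;      // assumed average width per value

    long buffers = (long) senders * receivers;          // n^2 = 256 buffers on one node
    long bytesPerRow = (long) columns * bytesPerValue;  // ~2 KB per row
    long totalBytes = buffers * rowsPerBuffer * bytesPerRow;

    // ~256 buffers * 1024 rows * ~2 KB/row is roughly 0.5 GB, before the
    // repeated-map fan-out (5x or 25x per nested column) pushes it into the GBs.
    System.out.printf("%d buffers, ~%.2f GB total%n",
        buffers, totalBytes / (1024.0 * 1024 * 1024));
  }
}
```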
DRILL-7686 describes the needed longer-term redesign. This PR includes a
workaround: the ability to reduce the number of rows per send buffer as the
receiver count increases. See Documentation below.
With the new option enabled, the query now runs in the configuration that the
user describes in DRILL-7675. The cost, however, is slower performance, which
is exactly what the user was trying to avoid by raising parallelism in the
first place. The best workaround in this case (at least with local files) is to
go with the default parallelism.
The PR also includes a number of cleanup and diagnostic fixes found during the
investigation.
## Documentation
Adds a new system/session option that shrinks the send-buffer size as the slice
count grows beyond a configurable threshold:
`exec.partition.mem_throttle`:
* The default is 0, which leaves the current logic unchanged.
* If set to a positive value, then when the slice count exceeds that amount,
the buffer size per sender is reduced.
* The reduction factor is 1 / (slice count - threshold), with a minimum
batch size of 256 records.
So, if we set the threshold to 2 and run 10 slices, the scaled size is
1024 / 8 = 128 rows, which falls below the 256-record minimum, so each sender
buffer ends up at 256 records.
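The rule above can be restated as a minimal sketch (this is the described
behavior in code, not the PR's actual implementation):
```java
// Minimal sketch of the scaling rule described above; not the PR's actual code.
public class SendBufferSizing {
  static final int DEFAULT_ROWS = 1024;  // current per-receiver buffer size
  static final int MIN_ROWS = 256;       // minimum batch size described above

  static int sendBufferRows(int sliceCount, int threshold) {
    if (threshold <= 0 || sliceCount <= threshold) {
      return DEFAULT_ROWS;               // option disabled, or below the threshold
    }
    int scaled = DEFAULT_ROWS / (sliceCount - threshold);
    return Math.max(scaled, MIN_ROWS);
  }

  public static void main(String[] args) {
    // Threshold 2, 10 slices: 1024 / 8 = 128, clamped to the 256-row minimum.
    System.out.println(sendBufferRows(10, 2));  // prints 256
  }
}
```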
This option controls memory, but at the obvious cost of increased overhead. One
could argue that this is a good thing: as the number of receivers increases,
the number of records going to each one decreases, which increases the time
that batches must accumulate before they are sent, so smaller buffers fill (and
are sent) sooner.
If the option is enabled, and buffer size reduction kicks in, you'll find an
info-level log message which details the reduction:
```
exec.partition.mem_throttle is set to 2: 10 receivers, reduced send buffer
size from 1024 to 256 rows
```
## Testing
Created an ad-hoc test using the query from DRILL-7675. Ran this test with a
variety of options, including with the new option enabled and disabled. See
DRILL-7675 for a full description of the analysis.
Ran the query from DRILL-7675 in the Drill server using the Web console with
the new option on and off (along with other variations). Verified that, with
the option off, performance is the same before and after the change (3 sec on
my machine). Verified that, with the option on, the query completes even with
excessive parallelism (though the query does run slower in that case).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Very slow performance and Memory exhaustion while querying on very small
> dataset of parquet files
> -------------------------------------------------------------------------------------------------
>
> Key: DRILL-7675
> URL: https://issues.apache.org/jira/browse/DRILL-7675
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization, Storage - Parquet
> Affects Versions: 1.18.0
> Environment: [^sample-dataset.zip]
> Reporter: Idan Sheinberg
> Assignee: Paul Rogers
> Priority: Critical
> Attachments: sample-dataset.zip
>
>
> Per our discussion in Slack/the dev list, here are all the details and a
> sample dataset to recreate the problematic query behavior:
> * We are using Drill 1.18.0-SNAPSHOT built on March 6
> * We are joining two small Parquet datasets residing on S3 using the
> following query:
> {code:java}
> SELECT
>   CASE
>     WHEN tbl1.`timestamp` IS NULL THEN tbl2.`timestamp`
>     ELSE tbl1.`timestamp`
>   END AS ts, *
> FROM `s3-store.state`.`/164` AS tbl1
> FULL OUTER JOIN `s3-store.result`.`/164` AS tbl2
>   ON tbl1.`timestamp` * 10 = tbl2.`timestamp`
> ORDER BY ts ASC
> LIMIT 500 OFFSET 0 ROWS
> {code}
> * We are running Drill in a single-node setup on a 16-core, 64 GB RAM
> machine. The Drill heap size is set to 16 GB, while max direct memory is set
> to 32 GB.
> * As the dataset consists of really small files, Drill has been tuned to
> parallelize at small item counts by tweaking the following options:
> {code:java}
> planner.slice_target = 25
> planner.width.max_per_node = 16 (to match the core count){code}
> * Without the above parallelization, query speeds on Parquet files are super
> slow (tens of seconds).
> * While queries do work, we are seeing disproportionate direct memory/heap
> utilization (up to 20 GB of direct memory used, a minimum of 12 GB heap
> required).
> * We're still encountering the occasional out-of-memory error (we're also
> seeing heap exhaustion, but I guess that's another indication of the same
> problem). Reducing the node parallelization width to, say, 8 reduces memory
> contention, though it still reaches 8 GB of direct memory:
> {code:java}
> User Error Occurred: One or more nodes ran out of memory while executing the
> query. (null)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or
> more nodes ran out of memory while executing the query.null[Error Id:
> 67b61fc9-320f-47a1-8718-813843a10ecc ]
> at
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:657)
> at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:338)
> at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.drill.exec.exception.OutOfMemoryException: null
> at
> org.apache.drill.exec.vector.complex.AbstractContainerVector.allocateNew(AbstractContainerVector.java:59)
> at
> org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.allocateOutgoingRecordBatch(PartitionerTemplate.java:380)
> at
> org.apache.drill.exec.test.generated.PartitionerGen5$OutgoingRecordBatch.initializeBatch(PartitionerTemplate.java:400)
> at
> org.apache.drill.exec.test.generated.PartitionerGen5.setup(PartitionerTemplate.java:126)
> at
> org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createClassInstances(PartitionSenderRootExec.java:263)
> at
> org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.createPartitioner(PartitionSenderRootExec.java:218)
> at
> org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:188)
> at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93)
> at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:323)
> at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:310)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:310)
> ... 4 common frames omitted{code}
> I've attached a (real!) sample dataset to match the query above. That same
> dataset recreates the aforementioned memory behavior.
> Help, please.
> Idan
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)