[ https://issues.apache.org/jira/browse/DRILL-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052565#comment-16052565 ]

Paul Rogers commented on DRILL-5211:
------------------------------------

Thanks for the link to DRILL-1960.

Yes, [~sphillips], if each operator had to do the replay, then the code in each 
operator would become highly complex. The readers are already complex, so adding 
the logic needed to replay a row would add even greater complexity. All of this 
is clearly explained in DRILL-1960.

The change proposed here is a bit different, however. The new version of 
{{setSafe}} has a limit, as described in DRILL-1960. But the way we handle the 
overflow row is different. Here, we propose to handle the overflow row 
"automagically" as part of a common batch mutator mechanism. From the reader's 
point of view, the code writes entire rows and asks whether it can write 
another. The common "mutator" handles the fiddly bits of the overflow row, 
moving it from the "full" batch to a new, empty, look-ahead batch. Because the 
mechanism is common, the result is simpler reader code. (And, yes, this means 
changing the existing readers that evolved ad-hoc ways of writing to vectors...)
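
To make the contract concrete, here is a minimal sketch of what the 
reader-facing side of such a mutator might look like. All names here are 
hypothetical illustrations, not Drill's actual API:

{code:java}
// Hypothetical sketch of the reader-facing mutator contract. The reader
// writes whole rows and asks whether it may write another; overflow
// handling lives entirely inside the mutator implementation.
public interface RowSetMutator {
  void startRow();                      // begin the next row
  void setBytes(int col, byte[] value); // limit-aware write, setSafe-style
  void saveRow();                       // commit the row; on overflow, the
                                        // mutator moves the partial row to a
                                        // new look-ahead batch automatically
  boolean canWriteMore();               // false once the batch is full
}

// A reader loop against such a contract stays trivially simple:
//   while (reader.hasNext() && mutator.canWriteMore()) {
//     mutator.startRow();
//     mutator.setBytes(0, reader.nextValue());
//     mutator.saveRow();
//   }
{code}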

[~jni] is also correct that the focus here is on the scan operator, as we've 
seen that it is the worst offender in creating large vectors. I was just 
looking at a case in which the JSON reader tried to create a single vector of 
roughly 8 TB (4096 rows, each containing an array of about 1.8 GB; clearly a 
pathological case...). Since we have to start somewhere, the readers are as 
good a place as any.

Indeed, we must handle the other operators. That is an open issue, to be 
tackled as a follow-on. For other operators, we need a slightly different 
design: one that works with the framework that code gen uses. The case you 
mention is a classic example: we have a VARCHAR column that filled its vector 
to 16 MB. Project concatenates the column with itself. As of today, this 
produces a 32 MB vector. So, we need to modify project from a 1-in-1-out model 
to a possible 1-in-n-out model; see the sketch below. We probably won't use 
the same "mutator" as for readers (though maybe we can; I've not looked into 
the issue enough yet).
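
As a toy model of the 1-in-n-out idea (plain Java strings stand in for value 
vectors, and the 16 MB cap is an assumption; none of this is existing Drill 
code):

{code:java}
// Toy model of a 1-in-n-out project: CONCAT(v, v) doubles every value, so
// an input batch whose column is at the size cap cannot fit in one output
// batch. The operator closes the current output batch at the cap and
// continues into the next one. (Names and the 16 MB cap are illustrative.)
import java.util.ArrayList;
import java.util.List;

public class SplittingProjector {
  private static final long VECTOR_LIMIT = 16 * 1024 * 1024; // assumed cap

  /** Concatenates each value with itself, splitting output at the cap. */
  public List<List<String>> project(List<String> inputBatch) {
    List<List<String>> outputBatches = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long bytesUsed = 0;
    for (String v : inputBatch) {
      String out = v + v; // projected value is twice the input width
      if (!current.isEmpty() && bytesUsed + out.length() > VECTOR_LIMIT) {
        outputBatches.add(current);  // close the full output batch...
        current = new ArrayList<>(); // ...and start the next one
        bytesUsed = 0;
      }
      current.add(out);
      bytesUsed += out.length();
    }
    if (!current.isEmpty()) {
      outputBatches.add(current); // flush the final partial batch
    }
    return outputBatches; // n output batches from one input batch
  }
}
{code}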

Anyone know of existing mechanisms we could leverage and/or extend to help with 
the non-scan operators?

> Queries fail due to direct memory fragmentation
> -----------------------------------------------
>
>                 Key: DRILL-5211
>                 URL: https://issues.apache.org/jira/browse/DRILL-5211
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.9.0
>
>         Attachments: ApacheDrillMemoryFragmentationBackground.pdf, 
> ApacheDrillVectorSizeLimits.pdf, EnhancedScanOperator.pdf, 
> ScanSchemaManagement.pdf
>
>
> Consider a test of the external sort as follows:
> * Direct memory: 3GB
> * Input file: 18 GB, with one Varchar column of 8K width
> The sort runs, spilling to disk. Once all data arrives, the sort begins to 
> merge the results. But, to do that, it must first do an intermediate merge. 
> For example, in this sort, there are 190 spill files, but only 19 can be 
> merged at a time. (Each merge file contains 128 MB batches, and only 19 can 
> fit in memory, giving a total footprint of 2.5 GB, well below the 3 GB limit.)
> Yet, when loading batch xx, Drill fails with an OOM error. At that point, 
> total available direct memory is 3,817,865,216. (Obtained from {{maxMemory}} 
> in the {{Bits}} class in the JDK.)
> It appears that Drill wants to allocate 58,257,868 bytes, but the 
> {{totalCapacity}} (again in {{Bits}}) is already 3,800,769,206, causing an 
> OOM.
> The problem is that, at this point, the external sort should not ask the 
> system for more memory. The allocator for the external sort is at just 
> 1,192,350,366 before the allocation request. Plenty of spare memory should be 
> available, released when the in-memory batches were spilled to disk prior to 
> merging. Indeed, earlier in the run, the sort had reached a peak memory usage 
> of 2,710,716,416 bytes. This memory should be available for reuse during 
> merging, and is plenty sufficient to fill the particular request in question.
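
For anyone who wants to watch these JDK-level counters directly, a small 
standalone probe (not part of Drill) can read them through the platform 
{{BufferPoolMXBean}}; this reflects allocations made via 
{{ByteBuffer.allocateDirect}}, the same accounting that {{Bits}} enforces 
against {{maxMemory}}:

{code:java}
// Standalone probe of the JVM's direct-buffer accounting: totalCapacity is
// the counter that Bits checks against maxMemory (-XX:MaxDirectMemorySize);
// an allocation fails with OOM once capacity + request exceeds maxMemory.
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryProbe {
  public static void main(String[] args) {
    for (BufferPoolMXBean pool :
         ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      if ("direct".equals(pool.getName())) {
        System.out.printf("direct buffers: count=%d, totalCapacity=%d, used=%d%n",
            pool.getCount(), pool.getTotalCapacity(), pool.getMemoryUsed());
      }
    }
  }
}
{code}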


