[ https://issues.apache.org/jira/browse/DRILL-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939000#comment-14939000 ]

Chris Westin commented on DRILL-3874:
-------------------------------------

The problem with the flatten() operation is the following:
The current operator design doesn't provide a means to push down the set of 
output fields that are of interest. When records pass through an operator, the 
result includes all of the original fields, as well as all of the newly 
computed fields the operator produces. The current design requires a 
subsequent project (see the query profile) to filter out the fields you don't 
want. As a result, when you flatten a 4MB record, each output record includes 
the original 4MB record, plus whatever you extracted. In the sample given, 
that means that internally, for each element in the array being flattened, you 
get a copy of the original 4MB record along with one of the 20,000 array 
elements.
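
Roughly, the effect is the following (an illustrative Java sketch, not our 
actual operator code; in the real engine the bytes live in value vectors 
rather than maps):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FlattenSketch {
        // Each output row repeats all original fields plus one array
        // element, because the operator can't drop unprojected fields.
        public static List<Map<String, Object>> flatten(
                Map<String, Object> record, String arrayField) {
            List<?> elements = (List<?>) record.get(arrayField);
            List<Map<String, Object>> out = new ArrayList<>();
            for (Object element : elements) {
                Map<String, Object> row = new HashMap<>(record); // all fields
                row.put(arrayField, element);                    // one element
                out.add(row);
            }
            // Here the copies are by reference; in Drill's value vectors
            // the bytes are materialized per row, so a 4MB record with
            // 20,000 elements yields 20,000 rows each carrying the full
            // 4MB payload until a later project strips them.
            return out;
        }
    }
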
We have some internal checks that prevent us from making any single value 
vector too large; when one of those limits is hit, we flush what we've 
processed so far. One of those checks goes off after about 2,000 of the 
flattened elements have been processed. Note that 2,000 x 4MB is about 8GB; 
this is why the peak memory usage for that operator in the profile is 8GB.
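
In outline, the check behaves like this (a sketch; the limit constant and 
names here are illustrative, not the real code):

    public class BatchLimitSketch {
        // Assumed per-vector byte cap; Drill's real limits differ.
        private static final long MAX_VECTOR_BYTES = 8L * 1024 * 1024 * 1024;
        private long bytesInCurrentBatch = 0;

        void addFlattenedRow(long rowBytes) {
            if (bytesInCurrentBatch + rowBytes > MAX_VECTOR_BYTES) {
                flush(); // with 4MB rows this fires after ~2,000 rows,
                         // which matches the 8GB peak in the profile
            }
            bytesInCurrentBatch += rowBytes;
        }

        private void flush() {
            // Hand the current batch downstream and reset the counter.
            bytesInCurrentBatch = 0;
        }
    }
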
The better, longer-term fix requires some re-architecting of the operators so 
that we can push down the list of fields of interest and avoid passing the 
originals through in this way. But that's a much bigger change that will 
affect a lot of the system.
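
For illustration only, the pushdown hook might take a shape like this 
(hypothetical; no such interface exists today):

    import java.util.Set;

    interface ProjectionPushdown {
        // The planner tells the operator which output columns are
        // actually needed, so unreferenced originals can be dropped at
        // the source instead of by a later project.
        void setProjectedFields(Set<String> fieldNames);
    }
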
In the short term, in order to avoid using up so much memory, we need to know 
how much memory has been consumed so far during the flattening process. 
Unfortunately, our value vector implementation doesn't provide a way to ask 
for that until after you've finished populating the vector; by then it's too 
late, because you've already used up a huge amount of memory. I'm adding 
something that will let us ask for that information as we go. However, that 
comes with a performance penalty. To mitigate that, the flattening process 
will also get an adaptive algorithm that estimates memory usage by asking for 
this information only periodically. The period will be determined by the 
large records seen in the past and how much memory they took up. A query like 
this one will start out optimistically, but if we hit an initial memory 
limit, we start sampling this information and monitoring it periodically to 
keep memory use from getting too large.
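
Sketching the idea (the names, threshold, and halving policy below are 
illustrative, not the actual patch):

    import java.util.function.LongSupplier;

    public class AdaptiveMemoryMonitor {
        private static final long MEMORY_LIMIT_BYTES =
            512L * 1024 * 1024;            // assumed limit for the sketch
        private long checkInterval = 4096; // start optimistic: query rarely
        private long rowsSinceCheck = 0;

        // Called once per flattened row; the supplier performs the costly
        // vector-size query, so it only runs every checkInterval rows.
        // Returns true when the caller should flush the current batch.
        boolean onRowWritten(LongSupplier allocatedBytes) {
            if (++rowsSinceCheck < checkInterval) {
                return false; // skip the expensive size query
            }
            rowsSinceCheck = 0;
            if (allocatedBytes.getAsLong() > MEMORY_LIMIT_BYTES) {
                // Large records observed: sample more often from now on.
                checkInterval = Math.max(1, checkInterval / 2);
                return true;  // flush to cap memory use
            }
            return false;
        }
    }
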

> flattening large JSON objects consumes too much direct memory
> -------------------------------------------------------------
>
>                 Key: DRILL-3874
>                 URL: https://issues.apache.org/jira/browse/DRILL-3874
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.1.0
>            Reporter: Chris Westin
>            Assignee: Chris Westin
>
> A JSON record has a field whose value is an array with 20,000 elements; the 
> record's size is 4MB. A select is used to flatten this. The query profile 
> reports that the peak memory utilization was 8GB, most of it used by the 
> flatten. 


