[ https://issues.apache.org/jira/browse/DRILL-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939000#comment-14939000 ]
Chris Westin commented on DRILL-3874:
-------------------------------------
The problem with the flatten() operation is the following:
The current operator design doesn't provide a means to push down the set of
output fields that are of interest. When records pass through an operator, the
result includes all of the original fields, as well as all the newly added
computed fields that the operator produces. The current design requires a
subsequent project (see the query profile) to filter out the fields you don't
want. As a result, when you flatten a 4MB record, each output record includes
all of the original 4MB of fields, plus whatever you extracted. In the sample
given, that means that internally, for each of the 20,000 elements in the array
being flattened, you get a copy of the original 4MB record along with just that
one element.
We have some internal checks that prevent any single value vector from growing
too large; when one of those trips, we flush what we've processed so far. Here,
one of those checks trips after about 2,000 of the flattened elements have been
processed. Note that 2,000 x 4MB is about 8GB, which is why the profile reports
a peak memory usage of 8GB for that operator.
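
To make that arithmetic concrete, here is a back-of-the-envelope model of the
batch growth (a hedged sketch with illustrative names, not actual Drill code):

{code:java}
// Models the pass-through cost described above: each flattened row carries a
// copy of the original record's fields, so the batch grows with rows x size.
public class FlattenFootprint {
    public static void main(String[] args) {
        long recordBytes = 4L << 20;  // the 4MB input record
        int rowsBeforeFlush = 2_000;  // rows emitted before the vector check trips

        long batchBytes = rowsBeforeFlush * recordBytes;
        // 2,000 x 4MB ~= 8GB, matching the operator's peak memory in the profile
        System.out.printf("batch holds ~%.1f GiB before flushing%n",
            batchBytes / (double) (1L << 30));
    }
}
{code}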
The better, longer-term fix requires re-architecting the operators so that the
list of fields of interest can be pushed down and the originals are no longer
passed through in this way. But that's a much bigger change that will affect a
lot of the system.
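
For illustration only, the kind of interface that re-architecture implies might
look like this (a hypothetical sketch, not Drill's actual operator API):

{code:java}
import java.util.Set;

// Hypothetical: an operator that can be told which output fields downstream
// actually consumes, so all other pass-through fields can be dropped early.
interface ProjectionPushdownAware {
    // Called at setup time with the set of column names the parent operator
    // will read; everything else may be omitted from the output batches.
    void pushDownProjection(Set<String> requiredFields);
}
{code}

With something like this, flatten could emit only the flattened element (plus
whatever the query actually selects) instead of copying all of the original
record's fields into every output row.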
In the short term, to avoid using up so much memory, we need to know how much
memory has been consumed so far during the flattening process. Unfortunately,
our value vector implementation has no way to ask for that until after the
vector has been fully populated; by then it's too late, because a huge amount
of memory has already been used. I'm adding something that will let us ask for
that information while the vector is being filled. However, that query comes
with a performance penalty. To mitigate it, the flattening process will also
get an adaptive algorithm that estimates memory usage by asking for this
information only periodically. The period will be determined by the large
records seen in the past and how much memory they took up. A query like this
one will start out optimistically, but if it hits an initial memory limit, we
start sampling this information and monitoring it periodically to keep memory
usage from growing too large.
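
A minimal sketch of that adaptive scheme, assuming a (costly) way to query a
vector's allocated size, might look like the following; the names here are
illustrative, not the actual Drill implementation:

{code:java}
import java.util.function.LongSupplier;

// Optimistic at first; after the first limit hit, switches to sampling the
// vector's allocated size every `interval` rows. The interval is derived from
// the largest records seen so far, so big records mean more frequent checks.
public final class AdaptiveVectorSizeMonitor {
    private final long limitBytes;
    private boolean sampling = false;  // stays false until a limit is first hit
    private long interval = 1;         // rows between size queries when sampling
    private long rowsSinceCheck = 0;
    private long maxRecordBytes = 1;   // largest record observed so far

    public AdaptiveVectorSizeMonitor(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Call once per emitted row; returns true when the batch should flush. */
    public boolean onRow(long recordBytes, LongSupplier allocatedBytes) {
        maxRecordBytes = Math.max(maxRecordBytes, recordBytes);
        if (!sampling || ++rowsSinceCheck < interval) {
            return false;  // optimistic, or between samples: skip the costly query
        }
        rowsSinceCheck = 0;
        return allocatedBytes.getAsLong() >= limitBytes;  // over limit: flush now
    }

    /** Called when a hard memory limit is hit; start periodic sampling. */
    public void onLimitHit() {
        sampling = true;
        // Size the period from past records: with records this large, let at
        // most ~1/8 of the limit accumulate between two size queries.
        interval = Math.max(1, (limitBytes / 8) / maxRecordBytes);
    }
}
{code}

When onRow() returns true, the operator would flush the rows accumulated so far
(as the existing hard checks already do) and continue with a fresh batch.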
> flattening large JSON objects consumes too much direct memory
> -------------------------------------------------------------
>
> Key: DRILL-3874
> URL: https://issues.apache.org/jira/browse/DRILL-3874
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Data Types
> Affects Versions: 1.1.0
> Reporter: Chris Westin
> Assignee: Chris Westin
>
> A JSON record has a field whose value is an array with 20,000 elements; the
> record's size is 4MB. A select is used to flatten this. The query profile
> reports that the peak memory utilization was 8GB, most of it used by the
> flatten.