[ https://issues.apache.org/jira/browse/PHOENIX-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334854#comment-15334854 ]

Lars Hofhansl commented on PHOENIX-3000:
----------------------------------------

It depends. :) Not copying is better if we (say) pipe a stream of Cells through 
and only keep the last N, or if we simply keep some aggregates, region 
start/end keys, etc.
If we keep a reference to the backing array of more or less random Cells, we 
might end up holding on to a lot of heap space.
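As an illustration of the copy-before-retain pattern (a toy sketch, not 
{{DistinctValueWithCountServerAggregator}} itself; the class and map names are 
made up): cloning just the value bytes out of the Cell before using them as a 
map key means the entry no longer pins the HFile block the Cell points into.

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;

// Illustration only: a toy distinct counter, not the Phoenix aggregator.
public class CopyingDistinctCounter {

    private final Map<ByteBuffer, Long> valueVsCount = new HashMap<>();

    public void aggregate(Cell cell) {
        // CellUtil.cloneValue copies just the value bytes out of the Cell, so the
        // map entry no longer references the Cell's backing array and therefore
        // no longer keeps the whole HFile block reachable.
        byte[] copy = CellUtil.cloneValue(cell);
        valueVsCount.merge(ByteBuffer.wrap(copy), 1L, Long::sum);
    }
}
{code}

The trade-off above still applies: for a streaming pipeline that looks at each 
Cell only once, the extra copy is pure overhead.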

Why would memory management not be applicable here? Since we keep the key in 
the HashMap, there is no inherent limit to how much heap this map might 
consume. I think we should keep track of the size, and if we're over some 
maximum, fail the query (that's better than causing issues with the region 
server - in which case the query would also fail :)
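A rough sketch of that size-tracking idea (the cap, the names, and the 
per-entry overhead estimate below are invented for illustration; they are not 
Phoenix code or configuration):

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a size-bounded distinct-value map that fails fast instead of
// exhausting region server heap.
public class SizeBoundedDistinctCount {

    private static final long MAX_HEAP_BYTES = 100L * 1024 * 1024; // hypothetical cap

    private final Map<ByteBuffer, Long> valueVsCount = new HashMap<>();
    private long estimatedBytes = 0;

    public void add(byte[] distinctValue) {
        ByteBuffer key = ByteBuffer.wrap(distinctValue);
        // merge returns the updated count; 1 means this key was not seen before.
        if (valueVsCount.merge(key, 1L, Long::sum) == 1L) {
            // Account for the key bytes plus a rough per-entry overhead.
            estimatedBytes += distinctValue.length + 32;
            if (estimatedBytes > MAX_HEAP_BYTES) {
                throw new IllegalStateException("DISTINCT aggregation exceeded "
                    + MAX_HEAP_BYTES + " tracked bytes; failing the query");
            }
        }
    }
}
{code}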

I'll look at InMemoryGroupByCache... Maybe we can do that in another JIRA.

I'll do some more safety tests with this one, and if all looks good, commit it. 
Thanks for taking a look!


> Reduce memory consumption during DISTINCT aggregation
> -----------------------------------------------------
>
>                 Key: PHOENIX-3000
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3000
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>         Attachments: 3000.txt
>
>
> In {{DistinctValueWithCountServerAggregator.aggregate}} we hold on to the ptr 
> handed to us from HBase.
> Note that this pointer points into an HFile block, and hence we hold onto the 
> entire block for the duration of the aggregation.
> If the column has high cardinality, in the extreme case we might end up 
> holding the entire table in memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
