[GitHub] carbondata pull request #2267: [CARBONDATA-2433] [Lucene GC Issue] Executor ...

manishgupta88 Thu, 03 May 2018 08:38:48 -0700

GitHub user manishgupta88 opened a pull request:

    https://github.com/apache/carbondata/pull/2267


    [CARBONDATA-2433] [Lucene GC Issue] Executor OOM because of GC when 
blocklet pruning is done using Lucene datamap

    **Problem**
    Executor OOM because of GC when blocklet pruning is done using Lucene 
datamap
    
    **Analysis**
    While searching, Lucene creates a PriorityQueue to hold the matching 
documents. Because no size is specified, the PriorityQueue size defaults to
    the number of Lucene documents. As documents are added to the heap, GC 
time increases, and eventually the task fails due to excessive GC and the
    executor runs out of memory.
    Reference blog: 
http://lucene.472066.n3.nabble.com/Optimization-of-memory-usage-in-PriorityQueue-td590355.html
    
    **Fix**
    Specify a limit for the first search, then use the searchAfter API to 
fetch the remaining results incrementally with the given PriorityQueue size.
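
    A minimal sketch of the paging idea, using only the JDK rather than 
Lucene itself. In Lucene the corresponding calls are 
IndexSearcher.search(query, n) for the first page and 
IndexSearcher.searchAfter(after, query, n) for later pages. The class 
PagedTopN, its Hit type, and the float[] of scores standing in for an index 
are all hypothetical names introduced for illustration; this is not the code 
from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative stand-in for a search index: scores[doc] is the score of document `doc`.
class PagedTopN {
    /** A hit: document id plus relevance score. */
    public static final class Hit {
        public final int doc;
        public final float score;
        public Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Best first: higher score wins, lower doc id breaks ties.
    static final Comparator<Hit> BEST_FIRST = (a, b) -> {
        int c = Float.compare(b.score, a.score);
        return c != 0 ? c : Integer.compare(a.doc, b.doc);
    };

    /** Returns up to n hits that rank strictly after `after` (null means the first page). */
    public static List<Hit> searchAfter(Hit after, float[] scores, int n) {
        // The heap head is the worst hit kept so far, so it is the one evicted.
        PriorityQueue<Hit> heap = new PriorityQueue<>(n + 1, BEST_FIRST.reversed());
        for (int doc = 0; doc < scores.length; doc++) {
            Hit h = new Hit(doc, scores[doc]);
            if (after != null && BEST_FIRST.compare(h, after) <= 0) {
                continue; // already returned on an earlier page
            }
            heap.offer(h);
            if (heap.size() > n) {
                heap.poll(); // the heap never holds more than n + 1 entries
            }
        }
        List<Hit> page = new ArrayList<>(heap);
        page.sort(BEST_FIRST);
        return page;
    }

    /** Pages through every hit with a bounded heap and returns doc ids in rank order. */
    public static List<Integer> collectAll(float[] scores, int pageSize) {
        List<Integer> docs = new ArrayList<>();
        Hit after = null;
        while (true) {
            List<Hit> page = searchAfter(after, scores, pageSize);
            if (page.isEmpty()) {
                break;
            }
            for (Hit h : page) {
                docs.add(h.doc);
            }
            after = page.get(page.size() - 1); // cursor for the next page
        }
        return docs;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 0.9f, 0.1f, 0.9f, 0.3f};
        System.out.println(collectAll(scores, 2)); // prints [1, 3, 0, 4, 2]
    }
}
```

    The trade-off is that each page rescans the candidates, in exchange for 
a heap whose size is fixed by the page size instead of the document count.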
    
     - [ ] Any interfaces changed?
     No
     - [ ] Any backward compatibility impacted?
     No
     - [ ] Document update required?
    No
     - [ ] Testing done
    Manually verified with 3.7 billion data. For one query, GC time dropped 
from 40 min to 5 sec.
      
     - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA. 
    NA


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishgupta88/carbondata lucene_gc_issue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2267.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2267
    
----
commit ecea6009c55326817826bc4de8b14fad52b6db35
Author: manishgupta88 <tomanishgupta18@...>
Date:   2018-05-03T15:10:41Z

    Problem
    Executor OOM because of GC when blocklet pruning is done using Lucene 
datamap
    
    Analysis
    While searching, Lucene creates a PriorityQueue to hold the matching 
documents. Because no size is specified, the PriorityQueue size defaults to
    the number of Lucene documents. As documents are added to the heap, GC 
time increases, and eventually the task fails due to excessive GC and the
    executor runs out of memory.
    Reference blog: 
http://lucene.472066.n3.nabble.com/Optimization-of-memory-usage-in-PriorityQueue-td590355.html
    
    Fix
    Specify a limit for the first search, then use the searchAfter API to 
fetch the remaining results incrementally with the given PriorityQueue size.

----

