[jira] Issue Comment Edited: (PIG-167) Experiment : A proper bag memory manager.

Pi Song (JIRA) Tue, 25 Mar 2008 06:39:36 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581667#action_12581667
 ]


pi_song edited comment on PIG-167 at 3/25/08 6:37 AM:
------------------------------------------------------

h1. Ignore this comment. The purpose of the design has changed!

I have designed and implemented GenerationalSpillableMemoryManager (See diagram)

(1) Resources (bags) are registered through the registerResource method and 
are held internally as WeakReferences. handleNotification receives 
memory-pressure notifications from the MXBean.
(2) The resource references are divided into two generations, eden and survivor 
(the generational memory manager concept). When reclaiming memory, resources in 
eden are reclaimed first. If that is not enough, resources in the older 
generation are reclaimed. This is purely an optimization.
(3) In each generation, sub-generations are maintained as a linked list (the 
blue nodes in the diagram). Each node in the list contains a fixed-size array 
of WeakReferences (to save memory).
(4) When a new resource is registered, it is stored at the tail of the 
sub-generation list. When the tail node is full, a new tail is added.
(5) Memory is reclaimed from the head of the sub-generation list (the oldest 
entries). Resources that survive a reclaim are relocated to the survivor 
generation (they have lived long enough). This helps avoid in-use bags being 
spilled and read back over and over.
(6) Because of (4) and (5), "Register" and "Reclaim" should not have much lock 
contention, since most of the time they operate on different data.
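
Points (1) to (5) can be sketched roughly as below. This is only an 
illustration, not the code in MemManager0.patch: the Resource interface, the 
Generation and GenerationalRegistry classes, and the convention that spill() 
returns the number of bytes freed are all assumptions made for the example, 
and the locking discussed in (6) is left out.

```java
import java.lang.ref.WeakReference;

/** Hypothetical stand-in for Pig's Spillable; spill() is assumed to return bytes freed. */
interface Resource {
    long spill();
}

/** One generation: a singly linked list of fixed-size chunks of weak references. */
final class Generation {
    private static final class Node {
        final WeakReference<Resource>[] refs;
        int count;   // number of used slots in this chunk
        Node next;

        @SuppressWarnings("unchecked")
        Node(int size) {
            refs = (WeakReference<Resource>[]) new WeakReference[size];
        }
    }

    private final int chunkSize;
    private Node head;   // oldest chunk: reclaim starts here
    private Node tail;   // newest chunk: registration appends here
    private int chunks;

    Generation(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    /** Append at the tail; start a new tail chunk when the current one is full. */
    void add(WeakReference<Resource> ref) {
        if (tail == null || tail.count == chunkSize) {
            Node n = new Node(chunkSize);
            if (tail == null) head = n; else tail.next = n;
            tail = n;
            chunks++;
        }
        tail.refs[tail.count++] = ref;
    }

    /** Detach and return the oldest chunk (unused slots are null), or null if empty. */
    WeakReference<Resource>[] pollOldest() {
        if (head == null) return null;
        Node n = head;
        head = n.next;
        if (head == null) tail = null;
        chunks--;
        return n.refs;
    }

    boolean isEmpty() { return head == null; }

    int chunkCount() { return chunks; }
}

final class GenerationalRegistry {
    private final Generation eden;
    private final Generation survivor;

    GenerationalRegistry(int chunkSize) {
        eden = new Generation(chunkSize);
        survivor = new Generation(chunkSize);
    }

    /** New resources always enter eden, held only through a weak reference. */
    void register(Resource r) {
        eden.add(new WeakReference<>(r));
    }

    /** Spill oldest chunks first, eden before survivor, until `target` bytes are freed. */
    long reclaim(long target) {
        // Snapshot so the survivor pass visits at most the chunks that existed before this call.
        int oldSurvivorChunks = survivor.chunkCount();
        long freed = 0;
        while (freed < target && !eden.isEmpty()) {
            freed += spillChunk(eden.pollOldest());
        }
        for (int i = 0; i < oldSurvivorChunks && freed < target; i++) {
            freed += spillChunk(survivor.pollOldest());
        }
        return freed;
    }

    private long spillChunk(WeakReference<Resource>[] chunk) {
        long freed = 0;
        for (WeakReference<Resource> ref : chunk) {
            Resource r = (ref == null) ? null : ref.get();
            if (r == null) continue;   // unused slot or already garbage collected: drop it
            freed += r.spill();
            survivor.add(ref);         // still reachable: it goes to (or stays in) survivor
        }
        return freed;
    }
}
```

Because a whole chunk is detached from the head before it is processed, 
cleared references disappear along with the chunk and no compaction pass 
over the list is needed.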

Reclaim can be triggered in two ways:
- Memory pressure reported by the MXBean.
- All resources come in through the registerResource method, so we maintain a 
counter there. When the counter exceeds a threshold, a reclaim is triggered. 
This keeps the number of cleared (null) WeakReferences under control. (Having a 
timer that wakes up periodically doesn't sound right to me. Why clean up when 
there is no activity?)
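
A rough sketch of the two triggers follows, using the standard 
java.lang.management notification API. The ReclaimTriggers class, its 
registerThreshold parameter, and the heap-fraction argument are hypothetical 
names chosen for the example; the actual patch may wire this up differently.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

final class ReclaimTriggers {
    private final int registerThreshold;   // hypothetical tuning knob
    private final Runnable reclaim;        // whatever performs the actual reclaim
    private int registrations;

    ReclaimTriggers(int registerThreshold, Runnable reclaim) {
        this.registerThreshold = registerThreshold;
        this.reclaim = reclaim;
    }

    /** Trigger 1: reclaim when a heap pool crosses the given fraction of its maximum size. */
    void listenForMemoryPressure(double fraction) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null && usage.getMax() > 0) {
                    pool.setUsageThreshold((long) (usage.getMax() * fraction));
                }
            }
        }
        NotificationListener listener = (notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                reclaim.run();
            }
        };
        // The platform MemoryMXBean is documented to be a NotificationEmitter.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener(listener, null, null);
    }

    /** Trigger 2: call once per registration; fires a reclaim every registerThreshold calls. */
    void onRegister() {
        boolean fire;
        synchronized (this) {
            fire = ++registrations >= registerThreshold;
            if (fire) registrations = 0;
        }
        if (fire) reclaim.run();   // run outside the lock so registration is not blocked by I/O
    }
}
```

Since the second trigger only advances when something is registered, an idle 
process never wakes up just to clean, which is the objection to a periodic 
timer raised above.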

Assumptions:
- MapReduce work is repetitive, so most resources should have statistically 
similar lifespans (= Alan's assumption). (Evidence: the intermediate bags 
generated by map work.)
- Resources that survive while others created at the same time are no longer 
in use must be special and tend to live much longer.

Benchmark (I didn't have enough time to do it properly, so it might not be 
accurate):
- When memory is more than enough: 3-5% faster.
- When memory is scarce: up to 15% faster (in one of my tests that does 
GROUP ALL).
*I need a proper benchmark. Can anyone give me good test cases, or try this out 
and record the results for me?*

This is a preview version. More micro-tuning and testing have to be done.

> Experiment : A proper bag memory manager.
> -----------------------------------------
>
>                 Key: PIG-167
>                 URL: https://issues.apache.org/jira/browse/PIG-167
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Pi Song
>         Attachments: diagram.gif, MemManager0.patch
>
>
> According to PIG-164, I think we still have room for improvement:-
> 1) Alan said
> {quote}
> "It rests on the assumption that data bags generally live about the same 
> amount of time, thus there won't be a long lived databag at the head of the 
> list blocking the cleaning of many stale references later in the list."
> {quote}
> Looking at this line of code in SpillableMemoryManager:
> {noformat}
> Collections.sort(spillables, new Comparator<WeakReference<Spillable>>() {
> {noformat}
> - Alan's assumption might no longer hold after the memory manager tries to 
> spill the list, because the sort reorders it.
> - I don't understand why the list has to be sorted so that spilling starts 
> from the smallest bags. Most file systems are not good at handling small 
> files (especially ext2/ext3).
> 2) We use a LinkedList to maintain the WeakReferences. Normally a linked list 
> consumes about twice as much memory as an array would (for the pointers). 
> Would it be better to change the LinkedList to an array or ArrayList?
> 3) In SpillableMemoryManager, handleNotification, which does I/O-intensive 
> work, shares the same lock with registerSpillable. This doesn't seem 
> to be efficient.
> 4) Sometimes I noticed that the bag currently in use got spilled and read 
> back over and over again. Ideally, the memory manager should spill bags 
> that are not currently in use first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
