[ 
https://issues.apache.org/jira/browse/PIG-49?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558870#action_12558870
 ] 

Utkarsh Srivastava commented on PIG-49:
---------------------------------------

This issue requires a substantial redesign of the Pig execution engine.

We are currently investigating 2 different approaches to the problem of having 
to hold bags in memory.

1. Push interface for functions that take one bag as input
============================================

Common functions like SUM, COUNT, and similar functions that take one bag as 
input could implement an interface as follows:

group_begin()
group_next()
group_finish()

We could then push one tuple of the bag at a time to each function, thereby 
eliminating the need to hold the bag in memory.

Disadvantage: There is no intuitive push API for functions that take multiple 
bags as input. 

2. Pull interface for functions as today, but run different functions in 
different threads
================================================================

A pull interface is where you get handed the whole input and you can iterate 
over it. This is what we have today, and is the easiest for a UDF writer to 
work with. This is the only model that is intuitive when you have multiple bags 
as input. 

To prevent holding the bag in memory, while still allowing multiple functions 
to access it in a pull based manner, we run each function in a separate thread, 
and buffer only a small portion of each bag. If any thread gets too far ahead 
of others, we suspend it until others catch up.

Disadvantage: Context switching, synchronization overhead.

Currently, Ben is experimenting with model 2. But it seems to be at least twice 
as slow as model 1.

Feel free to comment on this issue, if you have opinions about this.



> optimize bag usage
> ------------------
>
>                 Key: PIG-49
>                 URL: https://issues.apache.org/jira/browse/PIG-49
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>
> (1) Currently, we always bring the entire bag into memory even though in most 
> cases we just need to stream through it. This is very inefficient in terms of 
> memory and CPU usage.
> (2) If we are doing multiple computations on the same group, we iterate over 
> the bag that represents the group several times. This is very inefficient 
> especially for spilled bags.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to