Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for change
The following page has been changed by ChrisOlston:
---+ Pig Execution Model
The goal is to decide how to structure operator evaluation pipelines in Pig.
The major issues include whether data flow follows a push model or a pull
model, whether an operator evaluation tree is multi-threaded or
single-threaded, and what the API for user-defined functions (UDFs) looks like.
---++ Alternative Execution Models
Some possible execution models are:
---+++ Model 1
* One thread.
* Top-level scheduler round-robins through "leaf" operators.
* Each time an operator is invoked, it gets handed exactly one tuple.
* Special "split" operator buffers data that gets fed to multiple operators;
at most one tuple gets buffered at each split point.
* UDF API: declare zero or one input bag as "streaming"; init() hands it all
data except the streaming bag; next() hands one tuple from streaming bag
---+++ Model 2
* One thread per "leaf" operator.
* Scheduling done by OS.
* Operator gets to read as many tuples as it wants; if it reads from
multiple inputs, can interleave "next()" calls on the inputs in arbitrary
* Split operator may buffer up to K tuples (or B bytes); if an operator
tries to read too far ahead it gets blocked until other operators reading from
the same buffer catch up.
* Deadlock can arise; need to detect it and release it by relaxing the K/B
constraint on one or more of the split buffers.
---++++ Model 1 Drawbacks
* underutilize multi-core systems? (depends whether the policy is to assign
several map/reduce tasks to a machine)
* difficult (or impossible?) to support operations that require streaming
access to multiple inputs (e.g., merge join, merge-based set difference, etc.
which operate over pre-sorted input streams)
* UDF APIs more complex?
---++++ Model 2 Drawbacks
* thread synchronization overhead
* complexity of multi-threaded implementation
---++++ Related work
* Fjords paper
* Stream Programming Model / MIT StreamIt [waiting for pointer]