Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=1&rev2=2

--------------------------------------------------

  '''Estimated Development Effort:''' small
+ ==== Combiner Not Used with Limit or Filter ====
+ Pig scripts that have a foreach with a nested limit or filter do not use the combiner even when they could.  Not all filters can use the combiner, but in some cases
+ they can.  I think all limits could at least apply the limit in the combiner, though the UDF itself may only be executed in the reducer.
+
+ '''Category:''' Performance
+
+ '''Dependency:''' Map Reduce Optimizer
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-479|PIG-479]]
+
+ '''Estimated Development Effort:''' small
+
  ==== Clean Up File Access Packages ====
  Early on Pig sought to be completely Hadoop independent in its front end processing (parsing, logical plan, optimizer).  To this end a number of abstractions were created for file access, which are located in the org.apache.pig.backend.datastorage package.  Now that we have modified
@@ -189, +201 @@

  '''References:'''

  '''Estimated Development Effort:''' large
+
+ ==== Order By for Small Data ====
+ Currently Pig always samples the data for an order by and splits it across multiple machines.  In cases where the data to be ordered is small enough to fit on a
+ single node, the sample stage should be eliminated and the sorting done by an identity mapper plus a reduce job.
+
+ '''Category:''' Performance
+
+ '''Dependency:'''
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-483|PIG-483]]
+
+ '''Estimated Development Effort:''' small

  ==== Outer Join for Merge Join ====
  Merge join is the last join type to not support outer join.  Right outer join is doable in the current infrastructure.
  Left and full outer join will require an
@@ -445, +469 @@

  '''Estimated Development Effort:''' depends on what type of integration is chosen
+
+ ==== Physical Operators Take List of Tuples ====
+ Currently tuples are passed one at a time between physical operators.  Moving all the way through the pipeline for each tuple causes a lot of context switching.  We
+ need to investigate batching tuples and passing a list between operators instead.  In the map phase this would be likely to help, though we would want to
+ re-implement our map task to take control from Map Reduce so that we get multiple records at a time.  In reduce it is less clear, since tuples in reduce
+ tend to be large (since they already contain the group), and thus batching them may cause memory problems.
+
+ '''Category:''' Performance
+
+ '''Dependency:'''
+
+ '''References:''' [[https://issues.apache.org/jira/browse/PIG-688|PIG-688]]
+
+ '''Estimated Development Effort:''' medium (involves a rewrite of many physical operators)
+
+
  ==== Run Map Reduce Jobs Directly From Pig ====
  It would be very useful to be able to run arbitrary Map Reduce jobs from inside Pig.  This would look something like:
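  The page's own example does not appear in this excerpt.  As a purely illustrative sketch (the jar name, class name, and paths below are hypothetical, and this is
  not necessarily the syntax the page proposed), the native MAPREDUCE operator that Pig later gained works along these lines:

  {{{
  A = load 'pages' as (url:chararray, content:chararray);
  -- hand A to an arbitrary Map Reduce job packaged in a jar, then load that job's output
  -- ('wordcount.jar', org.myorg.WordCount, and the two directories are hypothetical names
  -- used only for illustration)
  B = mapreduce 'wordcount.jar'
          store A into 'mrInput'
          load 'mrOutput' as (word:chararray, count:int)
          `org.myorg.WordCount mrInput mrOutput`;
  dump B;
  }}}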