Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ProposedProjects

------------------------------------------------------------------------------
  || Optimization || In many cases data to be joined is already sorted and partitioned on the same key. Pig needs to be able to take advantage of this and do these joins in the map. The join could be done by sampling one input to determine the value of the join key at the beginning of every HDFS block. This would form an index. Then a second MR job can be run with the other input. Based on the key seen in the second input, the appropriate blocks of the first input can also be loaded into the map and the join done. || || gates || ||
  || Optimization || The combiner is not currently used if FILTER is in the FOREACH. In some cases it could still be used. || [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
  || Optimization || Currently when types of data are declared Pig inserts a FOREACH immediately after the LOAD that does the conversions. These conversions should be delayed until the field is actually used. || [https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
- || Optimization || When an order by is not the only operation in a Pig script, it is done in two additional MR jobs. The first job samples using a sampling loader, the second does the sort. The sample is used to construct a partitioner that equally balances the data in the sort. The sampler needs to be changed to be a !EvalFunc instead of a loader. This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample. The final MR job can then be the sort. || || gates || ||
+ || Optimization || When an order by is not the only operation in a Pig script, it is done in two additional MR jobs. The first job samples using a sampling loader, the second does the sort. The sample is used to construct a partitioner that equally balances the data in the sort. The sampler needs to be changed to be a !EvalFunc instead of a loader. This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample. The final MR job can then be the sort. || [https://issues.apache.org/jira/browse/PIG-791 791] || gates || olgan ||
  || Optimization || When an order by is the only operation in a Pig script it is currently done in 3 MR jobs. The first converts it to BinStorage format (because the sample loader reads that format), the second samples, and the third sorts. Once the changes mentioned above to make the sampler an !EvalFunc are done it should be changed to be done in 2 MR jobs instead of 3. || [https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
  || Optimization || The Pig optimizer should be used to determine when fields in a record are no longer needed and put in FOREACH statements to project out the unnecessary data as early as possible. || [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
  || Optimization || The Pig optimizer needs to call fieldsToRead so that Load functions that can do column skipping do it. || || gates || ||
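
The order-by rows above rely on a sample being turned into a partitioner that balances the sort. As a rough illustration of that idea only (this is not Pig's implementation, and the names build_boundaries and partition_for are hypothetical), a sample of keys can be reduced to evenly spaced split points, and each key then routed to the range it falls in:

```python
from bisect import bisect_right


def build_boundaries(sample_keys, num_partitions):
    """Pick num_partitions - 1 split points from a sample of sort keys.

    Hypothetical helper: assumes the sample is representative of the
    full input, which is what the sampling job is meant to provide.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    # Evenly spaced quantiles of the sample become the range boundaries.
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Return the index of the range partition that should receive key."""
    # Keys below the first boundary go to partition 0, and so on upward.
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    boundaries = build_boundaries(range(100), 4)
    counts = [0] * 4
    for key in range(100):
        counts[partition_for(key, boundaries)] += 1
    print(boundaries, counts)
```

Because the partitions cover ordered, non-overlapping key ranges, sorting each partition independently and reading them in partition order yields a globally sorted result, provided the sample reflects the real key distribution.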
