Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "LoadStoreRedesignProposal" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=16&rev2=17

--------------------------------------------------

  1. How will we work with compressed files? !FileInputFormat already works with bzip and gzip compressed files, producing reasonable splits. !PigStorage will be reworked to depend on !FileInputFormat (or a descendant thereof, see next item) and should therefore be able to use this functionality. Currently Pig supports gz/bzip for arbitrary loadfunc/storefunc combinations. With this proposal, gz/bzip format will only be supported for load/store using PigStorage.

- === Implementation details and status ===
+ == Implementation details and status ==
- ==== Current status ====
+ === Current status ===
  A branch, 'load-store-redesign' (http://svn.apache.org/repos/asf/hadoop/pig/branches/load-store-redesign), has been created to undertake work on this proposal. As of today (Nov 2, 2009) this branch has simple load-store working for PigStorage and BinStorage. Joins on multiple inputs and multi-store queries with multi-query optimization also work. Some of the recent changes in the proposal above (the changes noted under Nov 2, 2009 in the Changes below) have not been incorporated. A list of remaining tasks (possibly not comprehensive) is given in a subsection below.

- ==== Notes on implementation details ====
+ === Notes on implementation details ===
+ This section documents the changes at a high level, to give an overall connected picture that code comments may not provide.
+ ==== Changes to work with Hadoop !InputFormat model ====
+
+ ==== Changes to work with Hadoop !OutputFormat model ====
+
- ==== Remaining Tasks ====
+ === Remaining Tasks ===
-  * BinStorage needs to implement LoadMetadata's getSchema() to replace current determineSchema()
+  * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema()
   * piggybank loaders/storers need to be ported
-  * fix lineage code to use LoadCaster instead of LoadFunc
+  * fix lineage code to use !LoadCaster instead of !LoadFunc
   * local mode needs to be ported
-  * PigDump needs to be ported
+  * !PigDump needs to be ported
-  * poload needs to be ported
+  * !POLoad needs to be ported
   * Need to handle passing loadfunc specific info between different instances of loadfunc (different instances in the front end, and between front end and back end; we need what is required in PIG-602). setPartitionFilter() and pushOperators(), for example, need this: these methods are called in the front end but the information passed is needed in the back end.
-  * For ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with
+  * For !ResourceSchema to be effectively used for communicating schema, we must fix the two level access issues with
  schema of bags in current schema before we make these changes, otherwise that same contagion will afflict us here.
   * Input/Output handler code in streaming needs to be ported
   * split by file will have to be removed from the language
   * fix code with FIXME in comment relating to load-store redesign
-  * Decide on what we should do with ReversibleLoadFunc and multiquery optimization
+  * Decide on what we should do with !ReversibleLoadFunc and multiquery optimization
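
One possible shape for the "passing loadfunc specific info" task above, as a rough sketch only: the front-end instance serializes its state into a string-valued job property, and the back-end instance reads it back. Everything named here is an assumption for illustration. `LoadFuncStateSketch`, `keyFor`, the key prefix and the `"load-1"` signature are hypothetical, and a plain `java.util.Properties` stands in for the Hadoop job configuration. This is not the mechanism chosen in PIG-602 or in the branch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Properties;

public class LoadFuncStateSketch {

    // Hypothetical key scheme: one property per loadfunc instance, so that two
    // loads using the same loader class do not overwrite each other's state.
    static String keyFor(String loadFuncClass, String signature) {
        return "sketch.loadfunc.state." + loadFuncClass + "." + signature;
    }

    // Front end: turn the state into a string that fits in a job property.
    static String serialize(Serializable state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Back end: rebuild the state from the property value.
    static Object deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the job configuration shipped from front end to back end.
        Properties jobConf = new Properties();
        String key = keyFor("PigStorage", "load-1");

        // Front end: e.g. the information handed to setPartitionFilter().
        ArrayList<String> partitionFilter = new ArrayList<>();
        partitionFilter.add("year == 2009");
        jobConf.setProperty(key, serialize(partitionFilter));

        // Back end: a different loadfunc instance recovers the same information.
        Object restored = deserialize(jobConf.getProperty(key));
        System.out.println(restored);
    }
}
```

Keying the property by both loader class and a per-load signature is the design point worth noting: without the signature, a script that loads two inputs with the same loader would have the second load's state replace the first.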