Apache Wiki
Tue, 05 Feb 2008 09:49:13 -0800
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------
  {{{
  <define command> ::= define <alias> <computation spec>
  <alias> ::= pig identifier
- <comparison spec> ::= <UDF spec> | <command spec>
+ <computation spec> ::= <UDF spec> | <command spec>
  <UDF spec> ::= pig standard function spec
  <command spec> ::= `<command>` [<input spec>] [<output spec>] [<ship_spec>] [<cache_spec>]
  <command> ::= standard Unix command including the arguments
@@ -90, +90 @@

   * '''unordered''' - no guarantees on the order in which the data is delivered to the streaming application
   * '''grouped''' - the data for the same key is guaranteed to be processed contiguously on a single node
-  * '''grouped and ordered''' - date is grouped and sorted within a group on user specified key.
+  * '''grouped and ordered''' - data is grouped and sorted within a group on user specified key.

  In addition to position, the data grouping and ordering can be determine by the data itself. For now, users would need to know the property of the data to be able to take advantage of its structure; however, eventually, this should be part of metadata.
@@ -142, +142 @@

  To prevent a command from being shipped, an empty list can be passed to `clause`.

+ Note that we need to make sure that executables retain their permissions and can be executed on the compute nodes.
+
  ==== 2.2 Ability to cache data ====
  The approach described above works fine for binaries/jars and small data sets. For larger datasets, loading them at run time for every execution can have serious performance consequences.
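
The note added in this change, that shipped executables must retain their permissions and remain executable on the compute nodes, could be checked with something like the sketch below. It is not part of the Pig spec or its implementation: the helper name `ensure_executable` is hypothetical, and it assumes the shipped file has already been copied to a local POSIX path on the node.

```python
import os
import stat


def ensure_executable(path):
    """Restore execute permission on a shipped file (hypothetical helper).

    For every permission class (user, group, other) that can read the
    file, also grant execute. Returns the resulting permission bits.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Shift each read bit (0o4) down to the matching execute bit (0o1).
    exec_bits = (mode & 0o444) >> 2
    new_mode = mode | exec_bits
    if new_mode != mode:
        os.chmod(path, new_mode)
    return new_mode
```

A node-side wrapper could call this on each shipped file before launching the streaming command. Preserving the original mode during the copy itself would make the fix-up unnecessary.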