Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------
  <input spec>::= input (<input stream spec> [using <serializer>]{, <input 
stream spec> [using <serializer>]})
  <output spec>::= output (<output stream spec> [using <deserializer>]{, 
<output stream spec> [using <deserializer>]})
  <ship spec>::=ship(<file spec>{,<file spec>})
- <cache spec>::=cache(<file spec>{,<file spec>})
+ <cache spec>::=cache(<cache file spec>{,<cache file spec>})
  <input stream spec> ::= stdin | <file spec>
  <output stream spec> ::= stdout | <file spec>
+ <cache file spec>::='<dfs file spec>#<dfs file name>'
  <file spec> ::= unix file path enclosed in single quotes
+ <dfs file spec> ::= path specification of the distributed file system
+ <dfs file name> ::= file name on the distributed file system
  <serializer> ::= <udf spec>
  <deserializer> ::= <udf spec>
  }}}
@@ -184, +187 @@

  Similarly to 2.1, a user will be able to specify cached files via the `cache` 
clause in the define statement. For instance,
  
  {{{
- define X `stream.pl foo.gz` ship('stream.pl') cache('foo.gz')
+ define X `stream.pl foo.gz` ship('stream.pl') cache('/input/foo.gz#foo.gz')
  }}}
  
- Pig will not ship the files but would expect the files to be available on the 
compute nodes.
+ Pig will not ship the files specified in the cache spec but expects the 
files to be available on the compute nodes in the specified location. The name 
that follows '#' indicates how the user should refer to the cached file in the 
script.
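
As a rough illustration of the `<dfs file spec>#<dfs file name>` form, the Python sketch below splits a cache file spec at the `#`. The helper name `parse_cache_spec` is hypothetical and not part of Pig; how Pig itself parses the clause is not described here, so treat this only as a reading aid for the grammar.

```python
def parse_cache_spec(spec: str) -> tuple[str, str]:
    """Split '<dfs file spec>#<dfs file name>' into (dfs_path, local_name).

    Hypothetical helper for illustration; not Pig's actual parser.
    """
    # Split at the last '#': the left part is the DFS path, the right part
    # is the name the script uses to refer to the cached file.
    dfs_path, sep, local_name = spec.rpartition("#")
    if not sep or not dfs_path or not local_name:
        raise ValueError(
            f"expected '<dfs file spec>#<dfs file name>', got {spec!r}"
        )
    return dfs_path, local_name


dfs_path, local_name = parse_cache_spec("/input/foo.gz#foo.gz")
print(dfs_path, local_name)
```

With the spec from the define statement above, the left part (`/input/foo.gz`) is the location on the distributed file system, and the right part (`foo.gz`) is the name passed to `stream.pl`.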
  
- If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will 
create a symlink in the task's cwd for the cached file. So, one can use this to 
distribute binaries too, since cwd is on the PATH. Note that the symlink feature 
`has to be` used by Pig-Streaming users: they cannot predict the actual path of 
the cached file on the cluster nodes, whereas the symlink is always in the cwd.
- 
- {{{
- define X `./stream.pl` cache('/home/joe/foo#stream.pl')
- }}}
  
  [[Anchor(Input/Output_Handling)]]
  === 3 Input/Output Handling ===
