DEFINE streaming options are ill defined and not properly documented
--------------------------------------------------------------------
Key: PIG-1622
URL: https://issues.apache.org/jira/browse/PIG-1622
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Xuefu Zhang
Priority: Minor
Fix For: 0.9.0
According to the documentation
(http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#DEFINE) the syntax
for DEFINE when used to define a streaming command is:
DEFINE cmd INPUT(stdin|path) OUTPUT(stdout|stderr|path) SHIP(path [, path,
...]) CACHE (path [, path, ...])
However, the actual parser accepts something pretty different. Consider the
following script:
{code}
define strm `wc -l` INPUT(stdin)
CACHE('/Users/gates/.vimrc#myvim')
OUTPUT(stdin)
INPUT('/tmp/fred')
OUTPUT('/tmp/bob')
SHIP('/Users/gates/.bashrc')
SHIP('/Users/gates/.vimrc')
CACHE('/Users/gates/.bashrc#mybash')
stderr('/tmp/errors' limit 10);
A = load '/Users/gates/test/data/studenttab10';
B = stream A through strm;
dump B;
{code}
The above actually parsers. I see several issues here:
# What do multiple INPUT and OUTPUT statements mean in the context of
streaming? These should not be allowed.
# The documentation implies an order (INPUT, OUTPUT, SHIP, CACHE) that is not
enforced by the parser. We should either enforce the order in the parser or
update the documentation. Most likely the latter to avoid breaking existing
scripts.
# Why are multiple SHIP and CACHE clauses allowed when each can take multiple
paths? It seems we should only allow one of each.
# The error clause is completely different that what is given in the
documentation. I suspect this is a documentation error and the grammar
supported by the parser here is what we want.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.