Streaming: better control over input splits
-------------------------------------------

                 Key: HADOOP-2278
                 URL: https://issues.apache.org/jira/browse/HADOOP-2278
             Project: Hadoop
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: arkady borkovsky


In streaming, the map command usually expects to receive its input uninterpreted
-- just as it is stored in DFS.
However, the split (the beginning and the end of the portion of data that goes
to a single map task) is often important and cannot be placed at just any line
break.
Often the input consists of multi-line documents -- e.g. in XML.

There should be a way to specify a pattern that separates logical records.
The existing "Streaming XML record reader" partly provides this functionality.
However, it is accepted that "Streaming XML" is a hack and needs to be replaced.
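
To make the request concrete, here is a minimal sketch of separator-aware
splitting. It is not part of Hadoop Streaming: the function name
records_in_split and its parameters are hypothetical, the data is held in
memory rather than read from DFS, and the separator is assumed to be a literal
byte string that cannot overlap itself (a regular-expression pattern would need
extra care at split boundaries). The rule it illustrates is that a record
belongs to the split containing its first byte, so a map task skips a leading
partial record and reads past the end of its split to finish its last one.

```python
from typing import Iterator


def records_in_split(data: bytes, start: int, end: int, sep: bytes) -> Iterator[bytes]:
    """Yield each logical record that begins inside [start, end).

    Records are the chunks of `data` between occurrences of `sep`.
    The last record yielded may extend past `end`.
    """
    if start == 0:
        pos = 0
    else:
        # Back up by len(sep) so a separator straddling `start` is still seen.
        hit = data.find(sep, max(0, start - len(sep)))
        if hit < 0:
            return  # no record begins at or after `start`
        pos = hit + len(sep)

    while pos < end and pos < len(data):
        hit = data.find(sep, pos)
        if hit < 0:
            yield data[pos:]  # final record has no trailing separator
            return
        yield data[pos:hit]
        pos = hit + len(sep)
```

With a separator such as b"</doc>\n", every multi-line document is delivered
whole to exactly one map task, regardless of where the byte boundaries of the
splits fall.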

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
