reduce scans/copies while reading data in hadoop streaming
----------------------------------------------------------
Key: HADOOP-3255
URL: https://issues.apache.org/jira/browse/HADOOP-3255
Project: Hadoop Core
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 0.16.2
Reporter: Joydeep Sen Sarma
Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826
Currently we copy over an entire line (from readLine) and then break it into
two strings by splitting on tab. That costs an extra scan of the input data and
an extra copy for the split.
If we instead generalized LineReader to read until it hits a given delimiter,
we could do it with one less scan and one less copy. Something like:
    byte[] tabDelimiter = new byte[1];
    tabDelimiter[0] = '\t';
    byte[] newlineDelimiter = new byte[2];
    newlineDelimiter[0] = '\n';
    newlineDelimiter[1] = '\r';
    while() {
        lineReader.setDelimiter(tabDelimiter);
        lineReader.readLine(key);
        lineReader.setDelimiter(newlineDelimiter);
        lineReader.readLine(value);
    }
(Take my proposed interfaces with a pinch of salt; they are just meant to
convey the idea.)
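To make the idea concrete, here is a minimal, self-contained sketch of a reader
that stops at a caller-selected delimiter. It is not the actual Hadoop
LineReader: the class DelimiterReader and the method readUntilDelimiter are
hypothetical names invented for this illustration, and the sketch reads from a
plain java.io.InputStream rather than from Hadoop's own buffers.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the Hadoop LineReader API.
class DelimiterReader {
    private final InputStream in;
    private byte[] delimiters = { '\n' };

    DelimiterReader(InputStream in) {
        this.in = new BufferedInputStream(in);
    }

    // Any single byte in the array ends the current field.
    void setDelimiter(byte[] delimiters) {
        this.delimiters = delimiters.clone();
    }

    // Copies bytes into 'out' until a delimiter byte or end of stream.
    // Returns the number of bytes consumed, delimiter included (0 at EOF).
    // Note: a "\r\n" pair is treated as two separate delimiters here.
    int readUntilDelimiter(ByteArrayOutputStream out) {
        out.reset();
        int consumed = 0;
        try {
            int b;
            while ((b = in.read()) != -1) {
                consumed++;
                if (isDelimiter((byte) b)) {
                    break;
                }
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return consumed;
    }

    private boolean isDelimiter(byte b) {
        for (byte d : delimiters) {
            if (d == b) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] input = "k1\tv1\nk2\tv2\n".getBytes(StandardCharsets.UTF_8);
        DelimiterReader reader = new DelimiterReader(new ByteArrayInputStream(input));
        byte[] tab = { '\t' };
        byte[] newline = { '\n' };
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        while (true) {
            reader.setDelimiter(tab);
            if (reader.readUntilDelimiter(key) == 0) {
                break; // end of input
            }
            reader.setDelimiter(newline);
            reader.readUntilDelimiter(value);
            System.out.println(new String(key.toByteArray(), StandardCharsets.UTF_8)
                    + " => " + new String(value.toByteArray(), StandardCharsets.UTF_8));
        }
    }
}
```

Each input byte is examined once and copied once, straight into the key or the
value buffer. One caveat of this naive form: a line with no tab would have its
newline swallowed into the key, so a real implementation would need to decide
how to handle records that lack a separator.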