[
https://issues.apache.org/jira/browse/MAPREDUCE-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer updated MAPREDUCE-573:
---------------------------------------
Labels: newbie (was: )
> reduce scans/copies while reading data in hadoop streaming
> ----------------------------------------------------------
>
> Key: MAPREDUCE-573
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-573
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/streaming
> Reporter: Joydeep Sen Sarma
> Labels: newbie
>
> Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826
> We copy over an entire line (from readLine) and then break it into two
> strings by splitting on tab. So there is an extra scan of the input data and
> an extra copy, both caused by the split on tab.
> If we instead generalized LineReader to read until it hits a given
> delimiter, we could do this with one less scan and one less copy. Something like:
> byte[] tabDelimiter = new byte[] { '\t' };
> byte[] newlineDelimiter = new byte[] { '\n', '\r' };
> while (/* more input */) {
>   lineReader.setDelimiter(tabDelimiter);
>   lineReader.readLine(key);
>   lineReader.setDelimiter(newlineDelimiter);
>   lineReader.readLine(value);
> }
> (Take my proposed interfaces with a pinch of salt; they are just meant to convey the idea.)
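
To make the idea above concrete, here is a minimal, self-contained sketch in plain Java. It is not Hadoop's LineReader and not the patch for this issue: the class name DelimiterReaderSketch and the method readUntil are hypothetical, and the delimiter array is assumed to be a set of alternative single-byte delimiters rather than a multi-byte sequence. It only illustrates filling the key and the value directly from the stream, with no intermediate full-line copy and no second pass to find the tab.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Hadoop.
public class DelimiterReaderSketch {

    /**
     * Copies bytes from 'in' into 'out' until one of the given delimiter
     * bytes (assumed: a set of alternatives) or end of stream is reached.
     * The delimiter is consumed but not copied.
     *
     * @return the number of bytes consumed, or -1 if the stream was
     *         already at end of stream
     */
    public static int readUntil(InputStream in, byte[] delimiters,
                                ByteArrayOutputStream out) throws IOException {
        out.reset();
        int consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            for (byte d : delimiters) {
                if (d == (byte) b) {
                    return consumed; // stop at the first delimiter seen
                }
            }
            out.write(b);
        }
        return consumed == 0 ? -1 : consumed;
    }

    public static void main(String[] args) throws IOException {
        byte[] tab = { '\t' };
        byte[] newline = { '\n' };
        InputStream in = new ByteArrayInputStream(
                "k1\tv1\nk2\tv2\n".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();

        // Read the key up to the tab, then the value up to the newline.
        while (readUntil(in, tab, key) != -1) {
            readUntil(in, newline, value);
            System.out.println(key.toString("UTF-8") + " => " + value.toString("UTF-8"));
        }
    }
}
```

Run as written, main prints "k1 => v1" and then "k2 => v2". Two caveats on the sketch: a record with no tab would cause the key read to run past the newline into the next record, so a real implementation would need to stop the key at either delimiter; and reading byte by byte is for clarity only, since a real reader would scan an internal buffer.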