[ 
https://issues.apache.org/jira/browse/MAPREDUCE-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated MAPREDUCE-573:
---------------------------------------

    Labels: newbie  (was: )

> reduce scans/copies while reading data in hadoop streaming
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-573
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-573
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/streaming
>            Reporter: Joydeep Sen Sarma
>              Labels: newbie
>
> Follow-up from: http://issues.apache.org/jira/browse/HADOOP-2826
> We copy over an entire line (from readLine) and then break it into two 
> strings by splitting on tab. So there is an extra scan of the input data and 
> an extra copy caused by splitting on tab.
> If we instead generalized LineReader to read until it hits a 
> delimiter, we could do it with one less scan and copy. Something like:
> byte[] tabDelimiter = new byte[1];
> tabDelimiter[0] = '\t';
> byte[] newlineDelimiter = new byte[2];
> newlineDelimiter[0] = '\n';
> newlineDelimiter[1] = '\r';
> while (/* more input */) {
>   lineReader.setDelimiter(tabDelimiter); lineReader.readLine(key);
>   lineReader.setDelimiter(newlineDelimiter); lineReader.readLine(value);
> }
> (Take my proposed interfaces with a pinch of salt; they are just to convey the idea.)
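
The idea in the description can be sketched in plain Java as follows. This is only an illustration under stated assumptions: DelimitedReader, readUntil and readPairs are hypothetical names, not Hadoop's LineReader API, and the sketch treats only '\n' as the record terminator. The key is read up to a tab or newline, so a line with no tab yields an empty value instead of running into the next record. A real implementation would scan a buffer rather than call read() per byte.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Hadoop's LineReader): reads bytes up to any of a set of delimiters. */
final class DelimitedReader {
    private final InputStream in;

    DelimitedReader(InputStream in) {
        this.in = in;
    }

    /**
     * Appends bytes to out until a delimiter byte or end of stream is reached.
     * The delimiter is consumed but not stored.
     * Returns the delimiter that ended the field, or -1 at end of stream.
     */
    int readUntil(byte[] delimiters, ByteArrayOutputStream out) throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            for (byte d : delimiters) {
                if ((byte) b == d) {
                    return b;
                }
            }
            out.write(b);
        }
        return -1;
    }

    /** Reads (key, value) pairs, scanning each input byte for delimiters only once. */
    static List<String[]> readPairs(InputStream in) throws IOException {
        DelimitedReader reader = new DelimitedReader(in);
        byte[] keyDelimiters = {'\t', '\n'};   // a newline here means "no value on this line"
        byte[] valueDelimiters = {'\n'};       // assumption: '\n' alone ends a record
        List<String[]> pairs = new ArrayList<>();
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        ByteArrayOutputStream value = new ByteArrayOutputStream();
        while (true) {
            key.reset();
            value.reset();
            int end = reader.readUntil(keyDelimiters, key);
            if (end == -1 && key.size() == 0) {
                break; // end of input, nothing left to emit
            }
            if (end == '\t') {
                reader.readUntil(valueDelimiters, value);
            }
            pairs.add(new String[] {
                new String(key.toByteArray(), StandardCharsets.UTF_8),
                new String(value.toByteArray(), StandardCharsets.UTF_8)
            });
        }
        return pairs;
    }

    /** Convenience wrapper for in-memory text; wraps the checked IOException. */
    static List<String[]> readPairs(String text) {
        try {
            return readPairs(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String[] pair : readPairs("k1\tv1\nk2\nk3\tv3")) {
            System.out.println("key=" + pair[0] + " value=" + pair[1]);
        }
    }
}
```

Compared with reading a whole line and then splitting it on tab, the key and value bytes go straight into their own buffers, which is the extra scan and copy the description wants to avoid.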



--
This message was sent by Atlassian JIRA
(v6.2#6252)
