[
https://issues.apache.org/jira/browse/PIG-201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587658#action_12587658
]
Benjamin Reed commented on PIG-201:
-----------------------------------
The InputStream we get from Hadoop DFS should already be buffered, so we don't
buffer again in BufferedPositionedInputStream. This matters because the
buffering needs to happen before the compression codecs so that the
positioning works out properly. Buffering after them, as this patch does, will
cause premature detection of the end of a split.
Having said all that, there is obviously a performance gain to be had. Perhaps
we need to figure out why the buffering done by the Hadoop DFS InputStream
isn't helping us. If we do need to buffer, the buffering should go into
PigSlice.init(), wrapping fsis.
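
To illustrate the positioning concern, here is a minimal standalone sketch. It
is not Pig code: CountingInputStream and ReadAheadDemo are hypothetical names,
and a ByteArrayInputStream stands in for the DFS stream. It only shows that a
position counter sitting below a BufferedInputStream runs ahead of what the
consumer has actually read, which is one way a reader could conclude it has
reached the end of its split too early.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadAheadDemo {
    /** Hypothetical stand-in for a position tracker: counts bytes pulled through it. */
    static class CountingInputStream extends FilterInputStream {
        long pulled = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) pulled++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) pulled += n;
            return n;
        }
    }

    /** Counter below the buffer: it sees the buffer's read-ahead, not the consumer's reads. */
    static long pulledAfterOneRead(int bufferSize) throws IOException {
        CountingInputStream counter =
            new CountingInputStream(new ByteArrayInputStream(new byte[1024]));
        InputStream in = new BufferedInputStream(counter, bufferSize);
        in.read(); // the consumer asks for a single byte
        return counter.pulled;
    }

    /** Counter above the buffer: it sees exactly what the consumer has read. */
    static long countedAfterOneRead(int bufferSize) throws IOException {
        CountingInputStream counter = new CountingInputStream(
            new BufferedInputStream(new ByteArrayInputStream(new byte[1024]), bufferSize));
        counter.read(); // the consumer asks for a single byte
        return counter.pulled;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("counter below buffer: " + pulledAfterOneRead(64));
        System.out.println("counter above buffer: " + countedAfterOneRead(64));
    }
}
```

In this sketch the order of the wrappers decides which position is reported,
so any buffering added to the real code has to sit on the correct side of
whatever tracks the split boundary.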
> BufferedPositionedInputStream is not buffered
> ---------------------------------------------
>
> Key: PIG-201
> URL: https://issues.apache.org/jira/browse/PIG-201
> Project: Pig
> Issue Type: Bug
> Reporter: Mathieu Poumeyrol
> Attachments: BufferedPositionedInputStream.patch
>
>
> BufferedPositionedInputStream is actually not buffered, leading (I guess) to
> constant round trips to DFS as bytes are read one by one. I just wrapped the
> provided input stream in the constructor in a good old BufferedInputStream.
> I measured a 40% performance boost on a script that reads and writes 3.7GB in
> DFS through PigStorage on one node. I guess the impact may be greater on a
> real HDFS cluster with actual network round trips.
> FYI, the issue was found while profiling with the YourKit Java profiler.
> Useful toy...
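
The effect described in the report can be reproduced in isolation with a
small sketch like the one below. It is not the attached patch:
CallCountingInputStream and BufferingDemo are hypothetical names, and an
in-memory array stands in for DFS. It counts how many read calls reach the
underlying stream when a consumer reads byte by byte, with and without a
BufferedInputStream in between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferingDemo {
    /** Hypothetical wrapper that counts read calls reaching the underlying stream. */
    static class CallCountingInputStream extends FilterInputStream {
        int calls = 0;

        CallCountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            calls++;
            return super.read(buf, off, len);
        }
    }

    /** Drains `size` bytes one at a time and returns the number of calls on the source. */
    static int underlyingCalls(boolean buffered, int size) throws IOException {
        CallCountingInputStream source =
            new CallCountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        while (in.read() != -1) {
            // consume byte by byte, as a line-oriented parser might
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        int size = 100_000;
        System.out.println("unbuffered calls: " + underlyingCalls(false, size));
        System.out.println("buffered calls:   " + underlyingCalls(true, size));
    }
}
```

Each call that reaches the source here is cheap, but against a remote file
system every one of them could be a round trip, which would be consistent
with the speedup the reporter measured.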