[
https://issues.apache.org/jira/browse/TEZ-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083331#comment-14083331
]
Siddharth Seth commented on TEZ-1343:
-------------------------------------
[~pramachandran] - comments on the patch.
Don't think there's anything absolutely critical, but needs some changes.
- Till this is tested more - unit tests as well as cluster testing, I don't
think we should enable it for regular runs. A config parameter, which is
disabled by default should do this. This will have to be enabled explicitly by
clients when running in local mode.
- The changes in LocalClient actually belong to LocalContainerLauncher - during
initialization. Also the port should be 0 instead of 53000 as [~jeagles]
pointed out in TEZ-870.
- FileChunkedPath
-- Can we avoid any kind of URI building in this. The main reason that's used
is for hashCode / equals. That could be achieved instead by using the uri from
the parent Path and using the length/offset to come up with a new object which
can be used for these methods.
-- I don't think toString should be overriden. It's used in places to construct
other Paths / URIs. This should just be the Path implementation.
-- Eventually, it may be better to change FileChunkedPath to be a wrapper
around Path rather than extending it. That will change the code extensively,
but is safer in terms of Path usage itself. This is a separate jira.
- MergeManager "long offset = 0, size = 0;" - should be separate lines.
Similarly in finalMerge
- When creating the Segment in the main merger - "preserve" needs to be set
appropriately, like it is in the finalMerge.
- The TezMerger.merge method that is used ends up not sorting the segments -
which used to be done earlier. That behaviour should be retained - this may
require a new merge method.
- Another alternate is to change Segment to recognize FileChunkedPath to decide
offsets, preserve etc, - but that's a bigger change and can be made later.
- ShuffleInputEventHandler "URI baseUri =" can be moved into the else block
- ShuffleInputEventHandler and ShuffleInputEventHandlerImpl -
getShuffleInputFileName and getIndexRecord (in both ShuffleInputEventHandler
and ShuffleInputEventHandlerImpl) - the argument to these methods says 'appId'.
This isn't actually the appId, it's a unique path generated per task by the
previous step. Should be renamed to something like 'pathComponent' to avoid
confusion.
- ShuffleInputEventHandler and ShuffleInputEventHandlerImpl - Instead of
"InetAddress.getLocalHost().getHostName()" -
"System.getenv(ApplicationConstants.Environment.NM_HOST.toString())" should be
used. That avoids the InetAddress calls, and ends up making use of the same
parameter that was used to generate the event in the source task.
- LocalDiskFetchedInput - localFS = FileSystem.get(conf); should be
FileSystem.getLocal(conf)
- LocalDiskFetchedInput.getInputStream - inputStream.skip(). According to the
skip contract, this needs to be in a loop. Also, it's not very efficient. This
can be replaced with "FSDataInputStream inputStream = localFS.open(inputFile);
inputStream.seek(...)
- LocalDiskFetchedInput - in the toString, it'll be useful to have the filename
and offset
> Bypass the Fetcher and read directly from the local filesystem if source
> vertex ran on the same host
> ----------------------------------------------------------------------------------------------------
>
> Key: TEZ-1343
> URL: https://issues.apache.org/jira/browse/TEZ-1343
> Project: Apache Tez
> Issue Type: Task
> Affects Versions: 0.4.1
> Reporter: Prakash Ramachandran
> Assignee: Prakash Ramachandran
> Attachments: TEZ-1343.WIP.1.patch, TEZ-1343.WIP.2.patch
>
>
> In the case of the source and current vertex are on the same host bypass the
> Fetcher and read it directly from the local filesystem
--
This message was sent by Atlassian JIRA
(v6.2#6252)