[ 
https://issues.apache.org/jira/browse/TEZ-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083331#comment-14083331
 ] 

Siddharth Seth commented on TEZ-1343:
-------------------------------------

[~pramachandran] - comments on the patch.
Don't think there's anything absolutely critical, but needs some changes.
- Till this is tested more - unit tests as well as cluster testing, I don't 
think we should enable it for regular runs. A config parameter, which is 
disabled by default should do this. This will have to be enabled explicitly by 
clients when running in local mode.
- The changes in LocalClient actually belong to LocalContainerLauncher - during 
initialization. Also the port should be 0 instead of 53000 as [~jeagles] 
pointed out in TEZ-870.
- FileChunkedPath
-- Can we avoid any kind of URI building in this. The main reason that's used 
is for hashCode / equals. That could be achieved instead by using the uri from 
the parent Path and using the length/offset to come up with a new object which 
can be used for these methods.
-- I don't think toString should be overriden. It's used in places to construct 
other Paths / URIs. This should just be the Path implementation.
-- Eventually, it may be better to change FileChunkedPath to be a wrapper 
around Path rather than extending it. That will change the code extensively, 
but is safer in terms of Path usage itself. This is a separate jira.
- MergeManager "long offset = 0, size = 0;" - should be separate lines. 
Similarly in finalMerge
- When creating the Segment in the main merger - "preserve" needs to be set 
appropriately, like it is in the finalMerge.
- The TezMerger.merge method that is used ends up not sorting the segments - 
which used to be done earlier. That behaviour should be retained - this may 
require a new merge method.
- Another alternate is to change Segment to recognize FileChunkedPath to decide 
offsets, preserve etc, - but that's a bigger change and can be made later.
- ShuffleInputEventHandler "URI baseUri =" can be moved into the else block
- ShuffleInputEventHandler and ShuffleInputEventHandlerImpl -  
getShuffleInputFileName and getIndexRecord (in both ShuffleInputEventHandler 
and ShuffleInputEventHandlerImpl) - the argument to these methods says 'appId'. 
This isn't actually the appId, it's a unique path generated per task by the 
previous step. Should be renamed to something like 'pathComponent' to avoid 
confusion.
- ShuffleInputEventHandler and ShuffleInputEventHandlerImpl - Instead of 
"InetAddress.getLocalHost().getHostName()" - 
"System.getenv(ApplicationConstants.Environment.NM_HOST.toString())" should be 
used. That avoids the InetAddress calls, and ends up making use of the same 
parameter that was used to generate the event in the source task.
- LocalDiskFetchedInput - localFS = FileSystem.get(conf); should be 
FileSystem.getLocal(conf)
- LocalDiskFetchedInput.getInputStream - inputStream.skip(). According to the 
skip contract, this needs to be in a loop. Also, it's not very efficient. This 
can be replaced with "FSDataInputStream inputStream = localFS.open(inputFile); 
inputStream.seek(...)
- LocalDiskFetchedInput - in the toString, it'll be useful to have the filename 
and offset

> Bypass the Fetcher and read directly from the local filesystem if source 
> vertex ran on the same host
> ----------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1343
>                 URL: https://issues.apache.org/jira/browse/TEZ-1343
>             Project: Apache Tez
>          Issue Type: Task
>    Affects Versions: 0.4.1
>            Reporter: Prakash Ramachandran
>            Assignee: Prakash Ramachandran
>         Attachments: TEZ-1343.WIP.1.patch, TEZ-1343.WIP.2.patch
>
>
> In the case of the source and current vertex are on the same host bypass the 
> Fetcher and read it directly from the local filesystem



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to