[ https://issues.apache.org/jira/browse/HIVE-11945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-11945:
------------------------------------
    Description: 
When “seek + readFully(buffer, offset, length)” is used, DFSInputStream ends 
up going via “readWithStrategy()”. This sets up the BlockReader with a length 
equal to the block size. Until that position is reached, 
RemoteBlockReader2.peer is not added to the PeerCache (please refer to 
RemoteBlockReader2.close() in HDFS), so the next call to the same 
DN ends up opening a new socket. In ORC, when the read is not data-local, 
this can result in opening/closing lots of connections with the DN.

In random reads, it would be good to set this length to the amount of data that 
is to be read (e.g. the pread call in DFSInputStream, which sets up the 
BlockReader’s length correctly and whose code path returns the Peer to the peer 
cache properly). “readFully(position, buffer, offset, length)” follows this 
code path and ends up reusing the connections properly. Creating this JIRA to 
fix this issue.
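
To illustrate the effect described above, here is a minimal toy model in plain Java. It is not HDFS code: PeerReuseSketch, Peer, randomRead and countSockets are hypothetical names invented for this sketch. It assumes, as the description states, that a peer goes back to the cache only when the reader has consumed the full length it was set up with, and that every random read uses a fresh reader.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of peer reuse; hypothetical names, not the HDFS implementation. */
class PeerReuseSketch {
    /** Stand-in for an open socket to a DataNode. */
    static final class Peer { }

    private final Deque<Peer> peerCache = new ArrayDeque<>();
    private int socketsOpened = 0;

    /** Reuse a cached peer if one exists, otherwise "open" a new socket. */
    private Peer acquirePeer() {
        Peer cached = peerCache.poll();
        if (cached != null) {
            return cached;
        }
        socketsOpened++;
        return new Peer();
    }

    /**
     * One random read of bytesWanted bytes through a reader that was set up
     * to serve readerLength bytes.
     */
    void randomRead(long readerLength, long bytesWanted) {
        Peer peer = acquirePeer();
        long consumed = Math.min(bytesWanted, readerLength);
        // Assumption taken from the description: the peer is cached only if
        // the reader reached the end of its configured length.
        if (consumed == readerLength) {
            peerCache.push(peer);
        }
        // Otherwise the peer is dropped and the next read needs a new socket.
    }

    /** Number of sockets opened for a series of identical random reads. */
    static int countSockets(int reads, long readerLength, long bytesWanted) {
        PeerReuseSketch model = new PeerReuseSketch();
        for (int i = 0; i < reads; i++) {
            model.randomRead(readerLength, bytesWanted);
        }
        return model.socketsOpened;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // illustrative block size
        long wanted = 4096;                  // illustrative read size
        // Reader length = block size, as on the seek + readFully path.
        System.out.println(countSockets(5, blockSize, wanted)); // prints 5
        // Reader length = bytes requested, as on the pread path.
        System.out.println(countSockets(5, wanted, wanted));    // prints 1
    }
}
```

In this model, five reads with the reader length set to the block size open five sockets, while five reads with the reader length set to the requested size share one socket.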


  was:
When “seek + readFully(buffer, offset, length)” is used,  DFSInputStream ends 
up going via “readWithStrategy()”.  This sets up BlockReader with length 
equivalent to that of the block size. So until this position is reached, 
RemoteBlockReader2.peer would not be added to the PeerCache (Plz refer 
RemoteBlockReader2.close() in HDFS).  So eventually the next call to the same 
DN would end opening a new socket.  In ORC, when it is not a data local read, 
this has a the possibility of opening/closing lots of connections with DN.  

In random reads, it would be good to set this length to the amount f data that 
is to be read (e.g pread call in DFSInputStream which sets up the BlockReader’s 
length correctly & the code path returns the Peer back to peer cache properly). 
 “readFully(position, buffer, offset, length)” follows this code path and ends 
up reusing the connections properly. Creating this JIRA to fix this issue.



> ORC with non-local reads may not be reusing connection to DN
> ------------------------------------------------------------
>
>                 Key: HIVE-11945
>                 URL: https://issues.apache.org/jira/browse/HIVE-11945
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: HIVE-11945.1.patch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
