[
https://issues.apache.org/jira/browse/HADOOP-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491870
]
eric baldeschwieler commented on HADOOP-1296:
---------------------------------------------
So are we comfortable returning 100s of thousands of records in a single RPC
from the name node? Would it be better to return a max of 10k records at a time,
or some such limit, with a clear restart policy? Or is it ok for a client to
open a socket and suck down that much data in one session? Clearly more RPCs
means more aggregate work; just wondering about starvation, locking, CPU spikes,
and all the usual suspects.
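For concreteness, a minimal sketch of what a capped call plus a client-side
restart loop could look like, assuming the BlockInfo type proposed below and
that fs is the FileSystem handle; the extra maxBlocks parameter and the loop
are purely illustrative, not part of any patch:

    // Hypothetical paged form of the proposed call; maxBlocks caps how many
    // block records the name node returns in a single RPC (illustrative only).
    BlockInfo[] getFileHints(Path file, long start, long len, int maxBlocks)
        throws IOException;

    // Client-side restart loop: ask again starting just past the last block
    // returned, until the requested range is covered or the reply is empty.
    long end = start + len;
    long cursor = start;
    List<BlockInfo> blocks = new ArrayList<BlockInfo>();
    while (cursor < end) {
      BlockInfo[] page = fs.getFileHints(file, cursor, end - cursor, 10000);
      if (page.length == 0) {
        break;                                        // no more blocks in range
      }
      Collections.addAll(blocks, page);
      cursor = page[page.length - 1].getStart() + 1;  // restart point for next RPC
    }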
> Improve interface to FileSystem.getFileCacheHints
> -------------------------------------------------
>
> Key: HADOOP-1296
> URL: https://issues.apache.org/jira/browse/HADOOP-1296
> Project: Hadoop
> Issue Type: Improvement
> Components: fs
> Reporter: Owen O'Malley
> Assigned To: dhruba borthakur
>
> The FileSystem interface provides a very limited interface for finding the
> location of the data. The current method looks like:
> String[][] getFileCacheHints(Path file, long start, long len) throws IOException
> which returns a list of "block info", where each block info consists of a list
> of host names. Because the hints don't include information about where the
> block boundaries are, map/reduce is required to call the name node for each
> split. I'd propose that we fix the naming a bit and make it:
> public class BlockInfo implements Writable {
>   public long getStart();
>   public String[] getHosts();
> }
> BlockInfo[] getFileHints(Path file, long start, long len) throws IOException;
> So that map/reduce can query about the entire file and get the locations in a
> single call.
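For illustration only (not part of the issue text above), a rough sketch of how
a map/reduce split calculator could consume the proposed call, assuming fs is a
FileSystem that already exposes getFileHints, blocks come back sorted by start
offset, and splits are a fixed size; the helper name and parameters are invented:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // One name node RPC covers the whole file; each split's hosts are then
    // looked up locally by matching split offsets against the block starts.
    public static String[][] hostsForSplits(FileSystem fs, Path file,
                                            long fileLen, long splitSize)
        throws IOException {
      BlockInfo[] blocks = fs.getFileHints(file, 0, fileLen);  // proposed call
      int numSplits = (int) ((fileLen + splitSize - 1) / splitSize);
      String[][] hosts = new String[numSplits][];
      int b = 0;
      for (int i = 0; i < numSplits; i++) {
        long splitStart = i * splitSize;
        // advance to the last block whose start offset is <= this split's start
        while (b + 1 < blocks.length && blocks[b + 1].getStart() <= splitStart) {
          b++;
        }
        hosts[i] = blocks[b].getHosts();
      }
      return hosts;
    }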
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.