Yup, just start reading from wherever the block starts and stop at the end of the block to do local reads.

-Joey
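A minimal sketch of that seek-based local read, using only the public
FileSystem API (the path is reused from the example later in this thread;
the config and buffer size are placeholders):

*************************************************************************
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalBlockRead {
    public static void main(String[] args) throws IOException {
        // Assumes the default configuration points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hive/warehouse/sample_07/sample_07.csv");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[64 * 1024];
        for (BlockLocation block : blocks) {
            // Seek to where the block starts and stop at the end of the
            // block. Run this on a node holding a replica and HDFS will
            // read the local copy.
            in.seek(block.getOffset());
            long remaining = block.getLength();
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break; // unexpected end of file
                remaining -= n;
                // ... process buf[0..n) ...
            }
        }
        in.close();
    }
}
*************************************************************************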
On Wed, Jan 11, 2012 at 11:31 AM, David Pavlis <david.pav...@javlin.eu> wrote:

> Hi Todd,
>
> If I use the FileSystem API and I am on the local node - how do I get/read
> just that particular block residing locally on that node? Do I just open
> the FSDataInputStream and seek() to the position where the block starts?
>
> Thanks,
>
> David.

On 10.1.12 7:19 PM, "Todd Lipcon" <t...@cloudera.com> wrote:

> On Tue, Jan 10, 2012 at 4:32 AM, David Pavlis <david.pav...@javlin.eu> wrote:
>> Hi Todd,
>>
>> Understood, I will re-think the MR approach. Nonetheless I like the idea
>> of getting the block id and accessing the locally stored file directly.
>
> You really don't want to do this - what happens when the balancer moves
> the block out from under you?
>
>> Is there a way (public interface) to find, for a particular datanode,
>> where the local storage is rooted (which local subdirs it uses)?
>> I found that this info is available in the Storage class, but is there a
>> "public" way of getting it - through some protocol or so?
>
> No, it's also a private API. If you use the FileSystem API, and you're on
> the local node, all of these optimizations will happen automatically for
> you, plus they'll keep working when you upgrade. Even MapReduce just uses
> the public APIs.
>
> -Todd

On 9.1.12 9:30 PM, "Todd Lipcon" <t...@cloudera.com> wrote:

> Hi David,
>
> I'd definitely recommend using MapReduce for this. What you've described
> is essentially identical to MR.
>
> Otherwise, you should use the public API
> FileSystem.getFileBlockLocations(), and then read the host names out of
> the returned BlockLocation struct. Then just use a normal FileSystem open
> call from that node - it will automatically pick the local replica for
> you without any further work.
>
> -Todd
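A minimal sketch of the approach Todd describes above - mapping each block
to the hosts holding a replica, so the work for each range can be
dispatched to one of them (the path is a placeholder):

*************************************************************************
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/data/input.csv"));

        // One BlockLocation per block: its byte range plus the datanodes
        // holding a replica. Schedule the work for each range on one of
        // the listed hosts; a plain fs.open() there picks the local replica.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " len=" + loc.getLength()
                    + " hosts=" + Arrays.toString(loc.getHosts()));
        }
    }
}
*************************************************************************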
On Mon, Jan 9, 2012 at 12:01 PM, David Pavlis <david.pav...@javlin.eu> wrote:

> Hi Todd,
>
> Thanks for letting me know. OK - here is what I am trying to do (it is a
> POC for now):
>
> We have an ETL framework which helps with transforming data - parsing
> various formats, reading from DBs, aggregating, sorting, etc. We currently
> have a concept of a "cluster" which basically allows input data (say a
> datafile) to be split/partitioned across several nodes, with one data
> transformation then executed on those data. The way it works is that we
> analyze what the transformation does, and if it is supposed to consume
> data which is spread over the cluster, we execute a slightly modified
> copy/instance of that transformation on each node of the cluster where
> some piece/partition of the data resides. We do not have the concept of
> any "clustered" filesystem - our partitioned data reside in ordinary files
> and there is no metadata layer on top. If we need one single output data
> file, we just perform a merge operation. That is quite limiting: if we
> need to manipulate such data, we have to do it piece by piece (on each
> participating node).
>
> So we essentially do split-transform-merge, with the merge being optional
> (it can be part of the transformation directly, so we don't create temp
> files).
>
> Here is the idea with HDFS - each of our transformation nodes becomes an
> HDFS datanode. Then, if we are to process particular input data split over
> several datanodes, we just instruct our transformation nodes to read the
> specific blocks of the file (the blocks which happen to be on the same
> physical machine where our transformation node is also a datanode) - hence
> my interest in BlockReader.
>
> I was also considering wrapping our transformation job into a map-reduce
> job on Hadoop, but that seems a bit limiting, and we would also need to
> "take" the whole Hadoop stack and give it control over our jobs. But that
> still might be the right way. Also, I need to solve the writing of
> partitioned data - here I would like to control the block allocation
> somehow, as ideally a transformation running on a particular node would be
> reading locally stored blocks and outputting data to locally allocated
> blocks of an HDFS file.
>
> Well, I hope I explained the situation clearly enough.
>
> I will be thankful for any comments.
>
> Regards,
>
> David.

On 9.1.12 6:59 PM, "Todd Lipcon" <t...@cloudera.com> wrote:

> Hi David,
>
> For what it's worth, you should be aware that you're calling internal APIs
> that have no guarantee of stability between versions. I can practically
> guarantee that your code will have to be modified for any HDFS upgrade you
> do. That's why these APIs are undocumented.
>
> Perhaps you can explain what your high-level goal is here, and we can
> suggest a supported mechanism for achieving it.
>
> -Todd
On Mon, Jan 9, 2012 at 9:56 AM, David Pavlis <david.pav...@javlin.eu> wrote:

> Hi Denny,
>
> Thanks a lot. I was able to make my code work.
>
> I am posting a small example below - in case somebody in the future has a
> similar need ;-) (not handling replica datablocks).
>
> David.

*************************************************************************
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

import javax.net.SocketFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.DFSClient.BlockReader; // internal nested class
import org.apache.hadoop.hdfs.protocol.ClientProtocol;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;
import org.apache.hadoop.net.NetUtils;

public class BlockReaderExample {

    public static void main(String args[]) {
        String filename = "/user/hive/warehouse/sample_07/sample_07.csv";
        int DATANODE_PORT = 50010;
        int NAMENODE_PORT = 8020;
        String HOST_IP = "192.168.1.230";

        byte[] buf = new byte[1000];

        try {
            // Internal API: talk to the namenode directly for the block list.
            ClientProtocol client = DFSClient.createNamenode(
                    new InetSocketAddress(HOST_IP, NAMENODE_PORT),
                    new Configuration());

            LocatedBlocks located =
                    client.getBlockLocations(filename, 0, Long.MAX_VALUE);

            for (LocatedBlock block : located.getLocatedBlocks()) {
                // Connect straight to a datanode (hardcoded; replicas ignored).
                Socket sock = SocketFactory.getDefault().createSocket();
                InetSocketAddress targetAddr =
                        new InetSocketAddress(HOST_IP, DATANODE_PORT);
                NetUtils.connect(sock, targetAddr, 10000);
                sock.setSoTimeout(10000);

                // The block token from the LocatedBlock authorizes the read.
                BlockReader reader = BlockReader.newBlockReader(sock, filename,
                        block.getBlock().getBlockId(),
                        block.getBlockToken(),
                        block.getBlock().getGenerationStamp(),
                        0,                     // offset into the block
                        block.getBlockSize(),  // bytes to read
                        1000);                 // buffer size

                int length;
                // Read until the BlockReader signals the end of the block;
                // a short read does not necessarily mean the block is done.
                while ((length = reader.read(buf, 0, buf.length)) > 0) {
                    // System.out.print(new String(buf, 0, length, "UTF-8"));
                }
                reader.close();
                sock.close();
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
*************************************************************************
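The example above connects to one hardcoded datanode and, as David notes,
ignores replicas. A sketch of picking the address from the block's own
replica list instead - the method names come from the same internal,
version-specific API, so treat them as assumptions:

*************************************************************************
// Hypothetical replacement for the fixed HOST_IP:DATANODE_PORT above
// (needs org.apache.hadoop.hdfs.protocol.DatanodeInfo):
DatanodeInfo[] replicas = block.getLocations();
if (replicas.length == 0) {
    throw new IOException("no live replicas for block " + block.getBlock());
}
// DatanodeInfo.getName() returns "host:port" in this era of the API.
InetSocketAddress targetAddr = NetUtils.createSocketAddr(replicas[0].getName());
*************************************************************************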
From: Denny Ye <denny...@gmail.com>
Reply-To: <hdfs-user@hadoop.apache.org>
Date: Mon, 9 Jan 2012 16:29:18 +0800
To: <hdfs-user@hadoop.apache.org>
Subject: Re: How-to use DFSClient's BlockReader from Java

hi David,

Please refer to the method "DFSInputStream#blockSeekTo" - it has the same
purpose as yours.

*************************************************************************
      LocatedBlock targetBlock = getBlockAt(target, true);
      assert (target == this.pos) : "Wrong position " + pos + " expect " + target;
      long offsetIntoBlock = target - targetBlock.getStartOffset();

      DNAddrPair retval = chooseDataNode(targetBlock);
      chosenNode = retval.info;
      InetSocketAddress targetAddr = retval.addr;

      try {
        s = socketFactory.createSocket();
        NetUtils.connect(s, targetAddr, socketTimeout);
        s.setSoTimeout(socketTimeout);
        Block blk = targetBlock.getBlock();
        Token<BlockTokenIdentifier> accessToken = targetBlock.getBlockToken();

        blockReader = BlockReader.newBlockReader(s, src,
            blk.getBlockId(),
            accessToken,
            blk.getGenerationStamp(),
            offsetIntoBlock, blk.getNumBytes() - offsetIntoBlock,
            buffersize, verifyChecksum, clientName);
*************************************************************************

-Regards
Denny Ye

2012/1/6 David Pavlis <david.pav...@javlin.eu>:

> Hi,
>
> I am relatively new to Hadoop and I am trying to utilize HDFS for my own
> application where I want to take advantage of the data partitioning HDFS
> performs.
>
> The idea is that I get the list of individual blocks - BlockLocations - of
> a particular file and then read those directly (going to the individual
> DataNodes). So far I found org.apache.hadoop.hdfs.DFSClient.BlockReader to
> be the way to go.
>
> However, I am struggling with instantiating the BlockReader class, namely
> creating the "Token<BlockTokenIdentifier>".
>
> Is there example Java code showing how to access the individual blocks of
> a particular file stored on HDFS?
>
> Thanks in advance,
>
> David.

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434