Hi Austin,

I'm glad that helped out. Regarding the -p flag for distcp, here's the online documentation:
http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index

You can also get this info from running 'hadoop distcp' without any flags.

--------
-p[rbugp]   Preserve
            r: replication number
            b: block size
            u: user
            g: group
            p: permission
--------

-- Adam

On May 7, 2012, at 10:55 PM, Austin Chungath wrote:

> Thanks Adam,
>
> That was very helpful. Your second point solved my problem :-)
> The hdfs port number was wrong.
> I didn't use the option -ppgu; what does it do?
>
> On Mon, May 7, 2012 at 8:07 PM, Adam Faris <afa...@linkedin.com> wrote:
>
>> Hi Austin,
>>
>> I don't know about using CDH3, but we use distcp for moving data between different versions of Apache grids, and several things come to mind.
>>
>> 1) You should use the -i flag to ignore checksum differences on the blocks. I'm not 100% sure, but I want to say hftp doesn't support checksums on the blocks as they go across the wire.
>>
>> 2) You should read from hftp but write to hdfs. Also make sure to check your port numbers. For example, I can read from hftp on port 50070 and write to hdfs on port 9000. You'll find the hftp port in hdfs-site.xml and the hdfs port in core-site.xml on Apache releases.
>>
>> 3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3 support security? If security is enabled on 0.20.205 and CDH3 does not support it, you will need to disable security on 0.20.205, because you cannot write from a secure to an unsecured grid.
>>
>> 4) Use the -m flag to limit your mappers so you don't DDoS your network backbone.
>>
>> 5) Why isn't your vendor helping you with the data migration? :)
>>
>> Otherwise something like this should get you going:
>>
>> hadoop distcp -i -ppgu -log /tmp/mylog -m 20 hftp://mynamenode.grid.one:50070/path/to/my/src/data hdfs://mynamenode.grid.two:9000/path/to/my/dst
>>
>> -- Adam
>>
>> On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:
>>
>>> Things to check:
>>>
>>> 1) When you launch the distcp jobs, all the datanodes of the older hdfs are live and connected.
>>> 2) When you launch distcp, no data is being written/moved/deleted in hdfs.
>>> 3) You can use the option -log to log errors into a directory, and use -i to ignore errors.
>>>
>>> Also, you can try using distcp with the hdfs protocol instead of hftp. For more you can refer to
>>> https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
>>>
>>> If it failed there should be some error.
>>>
>>> On Mon, May 7, 2012 at 4:44 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>
>>>> Ok, that was a lame mistake.
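A note on where the port numbers used below come from: on a 0.20-era Apache release, the hftp port is the namenode's web/HTTP port set in hdfs-site.xml, and the hdfs:// port is the namenode's RPC port set in core-site.xml. The hostname and values here are only placeholders; check your own config files for the real ones.

In hdfs-site.xml (hftp reads go through this port):

  <property>
    <name>dfs.http.address</name>
    <value>namenode.example.com:50070</value>
  </property>

In core-site.xml (hdfs:// writes go through this port):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>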
>>>> $ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
>>>> I had spelled hdfs instead of "hftp".
>>>>
>>>> $ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
>>>> 12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
>>>> 12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
>>>> With failures, global counters are inaccurate; consider running with -i
>>>> Copy failed: java.io.IOException: Not supported
>>>> at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
>>>> at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
>>>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
>>>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
>>>>
>>>> Any idea why this error is occurring?
>>>> I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop).
>>>>
>>>> Thanks & Regards,
>>>> Austin
>>>>
>>>> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>>
>>>>> Thanks,
>>>>>
>>>>> So I decided to try and move the data using distcp.
>>>>>
>>>>> $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
>>>>> 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
>>>>> 12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
>>>>> With failures, global counters are inaccurate; consider running with -i
>>>>> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)
>>>>>
>>>>> I found that we can do distcp like the above only if both sides are the same hadoop version, so I tried:
>>>>>
>>>>> $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
>>>>> 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
>>>>> 12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy
>>>>>
>>>>> But this process seemed to hang at this stage. What might I be doing wrong?
>>>>>
>>>>> hftp://<dfs.http.address>/<path>
>>>>> hftp://localhost:50070 is dfs.http.address of 0.20.205
>>>>> hdfs://localhost:60070 is dfs.http.address of cdh3u3
>>>>>
>>>>> Thanks and regards,
>>>>> Austin
>>>>>
>>>>> On Fri, May 4, 2012 at 4:30 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> Ok... So riddle me this...
>>>>>> I currently have a replication factor of 3. I reset it to two.
>>>>>>
>>>>>> What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes?
>>>>>>
>>>>>> The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2, and then when the data is moved over, increase it to a factor of 3. Or maybe not.
>>>>>>
>>>>>> I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance, and then repeat.
>>>>>>
>>>>>> Even though this is a dev cluster, the OP wants to retain the data.
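For concreteness, the reclaim-space loop described above (and the replication drop suggested further down in the thread) would look roughly like this on the 0.20.205 side. The path is a placeholder, and each step is worth sanity-checking on a small directory first:

  # drop replication to 2 on data that still lives only on the old cluster, to free headroom
  hadoop fs -setrep -R 2 /path/still/on/old/cluster

  # check block health before deleting anything that has already been copied
  hadoop fsck /

  # after removing copied data or decommissioning a node, spread the remaining blocks out again
  hadoop balancer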
>>>>>>
>>>>>> There are other options depending on the amount and size of the new hardware.
>>>>>> I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster.
>>>>>>
>>>>>> If 8 TB was the amount of disk used, that would be about 2.67 TB of actual data. Let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines. Lots of options... ;-)
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On May 3, 2012, at 11:26 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>>>>>
>>>>>>> This is probably a more relevant question for the CDH mailing lists. That said, what Edward is suggesting seems reasonable. Reduce the replication factor, decommission some of the nodes, create a new cluster with those nodes, and do a distcp.
>>>>>>>
>>>>>>> Could you share with us the reasons you want to migrate from Apache 205?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Suresh
>>>>>>>
>>>>>>> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving stuff.
>>>>>>>>
>>>>>>>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>> Ok... When you get your new hardware...
>>>>>>>>>
>>>>>>>>> Set up one server as your new NN, JT, SN.
>>>>>>>>> Set up the others as DNs.
>>>>>>>>> (Cloudera CDH3u3)
>>>>>>>>>
>>>>>>>>> On your existing cluster...
>>>>>>>>> Remove your old log files, temp files on HDFS, anything you don't need.
>>>>>>>>> This should give you some more space.
>>>>>>>>> Start copying some of the directories/files to the new cluster.
>>>>>>>>> As you gain space, decommission a node, rebalance, add the node to the new cluster...
>>>>>>>>>
>>>>>>>>> It's a slow process.
>>>>>>>>>
>>>>>>>>> Should I remind you to make sure you bump up your bandwidth setting, and to clean up the hdfs directories when you repurpose the nodes?
>>>>>>>>>
>>>>>>>>> Does this make sense?
>>>>>>>>>
>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>
>>>>>>>>> Mike Segel
>>>>>>>>>
>>>>>>>>> On May 3, 2012, at 5:46 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah I know :-)
>>>>>>>>>> And this is not a production cluster ;-) and yes there is more hardware coming :-)
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Well, you've kind of painted yourself into a corner...
>>>>>>>>>>> Not sure why you didn't get a response from the Cloudera lists, but it's a generic question...
>>>>>>>>>>>
>>>>>>>>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>>>>>>>>>>> And please tell me you've already ordered more hardware... Right?
>>>>>>>>>>>
>>>>>>>>>>> And please tell me this isn't your production cluster...
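A rough sketch of the decommission step mentioned above, for reference; the exclude file is whatever dfs.hosts.exclude points to in your config, and the bandwidth value is just an example:

  1) Add the datanode's hostname to the exclude file referenced by dfs.hosts.exclude in the namenode's hdfs-site.xml.
  2) Tell the namenode to re-read it:

       hadoop dfsadmin -refreshNodes

  3) Wait for the node to show "Decommission Status : Decommissioned" in the namenode web UI or in:

       hadoop dfsadmin -report

  4) The "bandwidth setting" most likely refers to the balancer throttle, dfs.balance.bandwidthPerSec in hdfs-site.xml (bytes per second, 1048576 by default); raising it, e.g. to 10485760, makes each rebalance pass go noticeably faster.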
>>>>>>>>>>>
>>>>>>>>>>> (Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-)
>>>>>>>>>>>
>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>
>>>>>>>>>>> Mike Segel
>>>>>>>>>>>
>>>>>>>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes. This was first posted on the Cloudera mailing list. There were no responses.
>>>>>>>>>>>>
>>>>>>>>>>>> But this is not related to Cloudera as such.
>>>>>>>>>>>>
>>>>>>>>>>>> cdh3 is based on apache hadoop 0.20. My data is in apache hadoop 0.20.205.
>>>>>>>>>>>>
>>>>>>>>>>>> There is an upgrade namenode option when we are migrating to a higher version, say from 0.20 to 0.20.205,
>>>>>>>>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3).
>>>>>>>>>>>> Is this possible?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so I would not know much, but you might find some help by moving this to the Cloudera mailing list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is only one cluster. I am not copying between clusters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage capacity and about 8 TB of data.
>>>>>>>>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of free space.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can actually look at distcp:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> but this means that you have two different clusters available to do the migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the suggestions,
>>>>>>>>>>>>>>>> My concern is that I can't actually copyToLocal from the dfs because the data is huge.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade. I don't have to copy data out of dfs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But here I have Apache hadoop 0.20.205 and I want to use CDH3 now, which is based on 0.20.
>>>>>>>>>>>>>>>> Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
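To put numbers behind the "8 TB used, 2 TB free" constraint before picking an approach, the usual checks on the existing cluster would be something like this (exact output fields vary a little by version):

  # raw capacity, used and remaining space, per datanode and for the whole cluster
  hadoop dfsadmin -report

  # logical size of the data itself, i.e. before the 3x replication factor is applied
  hadoop fs -dus /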
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can think of the following options:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Write simple get and put code which gets the data from DFS and loads it into DFS.
>>>>>>>>>>>>>>>>> 2) See if distcp between the two versions is compatible.
>>>>>>>>>>>>>>>>> 3) This is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then, in the new grid, did a copyFromLocal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205.
>>>>>>>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
>>>>>>>>>>>>>>>>>> What are the best practices/techniques to do this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>>>> Austin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>
>>> --
>>> Nitin Pawar
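For completeness, Nitin's option 3 (staging through local disk) comes down to something like the following; the paths are placeholders, and it only works if the staging machine has enough local disk for whatever is being moved in that pass:

  # run against the old 0.20.205 cluster
  hadoop fs -copyToLocal /user/hadoop/mydata /mnt/staging/mydata

  # run against the new CDH3 cluster
  hadoop fs -copyFromLocal /mnt/staging/mydata /user/hadoop/mydata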