Hi Austin,

I'm glad that helped out. Regarding the -p flag for distcp, here's the online documentation:
http://hadoop.apache.org/common/docs/current/distcp.html#Option+Index

You can also get this info from running 'hadoop distcp' without any flags.

--------
-p[rbugp]   Preserve
            r: replication number
            b: block size
            u: user
            g: group
            p: permission
--------

-- Adam

On May 7, 2012, at 10:55 PM, Austin Chungath wrote:

> Thanks Adam,
>
> That was very helpful. Your second point solved my problem :-)
> The hdfs port number was wrong.
> I didn't use the option -ppgu; what does it do?
>
> On Mon, May 7, 2012 at 8:07 PM, Adam Faris <afa...@linkedin.com> wrote:
>
>> Hi Austin,
>>
>> I don't know about using CDH3, but we use distcp for moving data between different versions of Apache grids, and several things come to mind.
>>
>> 1) You should use the -i flag to ignore checksum differences on the blocks. I'm not 100% sure, but I want to say hftp doesn't support checksums on the blocks as they go across the wire.
>>
>> 2) You should read from hftp but write to hdfs. Also make sure to check your port numbers. For example, I can read from hftp on port 50070 and write to hdfs on port 9000. You'll find the hftp port in hdfs-site.xml and the hdfs port in core-site.xml on Apache releases.
>>
>> 3) Do you have security (Kerberos) enabled on 0.20.205? Does CDH3 support security? If security is enabled on 0.20.205 and CDH3 does not support it, you will need to disable security on 0.20.205, because you cannot write from a secure to an unsecured grid.
>>
>> 4) Use the -m flag to limit your mappers so you don't DDoS your network backbone.
>>
>> 5) Why isn't your vendor helping you with the data migration? :)
>>
>> Otherwise something like this should get you going:
>>
>> hadoop distcp -i -ppgu -log /tmp/mylog -m 20 hftp://mynamenode.grid.one:50070/path/to/my/src/data hdfs://mynamenode.grid.two:9000/path/to/my/dst
>>
>> -- Adam
>>
>> On May 7, 2012, at 4:29 AM, Nitin Pawar wrote:
>>
>>> Things to check:
>>>
>>> 1) When you launch the distcp jobs, all the datanodes of the older hdfs are live and connected.
>>> 2) When you launch distcp, no data is being written/moved/deleted in hdfs.
>>> 3) You can use the option -log to log errors into a directory, and use -i to ignore errors.
>>>
>>> Also, you can try using distcp with the hdfs protocol instead of hftp. For more you can refer to
>>> https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/d0d99ad9f1554edd
>>>
>>> If it failed there should be some error.
>>>
>>> On Mon, May 7, 2012 at 4:44 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>
>>>> Ok, that was a lame mistake.
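A note on where the port numbers used below come from: on a 0.20-era Apache release, the hftp port is the namenode's web/HTTP port set in hdfs-site.xml, and the hdfs:// port is the namenode's RPC port set in core-site.xml. The hostname and values here are only placeholders; check your own config files for the real ones.

In hdfs-site.xml (hftp reads go through this port):

  <property>
    <name>dfs.http.address</name>
    <value>namenode.example.com:50070</value>
  </property>

In core-site.xml (hdfs:// writes go through this port):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>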
>>>> $ hadoop distcp hftp://localhost:50070/tmp hftp://localhost:60070/tmp_copy
>>>> I had spelled hdfs instead of "hftp".
>>>>
>>>> $ hadoop distcp hftp://localhost:50070/docs/index.html hftp://localhost:60070/user/hadoop
>>>> 12/05/07 16:38:09 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/docs/index.html]
>>>> 12/05/07 16:38:09 INFO tools.DistCp: destPath=hftp://localhost:60070/user/hadoop
>>>> With failures, global counters are inaccurate; consider running with -i
>>>> Copy failed: java.io.IOException: Not supported
>>>> at org.apache.hadoop.hdfs.HftpFileSystem.delete(HftpFileSystem.java:457)
>>>> at org.apache.hadoop.tools.DistCp.fullyDelete(DistCp.java:963)
>>>> at org.apache.hadoop.tools.DistCp.copy(DistCp.java:672)
>>>> at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>> at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
>>>>
>>>> Any idea why this error is occurring?
>>>> I am copying one file from 0.20.205 (/docs/index.html) to cdh3u3 (/user/hadoop).
>>>>
>>>> Thanks & Regards,
>>>> Austin
>>>>
>>>> On Mon, May 7, 2012 at 3:57 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>>
>>>>> Thanks,
>>>>>
>>>>> So I decided to try and move the data using distcp.
>>>>>
>>>>> $ hadoop distcp hdfs://localhost:54310/tmp hdfs://localhost:8021/tmp_copy
>>>>> 12/05/07 14:57:38 INFO tools.DistCp: srcPaths=[hdfs://localhost:54310/tmp]
>>>>> 12/05/07 14:57:38 INFO tools.DistCp: destPath=hdfs://localhost:8021/tmp_copy
>>>>> With failures, global counters are inaccurate; consider running with -i
>>>>> Copy failed: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 63, server = 61)
>>>>>
>>>>> I found that we can do distcp like the above only if both sides are the same hadoop version, so I tried:
>>>>>
>>>>> $ hadoop distcp hftp://localhost:50070/tmp hdfs://localhost:60070/tmp_copy
>>>>> 12/05/07 15:02:44 INFO tools.DistCp: srcPaths=[hftp://localhost:50070/tmp]
>>>>> 12/05/07 15:02:44 INFO tools.DistCp: destPath=hdfs://localhost:60070/tmp_copy
>>>>>
>>>>> But this process seemed to hang at this stage. What might I be doing wrong?
>>>>>
>>>>> hftp://<dfs.http.address>/<path>
>>>>> hftp://localhost:50070 is dfs.http.address of 0.20.205
>>>>> hdfs://localhost:60070 is dfs.http.address of cdh3u3
>>>>>
>>>>> Thanks and regards,
>>>>> Austin
>>>>>
>>>>> On Fri, May 4, 2012 at 4:30 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> Ok... So riddle me this...
>>>>>> I currently have a replication factor of 3. I reset it to two.
>>>>>>
>>>>>> What do you have to do to get the replication factor of 3 down to 2? Do I just try to rebalance the nodes?
>>>>>>
>>>>>> The point is that you are looking at a very small cluster. You may want to start the new cluster with a replication factor of 2, and then when the data is moved over, increase it to a factor of 3. Or maybe not.
>>>>>>
>>>>>> I do a distcp to copy the data, and after each distcp I do an fsck for a sanity check and then remove the files I copied. As I gain more room, I can then slowly drop nodes, do an fsck, rebalance, and then repeat.
>>>>>>
>>>>>> Even though this is a dev cluster, the OP wants to retain the data.
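For concreteness, the reclaim-space loop described above (and the replication drop suggested further down in the thread) would look roughly like this on the 0.20.205 side. The path is a placeholder, and each step is worth sanity-checking on a small directory first:

  # drop replication to 2 on data that still lives only on the old cluster, to free headroom
  hadoop fs -setrep -R 2 /path/still/on/old/cluster

  # check block health before deleting anything that has already been copied
  hadoop fsck /

  # after removing copied data or decommissioning a node, spread the remaining blocks out again
  hadoop balancer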
>>>>>>
>>>>>> There are other options depending on the amount and size of the new hardware.
>>>>>> I mean, make one machine a RAID 5 machine and copy data to it, clearing off the cluster.
>>>>>>
>>>>>> If 8 TB was the amount of disk used, that would be about 2.67 TB of actual data. Let's say 3 TB. Going RAID 5, how much disk is that? So you could fit it on one machine, depending on hardware, or maybe 2 machines... Now you can rebuild the initial cluster and then move the data back. Then rebuild those machines. Lots of options... ;-)
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On May 3, 2012, at 11:26 AM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>>>>>
>>>>>>> This is probably a more relevant question for the CDH mailing lists. That said, what Edward is suggesting seems reasonable. Reduce the replication factor, decommission some of the nodes, create a new cluster with those nodes, and do a distcp.
>>>>>>>
>>>>>>> Could you share with us the reasons you want to migrate from Apache 205?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Suresh
>>>>>>>
>>>>>>> On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said, two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to RF=2 ('hadoop dfs -setrep 2') to clear headroom as you are moving stuff.
>>>>>>>>
>>>>>>>> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>> Ok... When you get your new hardware...
>>>>>>>>>
>>>>>>>>> Set up one server as your new NN, JT, SN.
>>>>>>>>> Set up the others as DNs.
>>>>>>>>> (Cloudera CDH3u3)
>>>>>>>>>
>>>>>>>>> On your existing cluster...
>>>>>>>>> Remove your old log files, temp files on HDFS, anything you don't need.
>>>>>>>>> This should give you some more space.
>>>>>>>>> Start copying some of the directories/files to the new cluster.
>>>>>>>>> As you gain space, decommission a node, rebalance, add the node to the new cluster...
>>>>>>>>>
>>>>>>>>> It's a slow process.
>>>>>>>>>
>>>>>>>>> Should I remind you to make sure you bump up your bandwidth setting, and to clean up the hdfs directories when you repurpose the nodes?
>>>>>>>>>
>>>>>>>>> Does this make sense?
>>>>>>>>>
>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>
>>>>>>>>> Mike Segel
>>>>>>>>>
>>>>>>>>> On May 3, 2012, at 5:46 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yeah I know :-)
>>>>>>>>>> And this is not a production cluster ;-) and yes there is more hardware coming :-)
>>>>>>>>>>
>>>>>>>>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Well, you've kind of painted yourself into a corner...
>>>>>>>>>>> Not sure why you didn't get a response from the Cloudera lists, but it's a generic question...
>>>>>>>>>>>
>>>>>>>>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
>>>>>>>>>>> And please tell me you've already ordered more hardware... Right?
>>>>>>>>>>>
>>>>>>>>>>> And please tell me this isn't your production cluster...
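A rough sketch of the decommission step mentioned above, for reference; the exclude file is whatever dfs.hosts.exclude points to in your config, and the bandwidth value is just an example:

  1) Add the datanode's hostname to the exclude file referenced by dfs.hosts.exclude in the namenode's hdfs-site.xml.
  2) Tell the namenode to re-read it:

       hadoop dfsadmin -refreshNodes

  3) Wait for the node to show "Decommission Status : Decommissioned" in the namenode web UI or in:

       hadoop dfsadmin -report

  4) The "bandwidth setting" most likely refers to the balancer throttle, dfs.balance.bandwidthPerSec in hdfs-site.xml (bytes per second, 1048576 by default); raising it, e.g. to 10485760, makes each rebalance pass go noticeably faster.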
>>>>>>>>>>>
>>>>>>>>>>> (Strong hint to Strata and Cloudera... You really want to accept my upcoming proposal talk... ;-)
>>>>>>>>>>>
>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>
>>>>>>>>>>> Mike Segel
>>>>>>>>>>>
>>>>>>>>>>> On May 3, 2012, at 5:25 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes. This was first posted on the Cloudera mailing list. There were no responses.
>>>>>>>>>>>>
>>>>>>>>>>>> But this is not related to Cloudera as such.
>>>>>>>>>>>>
>>>>>>>>>>>> cdh3 is based on apache hadoop 0.20. My data is in apache hadoop 0.20.205.
>>>>>>>>>>>>
>>>>>>>>>>>> There is an upgrade namenode option when we are migrating to a higher version, say from 0.20 to 0.20.205,
>>>>>>>>>>>> but here I am downgrading from 0.20.205 to 0.20 (cdh3).
>>>>>>>>>>>> Is this possible?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 3, 2012 at 3:25 PM, Prashant Kommireddi <prash1...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Seems like a matter of upgrade. I am not a Cloudera user so I would not know much, but you might find some help by moving this to the Cloudera mailing list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 3, 2012 at 2:51 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is only one cluster. I am not copying between clusters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Say I have a cluster running apache 0.20.205 with 10 TB storage capacity and about 8 TB of data.
>>>>>>>>>>>>>> Now how can I migrate the same cluster to use cdh3 and use that same 8 TB of data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can't copy 8 TB of data using distcp because I have only 2 TB of free space.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 3, 2012 at 3:12 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can actually look at distcp:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> but this means that you have two different clusters available to do the migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:51 PM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the suggestions,
>>>>>>>>>>>>>>>> My concern is that I can't actually copyToLocal from the dfs because the data is huge.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Say my hadoop was 0.20 and I am upgrading to 0.20.205: I can do a namenode upgrade. I don't have to copy data out of dfs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But here I have Apache hadoop 0.20.205 and I want to use CDH3 now, which is based on 0.20.
>>>>>>>>>>>>>>>> Now it is actually a downgrade, as 0.20.205's namenode info has to be used by 0.20's namenode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any idea how I can achieve what I am trying to do?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
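To put numbers behind the "8 TB used, 2 TB free" constraint before picking an approach, the usual checks on the existing cluster would be something like this (exact output fields vary a little by version):

  # raw capacity, used and remaining space, per datanode and for the whole cluster
  hadoop dfsadmin -report

  # logical size of the data itself, i.e. before the 3x replication factor is applied
  hadoop fs -dus /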
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can think of the following options:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Write simple get and put code which gets the data from DFS and loads it into DFS.
>>>>>>>>>>>>>>>>> 2) See if distcp between the two versions is compatible.
>>>>>>>>>>>>>>>>> 3) This is what I had done (and my data was hardly a few hundred GB): did a dfs -copyToLocal and then, in the new grid, did a copyFromLocal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, May 3, 2012 at 11:41 AM, Austin Chungath <austi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
>>>>>>>>>>>>>>>>>> I don't want to lose the data that is in the HDFS of Apache hadoop 0.20.205.
>>>>>>>>>>>>>>>>>> How do I migrate to CDH3u3 but keep the data that I have on 0.20.205?
>>>>>>>>>>>>>>>>>> What are the best practices/techniques to do this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>>>>> Austin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nitin Pawar
>>>
>>> --
>>> Nitin Pawar
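For completeness, Nitin's option 3 (staging through local disk) comes down to something like the following; the paths are placeholders, and it only works if the staging machine has enough local disk for whatever is being moved in that pass:

  # run against the old 0.20.205 cluster
  hadoop fs -copyToLocal /user/hadoop/mydata /mnt/staging/mydata

  # run against the new CDH3 cluster
  hadoop fs -copyFromLocal /mnt/staging/mydata /user/hadoop/mydata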