Re: migrate cluster to different datacenter
It would help to know your data ingest and processing patterns (and any applicable SLAs). In most cases, you'd only need to move the raw ingested data; the rest can be derived in the other cluster. Assuming that you have some sort of date-based partitioning on the ingest, it's easy to define a cut-off point. Depending on your read SLAs, you could tee writes to both clusters for a period of time, or simply switch over to the new one once the majority of the data has been moved. Finally, you would want to do a consistency check to make sure everything made it to the other side... maybe run a checksum on the derived data on both clusters and compare. Something like that...

- P

On Fri, Aug 3, 2012 at 5:19 PM, Patai Sangbutsarakum <silvianhad...@gmail.com> wrote:

Thanks for the response. A physical move is not a choice in this case. I am purely looking at copying the data, and at how to catch up with updates to a file while it is being migrated.

On Fri, Aug 3, 2012 at 12:40 PM, Chen He <airb...@gmail.com> wrote:

Sometimes, physically moving the hard drives helps. :)

On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum <silvianhad...@gmail.com> wrote:

Hi Hadoopers,

We have a plan to migrate our Hadoop cluster to a different datacenter, where we can triple the size of the cluster. Currently, our 0.20.2 cluster has around 1PB of data. We use only Java/Pig. I would like to get some input on how we are going to handle transferring 1PB of data to the new site, and also on how to keep up with the new files that are thrown into the cluster all the time.

Happy Friday!!
P
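[Editor's note: a minimal sketch of the consistency check suggested above, assuming both namenodes are reachable from one client; the namenode URIs and the file path are hypothetical. HDFS's getFileChecksum() results are only comparable when both clusters use the same block size and bytes-per-checksum settings.]

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CrossClusterChecksum {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode URIs; substitute your own.
            FileSystem oldFs = FileSystem.get(URI.create("hdfs://old-nn:8020/"), conf);
            FileSystem newFs = FileSystem.get(URI.create("hdfs://new-nn:8020/"), conf);
            Path p = new Path("/data/derived/part-00000"); // hypothetical file

            // On HDFS this is an MD5-of-MD5s-of-block-CRCs checksum, computed
            // server-side without streaming the file back through the client.
            FileChecksum a = oldFs.getFileChecksum(p);
            FileChecksum b = newFs.getFileChecksum(p);
            // May be null on filesystems that don't support checksums.
            boolean same = (a != null && a.equals(b));
            System.out.println(p + (same ? ": matches" : ": DIFFERS"));
        }
    }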
Re: migrate cluster to different datacenter
The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem.

On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani <bumble@gmail.com> wrote:

Given the size of the data, there can be several approaches here:

1. Moving the boxes. Not possible, as I suppose the data must be needed for 24x7 analytics.

2. Mirroring the data. This is a good solution. However, if you have data being written/removed continuously (as part of a live system), there is a chance of losing some of the data while the mirroring happens, unless a) you block writes/updates during that time (if you do so, that would be as good as unplugging and moving the machines around), or b) you keep track of what was modified after you started the mirroring process.

I would recommend going with 2b) because it minimizes downtime. Here is how I think you can do it, using some of the tools provided by Hadoop itself:

a) Use some fast distributed copying tool to copy large chunks of data. Before you kick this off, create a utility that tracks the modifications made to your live system while the copying goes on in the background. The utility logs the modifications to an audit trail.

b) Once you're done copying the files, let the new data store catch up by replaying the real-time modifications recorded in your utility's log file. Once it is synced up, you can begin the (minimal) downtime by switching off the JobTracker in the live cluster so that no new files are created.

c) As soon as you reach the last chunk of copying, change the DNS entries so that the hostnames referenced by the Hadoop jobs point to the new location.

d) Turn on the JobTracker for the new cluster.

e) Enjoy a drink with the money you saved by not using a paid third-party solution, and pat yourself on the back! ;)

The key to the above solution is to make the data copying of step a) as fast as possible. The less time it takes, the less content ends up in the audit trail, and the shorter the overall downtime. You can develop an in-house solution for this, or use DistCp, provided by Hadoop, which copies the data using Map/Reduce.

On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel <michael_se...@hotmail.com> wrote:

Sorry, at 1PB of disk... compression isn't really going to help a whole heck of a lot. Your network bandwidth will be your bottleneck. So let's look at the problem. How much downtime can you afford? What does your hardware look like? How much space do you have in your current data center? You have 1PB of data. OK, what does the access pattern look like? There are a couple of ways to slice and dice this. How many trucks do you have?

On Aug 3, 2012, at 4:24 PM, Harit Himanshu <harit.subscripti...@gmail.com> wrote:

Moving 1 PB of data would take loads of time.
- Check whether the new data center provides something similar to http://aws.amazon.com/importexport/
- Consider multi-part uploading of the data
- Consider compressing the data

On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:

Thanks for the response. A physical move is not a choice in this case. I am purely looking at copying the data, and at how to catch up with updates to a file while it is being migrated.

On Fri, Aug 3, 2012 at 12:40 PM, Chen He <airb...@gmail.com> wrote:

Sometimes, physically moving the hard drives helps. :)

On Aug 3, 2012 1:50 PM, Patai Sangbutsarakum <silvianhad...@gmail.com> wrote:

Hi Hadoopers,

We have a plan to migrate our Hadoop cluster to a different datacenter, where we can triple the size of the cluster. Currently, our 0.20.2 cluster has around 1PB of data. We use only Java/Pig. I would like to get some input on how we are going to handle transferring 1PB of data to the new site, and also on how to keep up with the new files that are thrown into the cluster all the time.

Happy Friday!!
P
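[Editor's note: to make steps a) and b) above concrete, here is a rough sketch of the catch-up pass, not the utility Nitin describes. After a bulk copy (e.g. hadoop distcp hdfs://old-nn:8020/data hdfs://new-nn:8020/data), it walks the source tree and lists anything modified since the bulk copy started, using FileStatus.getModificationTime(); it ignores deletes and renames, which a real audit trail would also have to cover.]

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CatchUpLister {
        /** Recursively collect files modified at or after the given cutoff. */
        static void modifiedSince(FileSystem fs, Path dir, long cutoffMillis,
                                  List<Path> out) throws Exception {
            for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                    modifiedSince(fs, stat.getPath(), cutoffMillis, out);
                } else if (stat.getModificationTime() >= cutoffMillis) {
                    out.add(stat.getPath());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Path root = new Path(args[0]);                 // source directory
            long bulkCopyStart = Long.parseLong(args[1]);  // epoch millis of the bulk copy kickoff
            FileSystem fs = FileSystem.get(new Configuration());
            List<Path> toRecopy = new ArrayList<Path>();
            modifiedSince(fs, root, bulkCopyStart, toRecopy);
            for (Path p : toRecopy) {
                System.out.println(p); // feed this list to a second, much smaller copy run
            }
        }
    }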
Re: Basic Question
Each write call registers (writes) a KV pair to the output. The output collector does not look for similarities, nor does it try to de-dupe them, and even if the object is the same, its value is copied out, so that doesn't matter. So you will get two KV pairs in your output, since duplication is allowed and is normal in several MR cases. Think of wordcount, where a map() call may emit lots of ("is", 1) pairs if there are multiple occurrences of "is" in the line it processes, and it can use set() calls to its benefit to avoid creating too many objects.

On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

In my Mapper I often use a global Text object, and throughout the map processing I just call set on it. My question is: what happens if the collector receives an identical byte array value? Does the last one overwrite the value in the collector? So if I did

    Text zip = new Text();
    zip.set("9099");
    collector.write(zip, value);
    zip.set("9099");
    collector.write(zip, value1);

should I expect to receive both values in the reducer, or just one?

--
Harsh J
Re: Basic Question
On Tue, Aug 7, 2012 at 11:33 AM, Harsh J <ha...@cloudera.com> wrote:

Each write call registers (writes) a KV pair to the output. The output collector does not look for similarities, nor does it try to de-dupe them, and even if the object is the same, its value is copied out, so that doesn't matter. So you will get two KV pairs in your output, since duplication is allowed and is normal in several MR cases. Think of wordcount, where a map() call may emit lots of ("is", 1) pairs if there are multiple occurrences of "is" in the line it processes, and it can use set() calls to its benefit to avoid creating too many objects.

Thanks!

On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

In my Mapper I often use a global Text object, and throughout the map processing I just call set on it. My question is: what happens if the collector receives an identical byte array value? Does the last one overwrite the value in the collector? So if I did

    Text zip = new Text();
    zip.set("9099");
    collector.write(zip, value);
    zip.set("9099");
    collector.write(zip, value1);

should I expect to receive both values in the reducer, or just one?

--
Harsh J
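[Editor's note: for reference, a minimal old-API mapper along the lines Harsh describes; this is a sketch, not code from the thread. The Text and IntWritable instances are reused across collect() calls, and each collect() still produces its own KV pair because the framework serializes the bytes immediately.]

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        // Reused across all map() calls; safe because collect() copies the bytes out.
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                // Emitting ("is", 1) twice yields two pairs in the map output,
                // not one; nothing is de-duped or overwritten.
                output.collect(word, one);
            }
        }
    }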
Setting Configuration for local file:///
I am trying to write a test against the local file system, but the test keeps picking up the xml files on the classpath even though I am setting a different Configuration object. Is there a way for me to override them? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

    @Test
    public void testOnLocalFS() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        Path input = new Path("geoinput/geo.dat");
        Path output = new Path("geooutput/");
        FileSystem fs = FileSystem.getLocal(conf);
        fs.delete(output, true);
        log.info("Here");
        GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
        configRunner.setConf(conf);
        int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
        Assert.assertEquals(exitCode, 0);
    }
Re: Setting Configuration for local file:///
What is GeoLookupConfigRunner, and how do you utilize the setConf(conf) object within it?

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I am trying to write a test against the local file system, but the test keeps picking up the xml files on the classpath even though I am setting a different Configuration object. Is there a way for me to override them? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

    @Test
    public void testOnLocalFS() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        Path input = new Path("geoinput/geo.dat");
        Path output = new Path("geooutput/");
        FileSystem fs = FileSystem.getLocal(conf);
        fs.delete(output, true);
        log.info("Here");
        GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
        configRunner.setConf(conf);
        int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
        Assert.assertEquals(exitCode, 0);
    }

--
Harsh J
Re: Setting Configuration for local file:///
On Tue, Aug 7, 2012 at 12:50 PM, Harsh J <ha...@cloudera.com> wrote:

What is GeoLookupConfigRunner, and how do you utilize the setConf(conf) object within it?

Thanks for the pointer; I wasn't setting my JobConf object with the conf that I passed. Just one more related question: if I use JobConf conf = new JobConf(getConf()) and I don't pass in any configuration, is the data from the xml files on the classpath used then? I want this to work in all the scenarios.

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I am trying to write a test against the local file system, but the test keeps picking up the xml files on the classpath even though I am setting a different Configuration object. Is there a way for me to override them? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

    @Test
    public void testOnLocalFS() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        Path input = new Path("geoinput/geo.dat");
        Path output = new Path("geooutput/");
        FileSystem fs = FileSystem.getLocal(conf);
        fs.delete(output, true);
        log.info("Here");
        GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
        configRunner.setConf(conf);
        int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
        Assert.assertEquals(exitCode, 0);
    }

--
Harsh J
Local jobtracker in test env?
I just wrote a test where fs.default.name is file:/// and mapred.job.tracker is set to local. The test ran fine, and I can see that the mapper and the reducer were invoked, but what I am trying to understand is how this ran without my specifying the jobtracker port, and which port the tasktracker used to connect to the jobtracker. It's not clear from the output. Also, what's the difference between this and bringing up a miniDFS cluster?

INFO org.apache.hadoop.mapred.FileInputFormat [main]: Total input paths to process : 1
INFO org.apache.hadoop.mapred.JobClient [main]: Running job: job_local_0001
INFO org.apache.hadoop.mapred.Task [Thread-11]: Using ResourceCalculatorPlugin : null
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: numReduceTasks: 1
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: io.sort.mb = 100
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: data buffer = 79691776/99614720
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: record buffer = 262144/327680
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: Starting flush of map output
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: Finished spill 0
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geoinput/geo.dat:0+18
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_m_000000_0' done.
INFO org.apache.hadoop.mapred.Task [Thread-11]: Using ResourceCalculatorPlugin : null
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO org.apache.hadoop.mapred.Merger [Thread-11]: Merging 1 sorted segments
INFO org.apache.hadoop.mapred.Merger [Thread-11]: Down to the last merge-pass, with 1 segments left of total size: 26 bytes
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Inside reduce
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Outside reduce
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task attempt_local_0001_r_000000_0 is allowed to commit now
INFO org.apache.hadoop.mapred.FileOutputCommitter [Thread-11]: Saved output of task 'attempt_local_0001_r_000000_0' to file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geooutput
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: reduce reduce
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_r_000000_0' done.
INFO org.apache.hadoop.mapred.JobClient [main]: map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient [main]: Job complete: job_local_0001
INFO org.apache.hadoop.mapred.JobClient [main]: Counters: 15
INFO org.apache.hadoop.mapred.JobClient [main]: FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_READ=458
INFO org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_WRITTEN=96110
INFO org.apache.hadoop.mapred.JobClient [main]: Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient [main]: Map input records=2
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce shuffle bytes=0
INFO org.apache.hadoop.mapred.JobClient [main]: Spilled Records=4
INFO org.apache.hadoop.mapred.JobClient [main]: Map output bytes=20
INFO org.apache.hadoop.mapred.JobClient [main]: Total committed heap usage (bytes)=321527808
INFO org.apache.hadoop.mapred.JobClient [main]: Map input bytes=18
INFO org.apache.hadoop.mapred.JobClient [main]: SPLIT_RAW_BYTES=142
INFO org.apache.hadoop.mapred.JobClient [main]: Combine input records=0
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce input records=2
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce input groups=1
INFO org.apache.hadoop.mapred.JobClient [main]: Combine output records=0
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce output records=1
INFO org.apache.hadoop.mapred.JobClient [main]: Map output records=2
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Inside reduce
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Outside reduce

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.547 sec

Results :

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
Re: Setting Configuration for local file:///
If you instantiate the JobConf with your existing conf object, then you needn't have that fear.

On Wed, Aug 8, 2012 at 1:40 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

On Tue, Aug 7, 2012 at 12:50 PM, Harsh J <ha...@cloudera.com> wrote:

What is GeoLookupConfigRunner, and how do you utilize the setConf(conf) object within it?

Thanks for the pointer; I wasn't setting my JobConf object with the conf that I passed. Just one more related question: if I use JobConf conf = new JobConf(getConf()) and I don't pass in any configuration, is the data from the xml files on the classpath used then? I want this to work in all the scenarios.

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I am trying to write a test against the local file system, but the test keeps picking up the xml files on the classpath even though I am setting a different Configuration object. Is there a way for me to override them? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

    @Test
    public void testOnLocalFS() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        Path input = new Path("geoinput/geo.dat");
        Path output = new Path("geooutput/");
        FileSystem fs = FileSystem.getLocal(conf);
        fs.delete(output, true);
        log.info("Here");
        GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
        configRunner.setConf(conf);
        int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
        Assert.assertEquals(exitCode, 0);
    }

--
Harsh J
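[Editor's note: a sketch of what the runner can look like so both cases work; GeoLookupConfigRunner's real contents aren't shown in this thread, so the class below is illustrative. Extending Configured means setConf() stores the test's Configuration, and building the JobConf from getConf() carries it into the job; a plain new JobConf() would instead load the xml files from the classpath.]

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GeoLookupConfigRunner extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // getConf() returns whatever was passed via setConf(conf) in the test,
            // or the Configuration that ToolRunner built from the classpath xmls.
            JobConf job = new JobConf(getConf(), GeoLookupConfigRunner.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper/reducer classes here ...
            return JobClient.runJob(job).isSuccessful() ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new GeoLookupConfigRunner(), args));
        }
    }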
Re: Local jobtracker in test env?
It used the local mode of operation: org.apache.hadoop.mapred.LocalJobRunner. A JobTracker (via MiniMRCluster) is only required for simulating distributed tests.

On Wed, Aug 8, 2012 at 2:27 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

I just wrote a test where fs.default.name is file:/// and mapred.job.tracker is set to local. The test ran fine, and I can see that the mapper and the reducer were invoked, but what I am trying to understand is how this ran without my specifying the jobtracker port, and which port the tasktracker used to connect to the jobtracker. It's not clear from the output. Also, what's the difference between this and bringing up a miniDFS cluster?

[... quoted log output snipped; see the previous message for the full log ...]
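[Editor's note: for contrast, a rough sketch of the mini-cluster route, assuming the Hadoop test jar of that era is on the classpath; the constructor signatures below are from the 1.x test code. MiniDFSCluster and MiniMRCluster spin up a real namenode, datanode, jobtracker and tasktracker inside the JVM, RPC ports and all, whereas local mode runs everything in-process through LocalJobRunner.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.MiniDFSCluster;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MiniMRCluster;

    public class MiniClusterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // One namenode + one datanode, formatted on startup.
            MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
            FileSystem fs = dfs.getFileSystem();
            // One tasktracker talking to a real (in-JVM) jobtracker over RPC.
            MiniMRCluster mr = new MiniMRCluster(1, fs.getUri().toString(), 1);
            try {
                JobConf job = mr.createJobConf(); // points at the mini jobtracker's port
                // ... configure and submit the job against 'fs' here ...
            } finally {
                mr.shutdown();
                dfs.shutdown();
            }
        }
    }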
Re: [ANNOUNCE] - New user@ mailing list for hadoop users in-lieu of (common,hdfs,mapreduce)-user@
Apologies (again) for the cross-post. I've filed https://issues.apache.org/jira/browse/INFRA-5123 to close down (common, hdfs, mapreduce)-user@ since user@ is functional now.

thanks,
Arun

On Aug 4, 2012, at 9:59 PM, Arun C Murthy wrote:

All,

Given our recent discussion (http://s.apache.org/hv), the new u...@hadoop.apache.org mailing list has been created, and all existing users in (common,hdfs,mapreduce)-user@ have been migrated over. I'm in the process of changing the website to reflect this (HADOOP-8652). Henceforth, please use the new mailing list for all user-related discussions.

thanks,
Arun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/