Protecting NN JT UI with password
Hi, I would like to know how to protect the Hadoop web UIs running on ports 50030 and 50070, as well as the HMaster UI on 60010, with a password. -- Thanks, Shah
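The thread does not include an answer; one common approach from that era, offered here as a sketch rather than a confirmed recipe, is to front each UI with a reverse proxy that enforces HTTP basic auth and to firewall the original ports. Assuming nginx and a password file created with htpasswd (host names are placeholders):

    # inside the http { } block of nginx.conf
    server {
        listen 8080;                                   # the port users actually visit
        location / {
            auth_basic "Hadoop NameNode UI";           # triggers the password prompt
            auth_basic_user_file /etc/nginx/htpasswd;  # users/passwords created with htpasswd
            proxy_pass http://namenode-host:50070;     # forward to the real UI
        }
    }

The same pattern, with different proxy_pass targets, would cover the JobTracker (50030) and HMaster (60010) UIs.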
Re: incremental loads into hadoop
This process of managing looks like more pain long term. Would it be easier to store in HBase, which has a smaller block size? What's the avg. file size? On Sun, Oct 2, 2011 at 7:34 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: Agree with Bejoy, although to minimize the processing latency you can still choose to write more frequently to HDFS, resulting in a larger number of smaller files on HDFS, rather than waiting to accumulate a large amount of data before writing to HDFS. As you may have a large number of smaller files, it may be good to use a combine file input format so as not to have a large number of very small map tasks (one per file if less than the block size). Now, after you process the input data, you may not want to leave this large number of small files on HDFS, and hence you can use the Hadoop Archive (HAR) tool to combine and store them in a small number of bigger files. You can run this tool periodically in the background to archive the input that is already processed. The archive tool itself is implemented as an M/R job. Also, to get some level of atomicity, you may copy the data to HDFS at a temporary location before moving it to the final source partition (or directory). Existing data loading tools may be doing that already. --Suhas Gogate On Sun, Oct 2, 2011 at 11:12 AM, bejoy.had...@gmail.com wrote: Sam, Your understanding is right; hadoop definitely works great with large volumes of data. But not necessarily every file should be in the range of giga, tera or peta bytes. Mostly, when it is said that hadoop processes terabytes of data, it is the total data processed by a map reduce job (rather jobs; most use cases use more than one map reduce job for processing). It can be 10K files that make up the whole data. Why not a large number of small files? The overhead on the name node of housekeeping all this metadata (file-block information) would be huge, and there are definite limits to it. But you can store smaller files together in splittable compressed formats. In general it is better to keep your file sizes at least equal to, or larger than, your hdfs block size. By default it is 64MB, but larger clusters use higher values, as multiples of 64. If your hdfs block size or your file sizes are less than the map reduce input split size, then it is better to use InputFormats like CombineFileInputFormat for MR jobs. Usually the MR input split size is equal to your hdfs block size. In short, as a better practice, your single file size should be at least equal to one hdfs block size. The approach of keeping a file open for a long write and then reading the same file in parallel with a map reduce job - I fear it wouldn't work; AFAIK it won't. When a write is going on, some blocks or the file itself would be locked; I'm not really sure whether it's the full file being locked or not. In short, some blocks wouldn't be available to the concurrent Map Reduce program during its processing. In your case a quick solution that comes to my mind is to keep writing your real-time data into the flume queue/buffer. Set it to a desired size; once the queue gets full, the data would be dumped into hdfs. Then as per your requirement you can kick off your jobs. If you are running MR jobs at very high frequency, make sure that for every run you have enough data to process, and choose your max number of mappers and reducers effectively and efficiently. Then as the last one: I don't think for normal cases you need to dump your large volume of data into lfs and then do a copyFromLocal into hdfs. Tools like flume are built for those purposes, I guess.
I'm not an expert on Flume; you may need to do more reading on it before implementing. This is what I feel about your use case, but let's leave it open for the experts to comment. Hope it helps. Regards Bejoy K S -Original Message- From: Sam Seigal selek...@yahoo.com Sender: saurabh@gmail.com Date: Sat, 1 Oct 2011 15:50:46 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: incremental loads into hadoop Hi Bejoy, Thanks for the response. While reading about Hadoop, I have come across threads where people claim that Hadoop is not a good fit for large numbers of small files. It is good for files that are gigabytes/petabytes in size. If I am doing incremental loads, let's say every hour, do I need to wait until maybe the end of the day, when enough data has been collected, to start off a MapReduce job? I am wondering if an open file that is continuously being written to can at the same time be used as an input to an M/R job ... Also, let's say I did not want to do a load straight off the DB. The service, when committing a transaction to the OLTP system, sends a message for that transaction to a Hadoop service that then writes the transaction into HDFS (the services are connected to each other via a persisted queue, hence are eventually consistent, but that is
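A minimal sketch of the Hadoop Archive step Suhas describes above, using the stock hadoop archive command; the paths are hypothetical and the exact flags vary a little between versions:

    # pack an already-processed input directory into one HAR file (runs as an M/R job)
    hadoop archive -archiveName oct02.har /data/incoming/2011-10-02 /data/archived
    # the archived files stay readable through the har: scheme
    hadoop fs -lsr har:///data/archived/oct02.har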
Fw: pointing mapred.local.dir to a ramdisk
Sending it to the hadoop mailing list - I think this is a hadoop-related problem and not related to the Cloudera distribution. Raj - Forwarded Message - From: Raj V rajv...@yahoo.com To: CDH Users cdh-u...@cloudera.org Sent: Friday, September 30, 2011 5:21 PM Subject: pointing mapred.local.dir to a ramdisk Hi all, I have been trying some experiments to improve performance. One of the experiments involved pointing mapred.local.dir to a RAM disk. To this end I created a 128MB RAM disk (each of my map outputs is smaller than this), but I have not been able to get the task tracker to start. I am running CDH3B3 (hadoop-0.20.2+737) and here is the error message from the task tracker log.
Tasktracker logs:
2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50060
2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50060 webServer.getConnectors()[0].getLocalPort() returned 50060
2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50060
2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
2011-09-30 16:50:02,388 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50060
2011-09-30 16:50:02,400 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker with owner as mapred
2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.NullPointerException
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
at org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
at org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
and here is my mapred-site.xml file:
<property>
  <name>mapred.local.dir</name>
  <value>/ramdisk1</value>
</property>
If I have a regular directory on a regular drive, such as below, it works. If I don't mount the ramdisk, it works.
<property>
  <name>mapred.local.dir</name>
  <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
</property>
The NullPointerException does not tell me what the error is or how to fix it. From the logs it looks like some disk-based operation failed; I can't guess. I must also confess that this is the first time I am using an ext2 file system. Any ideas? Raj
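For anyone reproducing the setup: the poster used an ext2 ramdisk, but a tmpfs mount is a common alternative. A sketch, reusing the mount point and the ownership mentioned later in this thread:

    sudo mkdir -p /ramdisk1
    sudo mount -t tmpfs -o size=128m tmpfs /ramdisk1   # 128MB RAM-backed filesystem
    sudo chown hdfs:hadoop /ramdisk1
    sudo chmod 775 /ramdisk1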
Re: pointing mapred.local.dir to a ramdisk
Are you sure you have chown'd/chmod'd the ramdisk directory to be writeable by your hadoop user? I have played with this in the past and it should basically work. On Oct 3, 2011, at 10:37 AM, Raj V wrote: Sending it to the hadoop mailing list - I think this is a hadoop-related problem and not related to the Cloudera distribution. Raj [...]
Re: pointing mapred.local.dir to a ramdisk
Eric, Yes. The owner is hdfs and the group is hadoop, and the directory is group writable (775). This is the exact same configuration I have when I use real disks. But let me give it a try again to see if I overlooked something. Thanks Raj From: Eric Caspole eric.casp...@amd.com To: common-user@hadoop.apache.org Sent: Monday, October 3, 2011 8:44 AM Subject: Re: pointing mapred.local.dir to a ramdisk Are you sure you have chown'd/chmod'd the ramdisk directory to be writeable by your hadoop user? I have played with this in the past and it should basically work. [...]
Re: pointing mapred.local.dir to a ramdisk
Must be related to some kind of permissions problem. It will help if you can paste the corresponding source code for FileUtil.copy(); it is hard to track it across different versions. Thanks, +Vinod On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote: Eric, Yes. The owner is hdfs and the group is hadoop, and the directory is group writable (775). This is the exact same configuration I have when I use real disks. [...]
Re: Help - can't start namenode after disk full error
Hi Ryan, I'm trying to recover from a disk full error on the namenode as well. I can fire up the namenode after printf \xff\xff\xff\xee\xff > /var/name/current/edits but now it's stuck in safe mode verifying blocks for hours... Is there a way to check progress on that? Or is there a way to speed that verify process up? thx
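A few stock HDFS commands that help in this situation (standard in 0.20-era Hadoop):

    hadoop dfsadmin -safemode get    # is the namenode still in safe mode?
    hadoop fsck /                    # summary of missing/under-replicated blocks
    hadoop dfsadmin -safemode leave  # force it out early - only once you trust the block reports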
Re: incremental loads into hadoop
There are two methods for processing OLTP data: 1. Streaming collectors such as HStreaming or Scribe - these are the main methods. 2. Otherwise, use Chukwa for storing the data, so that once you have accumulated a decent volume you can move it to HDFS. Thanks and Regards, S SYED ABDUL KATHER 9731841519 On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] ml-node+s472066n3383949...@n3.nabble.com wrote: Hi, I am relatively new to Hadoop and was wondering how to do incremental loads into HDFS. I have a continuous stream of data flowing into a service which is writing to an OLTP store. Due to the high volume of data, we cannot do aggregations on the OLTP store, since this starts affecting the write performance. We would like to offload this processing into a Hadoop cluster, mainly for doing aggregations/analytics. The question is how can this continuous stream of data be incrementally loaded and processed into Hadoop? Thank you, Sam - THANKS AND REGARDS, SYED ABDUL KATHER
lazy-loading of Reduce's input
Hi, My understanding is that when the reduce() method is called, the values (Iterable<VALUEIN> values) are stored in memory. 1/ Is that actually true? 2/ If this is true, is there a way to lazy-load the inputs to use less memory? (e.g. load all the items in batches of 20, and discard the previously fetched ones) The only related option that I could find is mapreduce.reduce.input.limit, but it doesn't do what I need. The problem I am trying to solve is that my input values are huge objects (serialized lucene indices using a custom Writable implementation), and loading them all at once seems to require way too much memory. Thank You, Sami Dalouche
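For reference, a sketch of the reduce signature in question. In the stock framework the Iterable streams values one at a time from the merged map output and reuses the value object, so only what your code retains stays resident; this is a general observation, not a guarantee for every version. IndexWritable below is a hypothetical stand-in for the custom Writable:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IndexCountReducer
            extends Reducer<Text, IndexWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IndexWritable> values,
                              Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IndexWritable v : values) { // one deserialized value at a time
                count++;                     // process v here; don't keep references -
            }                                // the framework reuses the object
            context.write(key, new IntWritable(count));
        }
    }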
Re: incremental loads into hadoop
I have given HBase a fair amount of thought, and I am looking for input. Instead of managing incremental loads myself, why not just set up an HBase cluster? What are some of the trade-offs? My primary use for this cluster would still be data analysis/aggregation and not so much random access. Random access would be a nice-to-have in case there are problems and we want to examine the data ad hoc. On Sat, Oct 1, 2011 at 12:31 PM, in.abdul in.ab...@gmail.com wrote: There are two methods for processing OLTP data [...]
Re: pointing mapred.local.dir to a ramdisk
This directory can get very large; in many cases I doubt it would fit on a RAM disk. Also, RAM disks tend to help most with random read/write; since hadoop does mostly linear IO, you may not see a great benefit from the RAM disk. On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Must be related to some kind of permissions problem. It will help if you can paste the corresponding source code for FileUtil.copy(). [...]
Re: pointing mapred.local.dir to a ramdisk
Vinod, Carefully checked everything again. The permissions are 775 and the owner is hdfs:hadoop. The task tracker creates a directory called toBeDeleted under /ramdisk, so things do not seem to be permission related. The task tracker starts happily if I don't mount the ramdisk and leave everything else the same. Raj From: Vinod Kumar Vavilapalli vino...@hortonworks.com To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com Sent: Monday, October 3, 2011 9:07 AM Subject: Re: pointing mapred.local.dir to a ramdisk Must be related to some kind of permissions problem. It will help if you can paste the corresponding source code for FileUtil.copy(). [...]
Re: pointing mapred.local.dir to a ramdisk
Edward, I understand the size limitations - but for my experiment the ramdisk I have created is large enough. I think there will be substantial benefits to putting the intermediate map outputs on a ramdisk - size permitting, of course - but I can't provide any numbers to substantiate my claim given that I can't get it to run. -best regards Raj From: Edward Capriolo edlinuxg...@gmail.com To: common-user@hadoop.apache.org Cc: Raj V rajv...@yahoo.com Sent: Monday, October 3, 2011 10:36 AM Subject: Re: pointing mapred.local.dir to a ramdisk This directory can get very large; in many cases I doubt it would fit on a RAM disk. Also, RAM disks tend to help most with random read/write; since hadoop does mostly linear IO, you may not see a great benefit from the RAM disk. [...]
Re: pointing mapred.local.dir to a ramdisk
Raj, I just tried this on my CDH3u1 VM, and the ramdisk worked the first time. So, it's possible you've hit a bug in CDH3b3 that was later fixed. Can you enable debug logging in log4j.properties and then repost your task tracker log? I think there might be more details that it will print that will be helpful. -Joey On Mon, Oct 3, 2011 at 2:18 PM, Raj V rajv...@yahoo.com wrote: Edward, I understand the size limitations - but for my experiment the ramdisk I have created is large enough. [...]
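A minimal sketch of the change Joey suggests, assuming a stock 0.20-era conf/log4j.properties; the logger names are taken from the classes in the stack trace:

    # conf/log4j.properties on the tasktracker node
    log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
    log4j.logger.org.apache.hadoop.util.MRAsyncDiskService=DEBUG

Restart the tasktracker afterwards so the new levels take effect.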
Re: lazy-loading of Reduce's input
Just to make sure I was clear enough: is there a parameter that allows setting the size of the batch of elements that are retrieved into memory while the reduce task iterates over the input values? Thanks, Sami Dalouche On Mon, Oct 3, 2011 at 1:42 PM, Sami Dalouche sa...@hopper.com wrote: Hi, My understanding is that when the reduce() method is called, the values (Iterable<VALUEIN> values) are stored in memory. [...]
Re: pointing mapred.local.dir to a ramdisk
Joey, Thanks. Will try and upgrade to a newer version and check. I will also change the logs to debug and see if more information is available. Raj From: Joey Echeverria j...@cloudera.com To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com Sent: Monday, October 3, 2011 11:49 AM Subject: Re: pointing mapred.local.dir to a ramdisk Raj, I just tried this on my CDH3u1 VM, and the ramdisk worked the first time. So, it's possible you've hit a bug in CDH3b3 that was later fixed. [...]
Monitoring Slow job.
Hi Hadoopers, I am writing a script to detect whether any running job has been running longer than X hours. So far, I use ./hadoop job -jt jobtracker:port -list all | awk '{ if($2==1) print $1 }' to get the list of running JobIDs. Now I am trying to find a way to get how long a job has been running from its JobID. In the web admin page (http://job:50030), we can see the duration of each job pretty easily from the "Started at:" and "Running for:" fields of each running job. How do we get that information from the command line? Hope this makes sense. Thank you in advance, -P
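A hedged sketch building on P's own command, assuming the 0.20-era hadoop job -list layout in which the third column is the job's StartTime in epoch milliseconds and state 1 means RUNNING; verify the column order on your version:

    #!/bin/sh
    # Flag jobs that have been running longer than MAX_HOURS.
    MAX_HOURS=4
    NOW_MS=$(( $(date +%s) * 1000 ))
    ./hadoop job -jt jobtracker:port -list | \
      awk -v now="$NOW_MS" -v max_ms="$(( MAX_HOURS * 3600 * 1000 ))" \
          '$2 == 1 && now - $3 > max_ms {
             printf "%s has been running for %.1f hours\n", $1, (now - $3) / 3600000
           }'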
Re: error for deploying hadoop on macbook pro
Harsh, thanks. It worked out; now I am able to start the cluster, but when I tried to see the job conf history at http://localhost:50030/jobconf_history.jsp I got the following message: Missing 'logFile' for fetching job configuration! On Sep 30, 2011, at 5:41 PM, Harsh J wrote: Since you're only just beginning, and have unknowingly issued multiple namenode -format commands, simply run the following and restart the DN alone: $ rm -r /private/tmp/hadoop-hadoop-user/dfs/data (And please do not reformat the namenode, lest you go out of namespace ID sync yet again -- you can instead `hadoop dfs -rmr /*` to rid yourself of all HDFS files) On Sat, Oct 1, 2011 at 2:13 AM, Jignesh Patel jign...@websoft.com wrote: Now I am able to make the task tracker and job tracker run, but I still have the following problem with the datanode: ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 798142055; datanode namespaceID = 964022125 On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote: I am trying to set up a single node cluster using hadoop-0.20.204.0 and while setting it up I found my job tracker and task tracker are not starting. I am attaching the exception. I also don't know why, while formatting the namenode, my IP address still doesn't show as 127.0.0.1, as follows: 11/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG: Starting NameNode STARTUP_MSG: host = Jignesh-MacBookPro.local/192.168.1.120 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.20.204.0 STARTUP_MSG: build = git://hrt8n35.cc1.ygridcore.net/ on branch branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011 hadoop-hadoop-user-tasktracker-Jignesh-MacBookPro.local.out hadoop-hadoop-user-jobtracker-Jignesh-MacBookPro.local.log -- Harsh J
Re: Monitoring Slow job.
I am not sure there is an easy way to get what you want on the command line. One option is to use the following command, which gives you a verbose job history in which you can find the Submit, Launch and Finish times (including the duration on the Finished At line). I am using the hadoop-0.20.205.0 branch, so check whether you have such an option for the version of hadoop you are using. I am pasting sample output for my wordcount program: bin/hadoop job -history job_output_directory_on_hdfs
==
horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
Warning: $HADOOP_HOME is deprecated.
Hadoop job: 0001_1317688277686_vgogate
=
Job tracker host name: job
job tracker start time: Sun May 16 08:53:51 PDT 1976
User: vgogate
JobName: word count
JobConf: hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
Submitted At: 3-Oct-2011 17:31:17
Launched At: 3-Oct-2011 17:31:17 (0sec)
Finished At: 3-Oct-2011 17:31:50 (32sec)
Status: SUCCESS
Counters:
|Group Name |Counter name |Map Value |Reduce Value |Total Value|
---
|Job Counters |Launched reduce tasks |0 |0 |1|
|Job Counters |SLOTS_MILLIS_MAPS |0 |0 |12,257|
|Job Counters |Total time spent by all reduces waiting after reserving slots (ms) |0 |0 |0|
|Job Counters |Total time spent by all maps waiting after reserving slots (ms) |0 |0 |0|
|Job Counters |Launched map tasks |0 |0 |1|
|Job Counters |Data-local map tasks |0 |0 |1|
|Job Counters |SLOTS_MILLIS_REDUCES |0 |0 |10,082|
|File Output Format Counters |Bytes Written |0 |61,192 |61,192|
|FileSystemCounters |FILE_BYTES_READ |0 |70,766 |70,766|
|FileSystemCounters |HDFS_BYTES_READ |112,056 |0 |112,056|
|FileSystemCounters |FILE_BYTES_WRITTEN |92,325 |92,294 |184,619|
|FileSystemCounters |HDFS_BYTES_WRITTEN |0 |61,192 |61,192|
|File Input Format Counters |Bytes Read |111,933 |0 |111,933|
|Map-Reduce Framework |Reduce input groups |0 |2,411 |2,411|
|Map-Reduce Framework |Map output materialized bytes |70,766 |0 |70,766|
|Map-Reduce Framework |Combine output records |2,411 |0 |2,411|
|Map-Reduce Framework |Map input records |2,643 |0 |2,643|
|Map-Reduce Framework |Reduce shuffle bytes |0 |0 |0|
|Map-Reduce Framework |Reduce output records |0 |2,411 |2,411|
|Map-Reduce Framework |Spilled Records |2,411 |2,411 |4,822|
|Map-Reduce Framework |Map output bytes |120,995 |0 |120,995|
|Map-Reduce Framework |Combine input records |5,849 |0 |5,849|
|Map-Reduce Framework |Map output records |5,849 |0 |5,849|
|Map-Reduce Framework |SPLIT_RAW_BYTES |123 |0 |123|
|Map-Reduce Framework |Reduce input records |0 |2,411 |2,411|
=
Task Summary
Kind     Total  Successful  Failed  Killed  StartTime            FinishTime
Setup    1      1           0       0       3-Oct-2011 17:31:20  3-Oct-2011 17:31:24 (4sec)
Map      1      1           0       0       3-Oct-2011 17:31:26  3-Oct-2011 17:31:30 (4sec)
Reduce   1      1           0       0       3-Oct-2011 17:31:32  3-Oct-2011 17:31:42 (10sec)
Cleanup  1      1           0       0       3-Oct-2011 17:31:44  3-Oct-2011 17:31:48 (4sec)
Analysis
=
Time taken by best performing map task task_201110031726_0001_m_00: 4sec
Average time taken by map tasks: 4sec
Worse performing map tasks: TaskId Timetaken
task_201110031726_0001_m_00 4sec
The last map task task_201110031726_0001_m_00 finished at (relative to the Job launch time): 3-Oct-2011 17:31:30 (12sec)
Time taken by best performing shuffle task task_201110031726_0001_r_00: 7sec
Average time taken by shuffle tasks: 7sec
Worse performing shuffle tasks: TaskId Timetaken
task_201110031726_0001_r_00 7sec
The last shuffle task task_201110031726_0001_r_00 finished at (relative to the Job launch time): 3-Oct-2011 17:31:39 (21sec)
Time taken by best performing reduce task task_201110031726_0001_r_00: 2sec
Average time taken by reduce tasks: 2sec
Worse performing reduce tasks: TaskId Timetaken
task_201110031726_0001_r_00 2sec
The last reduce task task_201110031726_0001_r_00 finished at (relative to the Job launch time): 3-Oct-2011 17:31:42
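Given that output, the times P was after can be pulled out directly; a small sketch (the output directory is whatever the job wrote to):

    bin/hadoop job -history output | egrep 'Submitted At|Launched At|Finished At'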
Re: error for deploying hadoop on macbook pro
The steps in the following document worked for me, except: -- JAVA_HOME needs to be set correctly in conf/hadoop-env.sh -- By default on Mac OS X, sshd is not running, so you need to start it using System Preferences/Sharing and add the users who are allowed to ssh. http://www.stanford.edu/class/cs246/cs246-11-mmds/hw_files/hadoop_install.pdf --Suhas On Mon, Oct 3, 2011 at 2:21 PM, Jignesh Patel jign...@websoft.com wrote: Harsh, thanks. It worked out; now I am able to start the cluster, but when I tried to see the job conf history at http://localhost:50030/jobconf_history.jsp I got the following message: Missing 'logFile' for fetching job configuration! [...]
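A sketch of the JAVA_HOME change Suhas mentions, assuming an OS X machine with the standard java_home helper:

    # conf/hadoop-env.sh
    export JAVA_HOME=$(/usr/libexec/java_home)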
Re: error for deploying hadoop on macbook pro
Sorry, a few more things: -- localhost did not work for me; I had to use my machine name as returned by hostname, e.g. horton-mac.local -- Also change localhost to your machine name in the conf/slaves and conf/masters files. --Suhas On Mon, Oct 3, 2011 at 5:44 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: The steps in the following document worked for me, except: -- JAVA_HOME needs to be set correctly in conf/hadoop-env.sh -- By default on Mac OS X, sshd is not running, so you need to start it using System Preferences/Sharing and add the users who are allowed to ssh. http://www.stanford.edu/class/cs246/cs246-11-mmds/hw_files/hadoop_install.pdf --Suhas [...]
Adjusting column value size.
Hi, I have a question regarding performance and column value size. I need to store several million integers per row. (The "several million" is important here.) I was wondering which method would be more beneficial performance-wise. 1) Store each integer in its own column, so that when a row is fetched, several million columns are fetched with it, and the user maps each column value into some kind of container (e.g. vector, ArrayList). 2) Store, for example, a thousand integers in a single column (by concatenating them), so that when a row is fetched, only several thousand columns come along; the user would have to split each column value into 4-byte chunks and map the resulting integers into some kind of container (e.g. vector, ArrayList). I am curious which approach would be better: 1) fetches several million columns but needs no additional processing, while 2) fetches only several thousand columns but needs additional processing. Any advice would be appreciated. Ed
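A sketch of the packing/unpacking in option 2, in plain Java; the class and method names are illustrative, not from the thread:

    import java.nio.ByteBuffer;

    public final class IntPacker {
        // Concatenate a batch of ints into one cell value (4 bytes each).
        static byte[] pack(int[] batch) {
            ByteBuffer buf = ByteBuffer.allocate(4 * batch.length);
            for (int v : batch) buf.putInt(v);
            return buf.array();
        }
        // Split a cell value back into ints on the read path.
        static int[] unpack(byte[] cellValue) {
            ByteBuffer buf = ByteBuffer.wrap(cellValue);
            int[] out = new int[cellValue.length / 4];
            for (int i = 0; i < out.length; i++) out[i] = buf.getInt();
            return out;
        }
    }

Option 2 trades a little client-side CPU for far fewer cells per row, which usually reduces per-cell storage and RPC overhead; whether it wins depends on your access patterns, so benchmark both.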