Re: hadoop MapReduce and stop words

2009-05-16 Thread Stefan Will
Just use a java.util.HashSet for this. There should only be a few dozen
stopwords, so load them into a HashSet when the Mapper starts up, and then
check your tokens against it while you're processing records.
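For illustration, a minimal sketch of that approach against the old mapred API
(the StopWordMapper class, the local stopwords.txt path and the whitespace
tokenization are made-up placeholders, not anything from this thread; in a real
job the list would more likely be shipped via the DistributedCache):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class StopWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private final Set<String> stopWords = new HashSet<String>();
  private static final LongWritable ONE = new LongWritable(1);

  // Load the stopword list once per mapper; a few dozen entries is cheap.
  public void configure(JobConf job) {
    try {
      BufferedReader in = new BufferedReader(new FileReader("stopwords.txt"));
      String line;
      while ((line = in.readLine()) != null) {
        stopWords.add(line.trim().toLowerCase());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load stopword list", e);
    }
  }

  // Emit (token, 1) for every token that is not a stopword.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().toLowerCase().split("\\s+")) {
      if (token.length() > 0 && !stopWords.contains(token)) {
        output.collect(new Text(token), ONE);
      }
    }
  }
}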

-- Stefan


> From: tim robertson 
> Reply-To: 
> Date: Sat, 16 May 2009 15:48:23 +0200
> To: 
> Subject: Re: hadoop MapReduce and stop words
> 
> Try and google "binary tree java" and you will get loads of hits...
> 
> This is a simple implementation, but I am sure there are better ones
> that handle balancing.
> 
> Cheers
> Tim
> 
> 
> public class BinaryTree {
> 
>     public static void main(String[] args) {
>         BinaryTree bt = new BinaryTree();
> 
>         // insert the strings "0" .. "999"
>         for (int i = 0; i < 1000; i++) {
>             bt.insert("" + i);
>         }
> 
>         System.out.println(bt.lookup("999"));
>         System.out.println(bt.lookup("100"));
>         System.out.println(bt.lookup("a")); // should be null
>     }
> 
>     private Node root;
> 
>     private static class Node {
>         Node left;
>         Node right;
>         String value;
> 
>         public Node(String value) {
>             this.value = value;
>         }
>     }
> 
>     public String lookup(String key) {
>         return lookup(root, key);
>     }
> 
>     // recursive lookup: descend left or right based on string comparison
>     private String lookup(Node node, String value) {
>         if (node == null) {
>             return null;
>         }
> 
>         if (value.equals(node.value)) {
>             return node.value;
>         } else if (value.compareTo(node.value) < 0) {
>             return lookup(node.left, value);
>         } else {
>             return lookup(node.right, value);
>         }
>     }
> 
>     public void insert(String value) {
>         root = insert(root, value);
>     }
> 
>     // unbalanced insert: values <= the node go left, larger values go right
>     private Node insert(Node node, String value) {
>         if (node == null) {
>             node = new Node(value);
>         } else {
>             if (value.compareTo(node.value) <= 0) {
>                 node.left = insert(node.left, value);
>             } else {
>                 node.right = insert(node.right, value);
>             }
>         }
> 
>         return node;
>     }
> }




Re: How to do load control of MapReduce

2009-05-12 Thread Stefan Will
Yes, I think the JVM uses way more memory than just its heap. Now some of it
might be just reserved memory that is not actually used (not sure how to tell
the difference). There are also things like thread stacks, the JIT compiler's
code cache, direct NIO byte buffers, etc. that take up process space outside of
the Java heap. But none of that should, IMHO, add up to gigabytes...
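If it helps, the JVM's own view of heap vs. non-heap usage and the live thread
count can be read via the standard java.lang.management beans; this is just a
generic sketch for poking at the numbers, not something from the original
thread:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmFootprint {
  public static void main(String[] args) {
    MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
    MemoryUsage heap = mem.getHeapMemoryUsage();
    MemoryUsage nonHeap = mem.getNonHeapMemoryUsage(); // perm gen, code cache, ...
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    System.out.printf("heap used/committed:     %dM / %dM%n",
        heap.getUsed() >> 20, heap.getCommitted() >> 20);
    System.out.printf("non-heap used/committed: %dM / %dM%n",
        nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
    // Each live thread also carries a native stack (roughly 1MB by default on
    // 64-bit JVMs), which shows up in the process size but in neither number above.
    System.out.printf("live threads:            %d%n", threads.getThreadCount());
  }
}

Direct NIO buffers and memory the allocator has reserved but not returned to the
OS still won't show up here, which is part of why these numbers and pmap/top
rarely agree.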

-- Stefan 


> From: zsongbo 
> Reply-To: 
> Date: Tue, 12 May 2009 20:06:37 +0800
> To: 
> Subject: Re: How to do load control of MapReduce
> 
> Yes, I also found that the TaskTracker should not use so much memory.
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 
> 
> 32480 schubert  35  10 1411m 172m 9212 S0  2.2   8:54.78 java
> 
> The previous 1GB is the default value; I just changed the heap of the TT to
> 384MB an hour ago.
> 
> I also found that the DataNode does not need much memory either.
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 
> 32399 schubert  25   0 1638m 372m 9208 S2  4.7  32:46.28 java
> 
> 
> In fact, I define -Xmx512m in the child opts for MapReduce tasks. But I found
> the child tasks use more memory than that setting:
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 
> 
> 10577 schubert  30  10  942m 572m 9092 S   46  7.2  51:02.21 java
> 
> 10507 schubert  29  10  878m 570m 9092 S   48  7.1  50:49.52 java
> 
> Schubert
> 
> On Tue, May 12, 2009 at 6:53 PM, Steve Loughran  wrote:
> 
>> zsongbo wrote:
>> 
>>> Hi Stefan,
>>> Yes, the 'nice' cannot resolve this problem.
>>> 
>>> Now, in my cluster, there are 8GB of RAM. My java heap configuration is:
>>> 
>>> HDFS DataNode : 1GB
>>> HBase-RegionServer: 1.5GB
>>> MR-TaskTracker: 1GB
>>> MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)
>>> 
>>> But the memory usage is still tight.
>>> 
>> 
>> does TT need to be so big if you are running all your work in external VMs?
>> 




Re: Huge DataNode Virtual Memory Usage

2009-05-11 Thread Stefan Will
Raghu,

I don't actually have exact numbers from jmap, although I do remember that
jmap -histo reported something less than 256MB for this process (before I
restarted it).

I just looked at another DFS process that is currently running and has a VM
size of 1.5GB (~600MB resident). Here jmap reports a total object heap usage
of 120MB. The memory block list reported by jmap doesn't actually seem
to contain the heap at all, since the largest block in that list is 10MB in
size (/usr/java/jdk1.6.0_10/jre/lib/amd64/server/libjvm.so). However, pmap
reports a total usage of 1.56GB.

-- Stefan



> From: Raghu Angadi 
> Reply-To: 
> Date: Sun, 10 May 2009 20:06:13 -0700
> To: 
> Subject: Re: Huge DataNode Virtual Memory Usage
> 
> what do 'jmap' and 'jmap -histo:live' show?.
> 
> Raghu.
> 
> Stefan Will wrote:
>> Chris,
>> 
>> Thanks for the tip ... However I'm already running 1.6_10:
>> 
>> java version "1.6.0_10"
>> Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
>> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)
>> 
>> Do you know of a specific bug # in the JDK bug database that addresses this
>> ?
>> 
>> Cheers,
>> Stefan
>> 
>> 
>>> From: Chris Collins 
>>> Reply-To: 
>>> Date: Fri, 8 May 2009 20:34:21 -0700
>>> To: "core-user@hadoop.apache.org" 
>>> Subject: Re: Huge DataNode Virtual Memory Usage
>>> 
>>> Stefan, there was a nasty memory leak in in 1.6.x before 1.6 10.  It
>>> manifested itself during major GC.  We saw this on linux and solaris
>>> and dramatically improved with an upgrade.
>>> 
>>> C
>>> On May 8, 2009, at 6:12 PM, Stefan Will wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I just ran into something rather scary: One of my datanode processes
>>>> that
>>>> I'm running with -Xmx256M, and a maximum number of Xceiver threads
>>>> of 4095
>>>> had a virtual memory size of over 7GB (!). I know that the VM size
>>>> on Linux
>>>> isn't necessarily equal to the actual memory used, but I wouldn't
>>>> expect it
>>>> to be an order of magnitude higher either. I ran pmap on the
>>>> process, and it
>>>> showed around 1000 thread stack blocks with roughly 1MB each (which
>>>> is the
>>>> default size on the 64bit JDK). The largest block was 3GB in size
>>>> which I
>>>> can't figure out what it is for.
>>>> 
>>>> Does anyone have any insights into this ? Anything that can be done to
>>>> prevent this other than to restart the DFS regularly ?
>>>> 
>>>> -- Stefan
>> 
>> 




Re: How to do load control of MapReduce

2009-05-11 Thread Stefan Will
I'm having similar performance issues and have been running my Hadoop
processes using a nice level of 10 for a while, and haven't noticed any
improvement.

In my case, I believe what's happening is that the peak combined RAM usage
of all the Hadoop task processes and the service processes exceeds the
amount of RAM on my machines. This in turn causes part of the server
processes to get paged out to disk while the nightly Hadoop batch processes
are running. Since the swap space is typically on the same physical disks as
the DFS and MapReduce working directories, I'm heavily I/O bound and real-time
queries pretty much slow down to a crawl.

I think the key is to make absolutely sure that all of your processes fit in
your available RAM at all times. I'm actually having a hard time achieving
this since the virtual memory usage of the JVM is usually way higher than
the maximum heap size (see my other thread).
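To put rough numbers on it with the kind of heap settings mentioned elsewhere in
this thread (1GB DataNode + 1.5GB RegionServer + 1GB TaskTracker + 6 x 512MB
child tasks): that is already 6.5GB of configured heap on an 8GB box, leaving
only about 1.5GB for per-process JVM overhead and the OS page cache before
swapping starts.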

-- Stefan


> From: zsongbo 
> Reply-To: 
> Date: Tue, 12 May 2009 10:58:49 +0800
> To: 
> Subject: Re: How to do load control of MapReduce
> 
> Thanks Billy, I am trying 'nice', and will report the result later.
> 
> On Tue, May 12, 2009 at 3:42 AM, Billy Pearson
> wrote:
> 
>> Might try setting the tasktracker's Linux nice level to say 5 or 10,
>> leaving the dfs and hbase settings at 0
>> 
>> Billy
>> "zsongbo"  wrote in message
>> news:fa03480d0905110549j7f09be13qd434ca41c9f84...@mail.gmail.com...
>> 
>>  Hi all,
>>> Now, if we have a large dataset to process with MapReduce, the MapReduce job
>>> will take as many machine resources as possible.
>>> 
>>> So when such a big MapReduce job is running, the cluster becomes very busy
>>> and almost cannot do anything else.
>>> 
>>> For example, we have an HDFS+MapReduce+HBase cluster.
>>> There is a large dataset in HDFS to be processed by MapReduce periodically,
>>> and the workload is CPU- and I/O-heavy. The cluster also provides other
>>> services for queries (querying HBase and reading files in HDFS). So, when
>>> the job is running, the query latency becomes very long.
>>> 
>>> Since the MapReduce job is not time sensitive, I want to control the load of
>>> MapReduce. Do you have any advice?
>>> 
>>> Thanks in advance.
>>> Schubert
>>> 
>>> 
>> 
>> 




Re: Huge DataNode Virtual Memory Usage

2009-05-09 Thread Stefan Will
Chris,

Thanks for the tip ... However I'm already running 1.6_10:

java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Do you know of a specific bug # in the JDK bug database that addresses this
?

Cheers,
Stefan


> From: Chris Collins 
> Reply-To: 
> Date: Fri, 8 May 2009 20:34:21 -0700
> To: "core-user@hadoop.apache.org" 
> Subject: Re: Huge DataNode Virtual Memory Usage
> 
> Stefan, there was a nasty memory leak in in 1.6.x before 1.6 10.  It
> manifested itself during major GC.  We saw this on linux and solaris
> and dramatically improved with an upgrade.
> 
> C
> On May 8, 2009, at 6:12 PM, Stefan Will wrote:
> 
>> Hi,
>> 
>> I just ran into something rather scary: One of my datanode processes
>> that
>> I'm running with -Xmx256M, and a maximum number of Xceiver threads
>> of 4095
>> had a virtual memory size of over 7GB (!). I know that the VM size
>> on Linux
>> isn't necessarily equal to the actual memory used, but I wouldn't
>> expect it
>> to be an order of magnitude higher either. I ran pmap on the
>> process, and it
>> showed around 1000 thread stack blocks with roughly 1MB each (which
>> is the
>> default size on the 64bit JDK). The largest block was 3GB in size
>> which I
>> can't figure out what it is for.
>> 
>> Does anyone have any insights into this ? Anything that can be done to
>> prevent this other than to restart the DFS regularly ?
>> 
>> -- Stefan




Huge DataNode Virtual Memory Usage

2009-05-08 Thread Stefan Will
Hi,

I just ran into something rather scary: One of my datanode processes that
I'm running with -Xmx256M, and a maximum number of Xceiver threads of 4095
had a virtual memory size of over 7GB (!). I know that the VM size on Linux
isn't necessarily equal to the actual memory used, but I wouldn't expect it
to be an order of magnitude higher either. I ran pmap on the process, and it
showed around 1000 thread stack blocks with roughly 1MB each (which is the
default size on the 64bit JDK). The largest block was 3GB in size which I
can't figure out what it is for.

Does anyone have any insights into this ? Anything that can be done to
prevent this other than to restart the DFS regularly ?

-- Stefan


Re: Reduce task attempt retry strategy

2009-04-13 Thread Stefan Will
Jothi, thanks for the explanation. One question though: why shouldn't timed
out tasks be retried on a different machine ? As you pointed out, it could
very well have been due to the machine having problems. To me a timeout is
just like any other kind of failure.

-- Stefan


> From: Jothi Padmanabhan 
> Reply-To: 
> Date: Mon, 13 Apr 2009 19:00:38 +0530
> To: 
> Subject: Re: Reduce task attempt retry strategy
> 
> Currently, only failed tasks are attempted on a node other than the one
> where it failed. For killed tasks, there is no such policy for retries.
> 
> "failed to report status" usually indicates that the task did not report
> sufficient progress. However, it is possible that the task itself was not
> progressing fast enough because the machine where it ran had problems.
> 
> 
> On 4/8/09 12:33 AM, "Stefan Will"  wrote:
> 
>> My cluster has 27 nodes with a total reduce task capacity of 54. The job had
>> 31 reducers. I actually had a task today that showed the behavior you're
>> describing: 3 tries on one machine, and then the 4th on a different one.
>> 
>> As for the particular job I was talking about before:
>> 
>> Here are the stats for the job:
>> 
>> Kind     Total tasks  Successful  Failed  Killed  Start Time           Finish Time
>> Setup    1            1           0       0       4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
>> Map      64           49          12      3       4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
>> Reduce   34           30          4       0       4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
>> Cleanup  4            0           4       0       4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)
>> (Total tasks = successful + failed + killed)
>> 
>> 
>> Not sure what to look for in the jobtracker log. All it shows for that
>> particular failed task is that it assigned it to the same machine 4 times
>> and then eventually failed. Perhaps something to note is that the 4 failures
>> were all due to timeouts:
>> 
>> "Task attempt_200904031942_0002_r_13_3 failed to report status for 1802
>> seconds. Killing!"
>> 
>> Also, looking at the logs, there was a map task too that was retried on that
>> particular box 4 times without going to a different one. Perhaps it had
>> something to do with the way this machine failed: The jobtracker still
>> considered it live, while all actual tasks assigned to it timed out.
>> 
>> -- Stefan
>> 
>> 
>> 
>>> From: Amar Kamat 
>>> Reply-To: 
>>> Date: Tue, 07 Apr 2009 10:05:16 +0530
>>> To: 
>>> Subject: Re: Reduce task attempt retry strategy
>>> 
>>> Stefan Will wrote:
>>>> Hi,
>>>> 
>>>> I had a flaky machine the other day that was still accepting jobs and
>>>> sending heartbeats, but caused all reduce task attempts to fail. This in
>>>> turn caused the whole job to fail because the same reduce task was retried
>>>> 3
>>>> times on that particular machine.
>>>>   
>>> What is your cluster size? If a task fails on a machine then it is
>>> re-tried on some other machine (based on the number of good machines left in
>>> the cluster). After a certain number of failures, the machine will be
>>> blacklisted (again based on the number of machines left in the cluster). 3
>>> different reducers might be scheduled on that machine, but that should
>>> not lead to job failure. Can you explain in detail what exactly
>>> happened? Find out where the attempts got scheduled from the
>>> jobtracker's log.
>>> Amar
>>>> Perhaps I'm confusing this with the block placement strategy in hdfs, but I
>>>> always thought that the framework would retry jobs on a different machine
>>>> if
>>>> retries on the original machine keep failing. E.g. I would have expected to
>>>> retry once or twice on the same machine, but then switch to a different one
>>>> to minimize the likelihood of getting stuck on a bad machine.
>>>> 
>>>> What is the expected behavior in 0.19.1 (which I'm running) ? Any plans for
>>>> improving on this in the future ?
>>>> 
>>>> Thanks,
>>>> Stefan
>>>> 
>>>>   
>> 
>> 




Re: Reduce task attempt retry strategy

2009-04-07 Thread Stefan Will
My cluster has 27 nodes with a total reduce task capacity of 54. The job had
31 reducers. I actually had a task today that showed the behavior you're
describing: 3 tries on one machine, and then the 4th on a different one.

As for the particular job I was talking about before:

Here are the stats for the job:

Kind     Total tasks  Successful  Failed  Killed  Start Time           Finish Time
Setup    1            1           0       0       4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
Map      64           49          12      3       4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
Reduce   34           30          4       0       4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
Cleanup  4            0           4       0       4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)
(Total tasks = successful + failed + killed)


Not sure what to look for in the jobtracker log. All it shows for that
particular failed task is that it assigned it to the same machine 4 times
and then eventually failed. Perhaps something to note is that the 4 failures
were all due to timeouts:

"Task attempt_200904031942_0002_r_13_3 failed to report status for 1802
seconds. Killing!"

Also, looking at the logs, there was a map task too that was retried on that
particular box 4 times without going to a different one. Perhaps it had
something to do with the way this machine failed: The jobtracker still
considered it live, while all actual tasks assigned to it timed out.

-- Stefan



> From: Amar Kamat 
> Reply-To: 
> Date: Tue, 07 Apr 2009 10:05:16 +0530
> To: 
> Subject: Re: Reduce task attempt retry strategy
> 
> Stefan Will wrote:
>> Hi,
>> 
>> I had a flaky machine the other day that was still accepting jobs and
>> sending heartbeats, but caused all reduce task attempts to fail. This in
>> turn caused the whole job to fail because the same reduce task was retried 3
>> times on that particular machine.
>>   
> What is your cluster size? If a task fails on a machine then it is
> re-tried on some other machine (based on the number of good machines left in
> the cluster). After a certain number of failures, the machine will be
> blacklisted (again based on the number of machines left in the cluster). 3
> different reducers might be scheduled on that machine, but that should
> not lead to job failure. Can you explain in detail what exactly
> happened? Find out where the attempts got scheduled from the
> jobtracker's log.
> Amar
>> Perhaps I'm confusing this with the block placement strategy in hdfs, but I
>> always thought that the framework would retry jobs on a different machine if
>> retries on the original machine keep failing. E.g. I would have expected to
>> retry once or twice on the same machine, but then switch to a different one
>> to minimize the likelihood of getting stuck on a bad machine.
>> 
>> What is the expected behavior in 0.19.1 (which I'm running) ? Any plans for
>> improving on this in the future ?
>> 
>> Thanks,
>> Stefan
>> 
>>   




Reduce task attempt retry strategy

2009-04-06 Thread Stefan Will
Hi,

I had a flaky machine the other day that was still accepting jobs and
sending heartbeats, but caused all reduce task attempts to fail. This in
turn caused the whole job to fail because the same reduce task was retried 3
times on that particular machine.

Perhaps I'm confusing this with the block placement strategy in hdfs, but I
always thought that the framework would retry jobs on a different machine if
retries on the original machine keep failing. E.g. I would have expected to
retry once or twice on the same machine, but then switch to a different one
to minimize the likelihood of getting stuck on a bad machine.

What is the expected behavior in 0.19.1 (which I'm running) ? Any plans for
improving on this in the future ?

Thanks,
Stefan


Re: intermediate results not getting compressed

2009-03-19 Thread Stefan Will
I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not any intermediate files produced. The reducer
will decompress the map output files after copying them, and then compress
its own output only after it has finished.

I wonder if this is by design, or just an oversight.
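For reference, a rough sketch of how the two kinds of compression are switched
on with the 0.19-era mapred API (map-output compression and final job-output
compression are separate knobs; whether either one reaches the intermediate
merge files is exactly the question in this thread):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionSettings {
  public static void configure(JobConf conf) {
    // Compress the map outputs that get spilled and shuffled to the reducers
    // (mapred.compress.map.output).
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // Compress the final output written by the reducers (mapred.output.compress).
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}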

-- Stefan


> From: Billy Pearson 
> Reply-To: 
> Date: Wed, 18 Mar 2009 22:14:07 -0500
> To: 
> Subject: Re: intermediate results not getting compressed
> 
> I can run head on the map.out files and I get compressed garbage, but when I run
> head on an intermediate file I can read the data in the file clearly, so
> compression is not getting passed along, even though I am setting CompressMapOutput
> to true by default in my hadoop-site.xml file.
> 
> Billy
> 
> 
> "Billy Pearson" 
> wrote in message news:gpscu3$66...@ger.gmane.org...
>> the intermediate.X files are not getting compressed for some reason, not
>> sure why.
>> I downloaded and built the latest branch for 0.19.
>> 
>> o.a.h.mapred.Merger.class line 432
>> new Writer(conf, fs, outputFile, keyClass, valueClass, codec);
>> 
>> this seems to use the codec defined above, but for some reason it's not
>> working correctly: the compression is not passed from the map output files
>> to the on-disk merge of the intermediate.X files
>> 
>> tail task report from one server:
>> 
>> 2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask:
>> Interleaved on-disk merge complete: 1730 files left.
>> 2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask:
>> In-memory merge complete: 3 files left.
>> 2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping
>> 3 segments, 39835369 bytes in memory for intermediate, on-disk merge
>> 2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging
>> 1730 files, 70359998581 bytes from disk
>> 2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging
>> 0 segments, 0 bytes from memory into reduce
>> 2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733
>> sorted segments
>> 2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22
>> intermediate segments out of a total of 1733
>> 2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1712
>> 2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1683
>> 2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1654
>> 2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1625
>> 2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1596
>> 2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1567
>> 2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1538
>> 2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1509
>> 2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1480
>> 2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1451
>> 2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1422
>> 2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1393
>> 2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1364
>> 2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1335
>> 2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30
>> intermediate segments out of a total of 1306
>> 
>> See, the size of the files is about ~70GB (70359998581); these are
>> compressed at this point. It's gone from 1733 files to 1306 left to merge,
>> and the intermediate.X files are well over 200GB at this point and we are
>> not even close to done. If compression were working we should not see tasks
>> failing at this point in the job because of lack of hard drive space, since
>> as we merge we delete the merged file from the output folder.
>> 
>> I only see this happening when there are too many files left that did not
>> get merged during the shuffle stage and it starts on-disk merging.
>> The tasks that complete the merges and keep it below the io.sort size (in my
>> case 30) skip the on-disk merge and complete using normal hard drive
>> space.
>> 
>> Anyone care to take a look?
>> This job takes two or more days to get to this point, so it's getting kind of a
>> pain in the butt to run and watch the reduces fail and the job keep
>

Re: Massive discrepancies in job's bytes written/read

2009-03-17 Thread Stefan Will
Some of the discrepancy could be due to compression of the map input/output
formats. E.g. the mapper output bytes will show the compressed size, while
the reduce input will show the uncompressed size, or something along those
lines. But I'm also questioning the accuracy of the reporting and suspect
that some of it is due to all the disk activity that happens while
processing spills in the mapper and the copy/shuffle/sort phase in the
reducer. It would certainly be nice if all the byte counts were reported in
a way that makes them comparable.

-- Stefan


> From: Bryan Duxbury 
> Reply-To: 
> Date: Tue, 17 Mar 2009 17:26:52 -0700
> To: 
> Subject: Massive discrepancies in job's bytes written/read
> 
> Hey all,
> 
> In looking at the stats for a number of our jobs, the amount of data
> that the UI claims we've read from or written to HDFS is vastly
> larger than the amount of data that should be involved in the job.
> For instance, we have a job that combines small files into big files
> that we're operating on around 2TB worth of data. The outputs in HDFS
> (via hadoop dfs -du) matches the expected size, but the jobtracker UI
> claims that we've read and written around 22TB of data!
> 
> By all accounts, Hadoop is actually *doing* the right thing - we're
> not observing excess data reading or writing anywhere. However, this
> massive discrepancy makes the job stats essentially worthless for
> understanding IO in our jobs.
> 
> Does anyone know why there's such an enormous difference? Have others
> experienced this problem?
> 
> -Bryan




Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-02-17 Thread Stefan Will
Yes, I am using MultipleOutputFormat with the side-band output happening in
the Mapper. But there is also a Reducer that uses just the regular
OutputCollector.
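For context, the kind of per-key side-band output MultipleOutputFormat enables
looks roughly like this (a generic sketch, not Stefan's actual code; routing by
key is just an example):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record into a file named after its key, so one job writes
// several output files per task instead of a single part-NNNNN file.
public class KeyedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // "name" is the default part-NNNNN leaf name; prefix it with the key.
    return key.toString() + "/" + name;
  }
}

// Wired into the job with conf.setOutputFormat(KeyedTextOutputFormat.class).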

-- Stefan


> From: Rasit OZDAS 
> Reply-To: 
> Date: Tue, 17 Feb 2009 15:43:12 +0200
> To: 
> Subject: Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0
> 
> Stefan and Thibaut, are you using MultipleOutputFormat, and how many
> reducers do you have?
> if you're using MultipleOutputFormat and have no reducer, there is a JIRA
> ticket about this issue.
> https://issues.apache.org/jira/browse/HADOOP-5268
> 
> Or there is a different JIRA issue (it's not resolved yet, but gives some
> underlying info)
> https://issues.apache.org/jira/browse/HADOOP-4264
> 
> Or this issue (not resolved):
> https://issues.apache.org/jira/browse/HADOOP-1583
> 
> Rasit
> 
> 2009/2/16 Thibaut_ 
> 
>> 
>> I have the same problem.
>> 
>> is there any solution to this?
>> 
>> Thibaut
>> 
>> 
>> --
>> View this message in context:
>> http://www.nabble.com/AlredyBeingCreatedExceptions-after-upgrade-to-0.19.0-tp
>> 21631077p22043484.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> 
>> 
> 
> 
> -- 
> M. Raşit ÖZDAŞ




TaskTrackers being double counted after restart job recovery

2009-02-09 Thread Stefan Will
Hi,

I'm using the new persistent job state feature in 0.19.0, and it's worked
really well so far. However, this morning my JobTracker died with an OOM
error (even though the heap size is set to 768M). So I killed it and all the
TaskTrackers. After starting everything up again, all my nodes were showing
up twice in the JobTracker web interface, with different port numbers. Also,
some of the jobs it restarted had already been completed when the job
tracker died.

 Any idea what might be happening here ? How can I fix this ? Will
temporarily setting mapred.jobtracker.restart.recover=false clear things up
?

-- Stefan


AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-01-23 Thread Stefan Will
Hi,

Since I've upgraded to 0.19.0, I've been getting the following exceptions
when restarting jobs, or even when a failed reducer is being restarted by
the job tracker. It appears that stale file locks in the namenode don't get
properly released sometimes:

org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file ... for DFSClient_attempt_200901211615_0077_r_10_3 on client
10.1.20.119 because current leaseholder is trying to recreate file.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSName
system.java:1052)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.j
ava:995)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:301)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

Has anyone run into this ? Any workarounds/fixes ?

Thanks,
Stefan



Re: Merging reducer outputs into a single part-00000 file

2009-01-11 Thread Stefan Will
Jim,

As far as I know, there is no difference in terms of the number of output
partitions relative to the OutputFormat used.

If you want to sample your output file, I'd suggest you write a new MR job
that uses a random number generator to sample your output files, outputs
text key/value pairs in the mapper, and uses exactly one reducer with
TextOutputFormat. You don't even need to supply a reducer class if your
mapper outputs Text/Text key/value pairs.
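A minimal sketch of such a sampling mapper (the class name and the 1-in-1000
rate are just illustrative, and a Text/Text input format such as
KeyValueTextInputFormat is assumed):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits roughly one in every 1000 input records. With Text/Text output, the
// default (identity) reducer, a single reduce task and TextOutputFormat, the
// job produces one part-00000 file sorted by key.
public class SamplingMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private final Random random = new Random();

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (random.nextInt(1000) == 0) {
      output.collect(key, value);
    }
  }
}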

-- Stefan


> From: Jim Twensky 
> Reply-To: 
> Date: Sun, 11 Jan 2009 01:55:35 -0600
> To: 
> Subject: Merging reducer outputs into a single part-00000 file
> 
> Hello,
> 
> The original map-reduce paper states: "After successful completion, the
> output of the map-reduce execution is available in the R output files (one
> per reduce task, with file names as specified by the user)." However, when
> using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
> single file called part-00000. I was wondering how and when this merging
> process is done. When the reducer calls output.collect(key,value), is this
> record written to a local temporary output file on the reducer's disk and
> then these local files (a total of R) are later merged into one single file
> with a final thread, or is it directly written to the final output file
> (part-00000)? I am asking this because I'd like to get an ordered sample of
> the final output data, i.e. one record per every 1000 records or something
> similar, and I don't want to run a serial process that iterates over the
> final output file.
> 
> Thanks,
> Jim





Re: NameNode fatal crash - 0.18.1

2009-01-08 Thread Stefan Will
I did some more poking around the name node code to find out what might have
caused it to deem the storage directories to be inaccessible. Apparently it
does this when it catches an IOException while trying to write to or append
to the edit log.

I guess that makes sense, but unfortunately it doesn't log the actual
exception:

try {
    EditLogFileOutputStream eStream =
        new EditLogFileOutputStream(getEditNewFile(sd));
    eStream.create();
    editStreams.add(eStream);
} catch (IOException e) {
    // remove stream and this storage directory from list
    processIOError(sd);
    it.remove();
}

-- Stefan


> From: Stefan Will 
> Reply-To: 
> Date: Thu, 08 Jan 2009 11:16:29 -0800
> To: 
> Subject: Re: NameNode fatal crash - 0.18.1
> 
> Jonathan,
> 
> It looks like the same thing happened to us today (on Hadoop 0.19.0). We
> were running a nightly backup, and at some point, the namenode said:
> 
> 
> 2009-01-08 05:57:28,021 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> commitBlockSynchronization(lastblock=blk_2140680350762285267_117754,
> newgenerationstamp=117757, newlength=44866560, newtargets=[10.1.20.
> 116:50010, 10.1.20.111:50010])
> 2009-01-08 05:57:30,270 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit
> log. Fatal Error.
> 2009-01-08 05:57:30,882 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> storage directories are inaccessible.
> 2009-01-08 05:57:31,072 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> 
> Now when I try and start up the namenode again, I get an EOFException:
> 
> 
> 2009-01-08 10:41:45,465 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> initialization failed.
> java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at java.io.DataInputStream.readLong(DataInputStream.java:399)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.ja
> va:549)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:540)
> at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.j
> ava:227)
> at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.j
> ava:216)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage
> .java:289)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.j
> ava:87)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
> java:311)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
> :290)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
> )
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
> :859)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> 
> 
> Did you ever figure out why your backup caused this to happen ? Our backup
> wasn't even touching the Hadoop partitions on the master. Were you able to
> recover your DFS state ?
> 
> -- Stefan
> 
> 
>> From: Jonathan Gray 
>> Reply-To: 
>> Date: Mon, 15 Dec 2008 12:35:39 -0800
>> To: 
>> Subject: RE: NameNode fatal crash - 0.18.1
>> 
>> I have fixed the issue with the SecondaryNameNode not contacting primary
>> with the 'dfs.http.address' config option.
>> 
>> Other issues still unsolved.
>> 
>>> -Original Message-
>>> From: Jonathan Gray [mailto:jl...@streamy.com]
>>> Sent: Monday, December 15, 2008 10:55 AM
>>> To: core-user@hadoop.apache.org
>>> Subject: NameNode fatal crash - 0.18.1
>>> 
>>> I have a 10+1 node cluster, each slave running
>>> DataNode/TaskTracker/HBase
>>> RegionServer.
>>> 
>>> At the time of this crash, NameNode and SecondaryNameNode were both
>>> hosted
>>> on same master.
>>> 
>>> We do a nightly backup and about 95% of the way through, HDFS crashed
>>> with...
>>> 
>>> NameNode shows:
>>> 
>>> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable
>>> to
>>> sync edit log. Fatal Error.
>>> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal
>>> Error
>>> : All storage directo

Re: NameNode fatal crash - 0.18.1

2009-01-08 Thread Stefan Will
Jonathan,

It looks like the same thing happened to us today (on Hadoop 0.19.0). We
were running a nightly backup, and at some point, the namenode said:


2009-01-08 05:57:28,021 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=blk_2140680350762285267_117754,
newgenerationstamp=117757, newlength=44866560, newtargets=[10.1.20.
116:50010, 10.1.20.111:50010])
2009-01-08 05:57:30,270 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit
log. Fatal Error.
2009-01-08 05:57:30,882 FATAL
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
storage directories are inaccessible.
2009-01-08 05:57:31,072 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

Now when I try and start up the namenode again, I get an EOFException:


2009-01-08 10:41:45,465 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.ja
va:549)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:540)
at 
org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.j
ava:227)
at 
org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.j
ava:216)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage
.java:289)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.j
ava:87)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
java:311)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
:290)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
:859)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)


Did you ever figure out why your backup caused this to happen ? Our backup
wasn't even touching the Hadoop partitions on the master. Were you able to
recover your DFS state ?

-- Stefan


> From: Jonathan Gray 
> Reply-To: 
> Date: Mon, 15 Dec 2008 12:35:39 -0800
> To: 
> Subject: RE: NameNode fatal crash - 0.18.1
> 
> I have fixed the issue with the SecondaryNameNode not contacting primary
> with the 'dfs.http.address' config option.
> 
> Other issues still unsolved.
> 
>> -Original Message-
>> From: Jonathan Gray [mailto:jl...@streamy.com]
>> Sent: Monday, December 15, 2008 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: NameNode fatal crash - 0.18.1
>> 
>> I have a 10+1 node cluster, each slave running
>> DataNode/TaskTracker/HBase
>> RegionServer.
>> 
>> At the time of this crash, NameNode and SecondaryNameNode were both
>> hosted
>> on same master.
>> 
>> We do a nightly backup and about 95% of the way through, HDFS crashed
>> with...
>> 
>> NameNode shows:
>> 
>> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable
>> to
>> sync edit log. Fatal Error.
>> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal
>> Error
>> : All storage directories are inaccessible.
>> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode:
>> SHUTDOWN_MSG:
>> 
>> Every single DataNode shows:
>> 
>> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode:
>> java.io.IOException: Call failed on local exception
>> at org.apache.hadoop.ipc.Client.call(Client.java:718)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>> at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>> at
>> org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>> at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>> at java.lang.Thread.run(Thread.java:636)
>> Caused by: java.io.EOFException
>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> at
>> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499
>> )
>> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>> 
>> 
>> This is virtually all of the information I have.  At the same time as
>> the
>> backup, we have normal HBase traffic and our hourly batch MR jobs.  So
>> slave
>> nodes were pretty heavily loaded, but don't see anything in DN logs
>> besides
>> this Call failed.  There are no space issues or anything else, Ganglia
>> shows
>> high CPU load around this time which has been typical every night, but
>> I
>> don't see any issues in DN's or NN about expired leases/no
>> heartbeats/etc.
>> 
>> Is there a way to preve

Re: Datanode not detecting full disk

2008-10-29 Thread Stefan Will
Hi Raghu,

Each DN machine has 3 partitions, e.g.:

FilesystemSize  Used Avail Use% Mounted on
/dev/sda1  20G  8.0G   11G  44% /
/dev/sda3 1.4T  756G  508G  60% /data
tmpfs 3.9G 0  3.9G   0% /dev/shm

All of the paths in hadoop-site.xml point to /data, which is the partition
that filled up to 100% (I deleted a bunch of files from HDFS since then). So
I guess the question is whether the DN looks at just the partition its data
directory is on, or all partitions when it determines disk usage.
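For what it's worth, free-space queries at the Java level are inherently
per-partition: you ask about a path and get numbers for whichever filesystem
that path lives on. A quick sketch using plain Java 6 java.io.File methods (not
the DataNode's actual code):

import java.io.File;

public class PartitionSpace {
  public static void main(String[] args) {
    // Each path reports on the partition it lives on, so "/" and "/data"
    // give independent numbers, matching the separate df lines above.
    for (String path : new String[] {"/", "/data"}) {
      File f = new File(path);
      System.out.printf("%-6s total=%dG usable=%dG%n",
          path, f.getTotalSpace() >> 30, f.getUsableSpace() >> 30);
    }
  }
}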

-- Stefan


> From: Raghu Angadi <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Wed, 29 Oct 2008 11:57:07 -0700
> To: 
> Subject: Re: Datanode not detecting full disk
> 
> Stefan Will wrote:
>> Hi Jeff,
>> 
>> Yeah, it looks like I'm running into the issues described in the bug. I'm
>> running 0.18.1 on CentOS 5 by the way. Measuring available disk space
>> appears to be harder than I thought ... and here I was under the impression
>> the percentage in df was a pretty clear indicator of how full the disk is
>> ;-)
>> 
>> How did you guys solve/work around this ?
> 
> How many partitions do you have? If it is just one and the NameNode thinks
> it has space even though 'available' in df shows very little or no space, then
> you need to file a JIRA. There should be no case where the DN reports more
> space than what the 'available' field in 'df' shows.
> 
> But if you have more partitions and only some of them are full, then it
> is a different issue, which should still be fixed.
> 
> Raghu.
> 
>> -- Stefan
>> 
>>  
>>> From: Jeff Hammerbacher <[EMAIL PROTECTED]>
>>> Reply-To: 
>>> Date: Mon, 27 Oct 2008 12:40:08 -0700
>>> To: 
>>> Subject: Re: Datanode not detecting full disk
>>> 
>>> Hey Stefan,
>>> 
>>> We used to have trouble with this issue at Facebook. What version are
>>> you running? You might get more information on this ticket:
>>> https://issues.apache.org/jira/browse/HADOOP-2991.
>>> 
>>> Regards,
>>> Jeff
>>> 
>>> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:
>>>> Each of my datanodes has  a system and a data partition, with dfs.data.dir
>>>> pointed to the data partition. The data partition just filled up to 100% on
>>>> all of my nodes (as evident via df), but the NameNode web ui still shows
>>>> them only 88-94% full (interestingly, the numbers differ even though the
>>>> machines are configured identically). I thought the datanodes used df to
>>>> determine free space ? How is the storage utilization determined ?
>>>> 
>>>> -- Stefan
>>>> 
>> 
>> 




Re: Datanode not detecting full disk

2008-10-28 Thread Stefan Will
Hi Jeff,

Yeah, it looks like I'm running into the issues described in the bug. I'm
running 0.18.1 on CentOS 5 by the way. Measuring available disk space
appears to be harder than I thought ... and here I was under the impression
the percentage in df was a pretty clear indicator of how full the disk is
;-)

How did you guys solve/work around this ?

-- Stefan

 
> From: Jeff Hammerbacher <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Mon, 27 Oct 2008 12:40:08 -0700
> To: 
> Subject: Re: Datanode not detecting full disk
> 
> Hey Stefan,
> 
> We used to have trouble with this issue at Facebook. What version are
> you running? You might get more information on this ticket:
> https://issues.apache.org/jira/browse/HADOOP-2991.
> 
> Regards,
> Jeff
> 
> On Mon, Oct 27, 2008 at 10:00 AM, Stefan Will <[EMAIL PROTECTED]> wrote:
>> Each of my datanodes has  a system and a data partition, with dfs.data.dir
>> pointed to the data partition. The data partition just filled up to 100% on
>> all of my nodes (as evident via df), but the NameNode web ui still shows
>> them only 88-94% full (interestingly, the numbers differ even though the
>> machines are configured identically). I thought the datanodes used df to
>> determine free space ? How is the storage utilization determined ?
>> 
>> -- Stefan
>> 




Datanode not detecting full disk

2008-10-27 Thread Stefan Will
Each of my datanodes has a system and a data partition, with dfs.data.dir
pointed to the data partition. The data partition just filled up to 100% on
all of my nodes (as evident via df), but the NameNode web ui still shows
them only 88-94% full (interestingly, the numbers differ even though the
machines are configured identically). I thought the datanodes used df to
determine free space ? How is the storage utilization determined ?

-- Stefan


Re: reduce task progress above 100%?

2008-09-16 Thread Stefan Will
Sorry to hijack the discussion, but I've been seeing the same behavior after
upgrading from 0.17.2 to 0.18.1-dev. I am using map output compression.

-- Stefan


> From: Arun C Murthy <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Tue, 16 Sep 2008 12:56:28 -0700
> To: 
> Subject: Re: reduce task progress above 100%?
> 
> 
> On Sep 16, 2008, at 12:26 PM, pvvpr wrote:
> 
>> Hello,
>>  A strange thing happened in my job. In the reduce phase, one of the tasks
>> showed 101.44% complete in its status, ran up to some 102%, and successfully
>> finished back at 100%. Is this the right behavior?
>> 
> 
> Which version of Hadoop are you running; and are you using compression
> for map-outputs?
> 
> Arun
> 
>>  I put a quick copy-paste of the web GUI reporting completion of the
>> tasks. (sorry for the bad formatting below).
>> 
>> thanks,
>> Prasad Pingali,
>> IIIT Hyderabad.
>> 
>> 
>> 
>> Task                             Complete  Status           Start Time            Finish Time  Errors  Counters
>> task_200809151235_0026_r_00       90.91%   reduce > reduce  15-Sep-2008 18:40:06                        0
>> task_200809151235_0026_r_04       86.13%   reduce > reduce  15-Sep-2008 18:40:11                        0
>> task_200809151235_0026_r_08       85.08%   reduce > reduce  15-Sep-2008 18:40:16                        0
>> task_200809151235_0026_r_12       82.57%   reduce > reduce  15-Sep-2008 18:40:21                        0
>> task_200809151235_0026_r_14       97.90%   reduce > reduce  15-Sep-2008 18:40:22                        0
>> task_200809151235_0026_r_17      101.44%   reduce > reduce  15-Sep-2008 23:34:04                        0
>> 
>> 




Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-10 Thread Stefan Will
I'll add a comment to the JIRA. I haven't tried the latest version of the patch
yet, but since it only changes the DFS client, not the datanode, I don't
see how it would help with this.

Two more things I noticed that happen when the datanodes become unresponsive
(i.e. the "Last Contact" field on the namenode keeps increasing) are:

1. The datanode process seems to be completely hung for a while, including
its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular
heartbeats.

To me this looks like there is some sort of temporary deadlock in the
datanode that keeps it from responding to requests. Perhaps it's the block
report being generated ?

-- Stefan

> From: Raghu Angadi <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Tue, 09 Sep 2008 16:40:02 -0700
> To: 
> Subject: Re: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-2/data
> 
> Espen Amble Kolstad wrote:
>> There's a JIRA on this already:
>> https://issues.apache.org/jira/browse/HADOOP-3831
>> Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
>> to do the trick for now.
> 
> Please comment on HADOOP-3831 that you are seeing this error.. so that
> it gets committed. Did you try the patch for HADOOP-3831?
> 
> thanks,
> Raghu.
> 
>> Espen
>> 
>> On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad <[EMAIL PROTECTED]> 
>> wrote:
>>> Hi,
>>> 
>>> Thanks for the tip!
>>> 
>>> I tried revision 692572 of the 0.18 branch, but I still get the same errors.
>>> 
>>> On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
 The DFS errors might have been caused by
 
 http://issues.apache.org/jira/browse/HADOOP-4040
 
 thanks,
 dhruba
 
 On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> These exceptions are apparently coming from the dfs side of things. Could
> someone from the dfs side please look at these?
> 
> On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>> Hi,
>> 
>> Thanks!
>> The patch applies without change to hadoop-0.18.0, and should be
>> included in a 0.18.1.
>> 
>> However, I'm still seeing:
>> in hadoop.log:
>> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
>> from blk_3428404120239503595_2664 of
>> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
>> somehost:50010: java.io.IOException: Premeture EOF from in
>> putStream
>> 
>> in datanode.log:
>> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
>> DatanodeRegistration(somehost:50010,
>> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
>> ipcPort=50020):Got exception while serving
>> blk_-4682098638573619471_2662 to
>> /somehost:
>> java.net.SocketTimeoutException: 48 millis timeout while waiting
>> for channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/somehost:50010
>> remote=/somehost:45244]
>> 
>> These entries in datanode.log happens a few minutes apart repeatedly.
>> I've reduced # map-tasks so load on this node is below 1.0 with 5GB of
>> free memory (so it's not resource starvation).
>> 
>> Espen
>> 
>> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
 I started a profile of the reduce-task. I've attached the profiling
 output. It seems from the samples that ramManager.waitForDataToMerge()
 doesn't actually wait.
 Has anybody seen this behavior.
>>> This has been fixed in HADOOP-3940
>>> 
>>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
 I have the same problem on our cluster.
 
 It seems the reducer-tasks are using all cpu, long before there's
 anything to
 shuffle.
 
 I started a profile of the reduce-task. I've attached the profiling
 output. It seems from the samples that ramManager.waitForDataToMerge()
 doesn't actually wait.
 Has anybody seen this behavior.
 
 Espen
 
 On Thursday 28 August 2008 06:11:42 wangxu wrote:
> Hi,all
> I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
> and running hadoop on one namenode and 4 slaves.
> attached is my hadoop-site.xml, and I didn't change the file
> hadoop-default.xml
> 
> when data in segments are large,this kind of errors occure:
> 
> java.io.IOException: Could not obtain block:
> blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-
> 2/data at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClie
> nt.jav a:1462) at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream

Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-09 Thread Stefan Will
I'm not sure whether this is the same issue or not, but on my 4-slave
cluster, setting the parameter below doesn't seem to fix the issue.

What I'm seeing is that occasionally data nodes stop responding for up to 10
minutes at a time. In this case, the TaskTrackers will mark the nodes as
dead, and occasionally the namenode will mark them as dead as well (you can
see the "Last Contact" time steadily increase for a random node every half
hour or so).

This seems to be happening during times of high disk utilization.

-- Stefan



> From: Espen Amble Kolstad <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Mon, 8 Sep 2008 12:40:01 +0200
> To: 
> Subject: Re: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-2/data
> 
> There's a JIRA on this already:
> https://issues.apache.org/jira/browse/HADOOP-3831
> Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
> to do the trick for now.
> 
> Espen
> 
> On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad <[EMAIL PROTECTED]> 
> wrote:
>> Hi,
>> 
>> Thanks for the tip!
>> 
>> I tried revision 692572 of the 0.18 branch, but I still get the same errors.
>> 
>> On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
>>> The DFS errors might have been caused by
>>> 
>>> http://issues.apache.org/jira/browse/HADOOP-4040
>>> 
>>> thanks,
>>> dhruba
>>> 
>>> On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
 These exceptions are apparently coming from the dfs side of things. Could
 someone from the dfs side please look at these?
 
 On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> Thanks!
> The patch applies without change to hadoop-0.18.0, and should be
> included in a 0.18.1.
> 
> However, I'm still seeing:
> in hadoop.log:
> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
> from blk_3428404120239503595_2664 of
> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
> somehost:50010: java.io.IOException: Premeture EOF from in
> putStream
> 
> in datanode.log:
> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
> DatanodeRegistration(somehost:50010,
> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
> ipcPort=50020):Got exception while serving
> blk_-4682098638573619471_2662 to
> /somehost:
> java.net.SocketTimeoutException: 48 millis timeout while waiting
> for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/somehost:50010
> remote=/somehost:45244]
> 
> These entries in datanode.log happens a few minutes apart repeatedly.
> I've reduced # map-tasks so load on this node is below 1.0 with 5GB of
> free memory (so it's not resource starvation).
> 
> Espen
> 
> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
>>> I started a profile of the reduce-task. I've attached the profiling
>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>> doesn't actually wait.
>>> Has anybody seen this behavior.
>> 
>> This has been fixed in HADOOP-3940
>> 
>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>> I have the same problem on our cluster.
>>> 
>>> It seems the reducer-tasks are using all cpu, long before there's
>>> anything to
>>> shuffle.
>>> 
>>> I started a profile of the reduce-task. I've attached the profiling
>>> output. It seems from the samples that ramManager.waitForDataToMerge()
>>> doesn't actually wait.
>>> Has anybody seen this behavior.
>>> 
>>> Espen
>>> 
>>> On Thursday 28 August 2008 06:11:42 wangxu wrote:
 Hi,all
 I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
 and running hadoop on one namenode and 4 slaves.
 attached is my hadoop-site.xml, and I didn't change the file
 hadoop-default.xml
 
 when data in segments are large,this kind of errors occure:
 
 java.io.IOException: Could not obtain block:
 blk_-2634319951074439134_1129
 file=/user/root/crawl_debug/segments/20080825053518/content/part-
 2/data at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClie
 nt.jav a:1462) at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.
 java:1 312) at
 org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:14
 17) at java.io.DataInputStream.readFully(DataInputStream.java:178)
 at
 org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.j
 ava:64 ) at
 org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102
 ) at
 org.apache.hadoop.io.SequenceFile$Reader.re