Re: How to debug a MapReduce application

2009-01-19 Thread Pedro Vivancos
I am terribly sorry. I made a mistake. This is the output I get:

09/01/19 07:59:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
09/01/19 07:59:45 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
09/01/19 07:59:45 INFO mapred.JobClient: Running job: job_local_0001
09/01/19 07:59:45 INFO mapred.MapTask: numReduceTasks: 1
09/01/19 07:59:45 INFO mapred.MapTask: io.sort.mb = 100
09/01/19 07:59:46 INFO mapred.MapTask: data buffer = 79691776/99614720
09/01/19 07:59:46 INFO mapred.MapTask: record buffer = 262144/327680
09/01/19 07:59:46 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
at
org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:504)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:295)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
09/01/19 07:59:46 ERROR memo.MemoAnnotationMerging: An error has occurred
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at
es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at
es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
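
A frequent cause of this NullPointerException in SerializationFactory.getSerializer — an observation from the stack trace, not something confirmed in this thread — is that the (map) output key/value classes resolve to types with no registered serializer, for example because they were never set or are not Writable. A minimal sketch of making them explicit on a 0.19-era JobConf; the class name and key/value types below are placeholders, not Pedro's code:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class OutputClassesSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(OutputClassesSketch.class);
    // Declare the final output types ...
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    // ... and the intermediate map output types, if they differ.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(LongWritable.class);
    System.out.println("map output key class: " + conf.getMapOutputKeyClass());
  }
}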



On Mon, Jan 19, 2009 at 8:47 AM, Pedro Vivancos
pedro.vivan...@vocali.net wrote:

 Thank you very much, but actually I would like to run my application as a
 standalone one.

 Anyway I tried to execute it in pseudo-distributed mode with that setup
 and this is what I got:

 09/01/19 07:45:24 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 0 time(s).
 09/01/19 07:45:25 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 1 time(s).
 09/01/19 07:45:26 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 2 time(s).
 09/01/19 07:45:27 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 3 time(s).
 09/01/19 07:45:28 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 4 time(s).
 09/01/19 07:45:29 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 5 time(s).
 09/01/19 07:45:30 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 6 time(s).
 09/01/19 07:45:31 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 7 time(s).
 09/01/19 07:45:32 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 8 time(s).
 09/01/19 07:45:33 INFO ipc.Client: Retrying connect to server: localhost/
 127.0.0.1:9000. Already tried 9 time(s).
 java.lang.RuntimeException: java.io.IOException: Call to localhost/
 127.0.0.1:9000 failed on local exception: Connection refused
 at
 org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:323)
 at
 org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(FileOutputFormat.java:118)
 at
 es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:156)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at 
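
For the standalone case Pedro mentions, the retries above just mean the client is still configured to look for a NameNode/JobTracker on localhost:9000. A minimal sketch — an assumption about the intent, not code from this thread — of forcing purely local, in-process execution:

import org.apache.hadoop.mapred.JobConf;

public class LocalModeSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(LocalModeSketch.class);
    conf.set("fs.default.name", "file:///");   // use the local filesystem, no NameNode needed
    conf.set("mapred.job.tracker", "local");   // run map/reduce in-process via the LocalJobRunner
    System.out.println("job tracker: " + conf.get("mapred.job.tracker"));
  }
}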

Re: Hadoop 0.17.1 = EOFException reading FSEdits file, what causes this? how to prevent?

2009-01-19 Thread Rasit OZDAS
I would prefer catching the EOFException in my own code,
assuming you are happy with the output before the exception occurs.

Hope this helps,
Rasit
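
A minimal, self-contained sketch of that catch-the-EOFException pattern — the length-prefixed record format below is hypothetical, not the real FSEditLog layout — showing the general idea of treating a truncated last record as end-of-log:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class TolerantEditReader {
  public static void main(String[] args) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
    int records = 0;
    try {
      while (true) {
        int len = in.readInt();      // hypothetical length-prefixed record
        byte[] buf = new byte[len];
        in.readFully(buf);           // throws EOFException if the record is truncated
        records++;
      }
    } catch (EOFException eof) {
      // A truncated tail is expected after a crash: keep what was read so far.
    } finally {
      in.close();
    }
    System.out.println("loaded " + records + " complete records");
  }
}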

2009/1/16 Konstantin Shvachko s...@yahoo-inc.com

 Joe,

 It looks like your edits file is corrupted or truncated.
 Most probably the last modification was not written to it
 when the name-node was turned off. This may happen if the
 node crashes, depending on the underlying local file system, I guess.

 Here are some options for you to consider:
 - try an alternative replica of the image directory if you had one.
 - try to edit the edits file if you know the internal format.
 - try to modify local copy of your name-node code, which should
 catch EOFException and ignore it.
 - Use a checkpointed image if you can afford to lose the latest modifications
 to the fs.
 - Formatting, of course, is the last resort since you lose everything.

 Thanks,
 --Konstantin


 Joe Montanez wrote:

 Hi:


 I'm using Hadoop 0.17.1 and I'm encountering EOFException reading the
 FSEdits file.  I don't have a clear understanding what is causing this
 and how to prevent this.  Has anyone seen this and can advise?


 Thanks in advance,

 Joe


 2009-01-12 22:51:45,573 ERROR org.apache.hadoop.dfs.NameNode:
 java.io.EOFException

at java.io.DataInputStream.readFully(DataInputStream.java:180)

at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)

at
 org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)

at
 org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:599)

at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)

at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)

at
 org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)

at
 org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)

at
 org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)

at
 org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)

at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)

at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)

at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)

at
 org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)

at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)


 2009-01-12 22:51:45,574 INFO org.apache.hadoop.dfs.NameNode:
 SHUTDOWN_MSG:






-- 
M. Raşit ÖZDAŞ


Re: Calling a mapreduce job from inside another

2009-01-19 Thread Sagar Naik
You can also play with the priority of the jobs to have the innermost 
job finish first


-Sagar

Devaraj Das wrote:

You can chain job submissions at the client. Also, you can run more than one
job in parallel (if you have enough task slots). An example of chaining jobs
is there in src/examples/org/apache/hadoop/examples/Grep.java where the jobs
grep-search and grep-sort are chained.
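
A minimal sketch of that client-side chaining with the 0.19-era API; IdentityMapper and the paths below are placeholders, not the Grep example itself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);    // output of job 1, input of job 2
    Path output = new Path(args[2]);

    JobConf first = new JobConf(ChainedJobs.class);
    first.setJobName("first-pass");
    first.setMapperClass(IdentityMapper.class);
    FileInputFormat.setInputPaths(first, input);
    FileOutputFormat.setOutputPath(first, temp);
    JobClient.runJob(first);          // blocks until the first job completes

    JobConf second = new JobConf(ChainedJobs.class);
    second.setJobName("second-pass");
    second.setMapperClass(IdentityMapper.class);
    FileInputFormat.setInputPaths(second, temp);
    FileOutputFormat.setOutputPath(second, output);
    JobClient.runJob(second);         // runs only after the first finishes
  }
}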


On 1/18/09 9:58 AM, Aditya Desai aditya3...@gmail.com wrote:

  

Is it possible to call a mapreduce job from inside another, and if yes, how?
And is it possible to disable the reducer completely, that is, end the job
immediately after the map phase has finished?
I have tried -reducer NONE. I am using the streaming API to code in Python.

Regards,
Aditya Desai.




  


Hadoop Error Message

2009-01-19 Thread Deepak Diwakar
Hi friends,

could somebody tell me what the following quoted message means?

 3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (15major+6092226minor)pagefaults 0swaps

The first part tells about system usage, but what is the rest? Is it because
of the heap size of the program?

I am running a hadoop task in standalone mode on almost 250GB of compressed
data.

This error message comes after the task finishes.

Thanks in advance,
-- 
- Deepak Diwakar,


Re: Hadoop Error Message

2009-01-19 Thread Miles Osborne
that is a timing / space report

Miles
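
For what it's worth, the quoted lines look like the standard resource-usage summary printed by GNU time(1), and the numbers are self-consistent; a quick check of the 120%CPU figure (plain arithmetic, not from the thread):

public class TimeReportCheck {
  public static void main(String[] args) {
    double userSec = 3154.42;                 // "3154.42user"
    double sysSec = 76.09;                    // "76.09system"
    double elapsedSec = 44 * 60 + 47.21;      // "44:47.21elapsed" wall clock
    // CPU% = (user + system) / elapsed; ~120% means the JVM kept slightly
    // more than one core busy on average, so this is accounting, not an error.
    System.out.printf("CPU%% = %.0f%%%n", 100.0 * (userSec + sysSec) / elapsedSec);
  }
}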

2009/1/19 Deepak Diwakar ddeepa...@gmail.com:
 Hi friends,

 could somebody tell me what the following quoted message means?

  3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata
 0maxresident)k
 0inputs+0outputs (15major+6092226minor)pagefaults 0swaps

 The first part tells about system usage, but what is the rest? Is it because
 of the heap size of the program?

 I am running a hadoop task in standalone mode on almost 250GB of compressed
 data.

 This error message comes after the task finishes.

 Thanks in advance,
 --
 - Deepak Diwakar,




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Windows Support

2009-01-19 Thread Dan Diephouse
I recognize that Windows support is, um, limited :-) But, any ideas what
exactly would need to be changed to support Windows (without cygwin) if
someone such as myself were so motivated? The most immediate thing I ran
into was the UserGroupInformation which would need a windows implementation.
I see there is an issue to switch to JAAS too, which may be the proper fix?
Are there lots of other things that would need to be changed?

I think it may be worth opening a JIRA for windows support and creating some
subtasks for the various issues, even if no one tackles them quite yet.

Thanks,
Dan

-- 
Dan Diephouse
http://netzooid.com/blog


Re: Hadoop Error Message

2009-01-19 Thread Deepak Diwakar
Thanks friend.


2009/1/19 Miles Osborne mi...@inf.ed.ac.uk

 that is a timing / space report

 Miles

 2009/1/19 Deepak Diwakar ddeepa...@gmail.com:
  Hi friends,
 
  could somebody tell me what the following quoted message means?
 
   3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata
  0maxresident)k
  0inputs+0outputs (15major+6092226minor)pagefaults 0swaps
 
  The first part tells about system usage, but what is the rest? Is it
  because of the heap size of the program?
 
  I am running a hadoop task in standalone mode on almost 250GB of
  compressed data.
 
  This error message comes after the task finishes.
 
  Thanks in advance,
  --
  - Deepak Diwakar,
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
- Deepak Diwakar,


Re: Windows Support

2009-01-19 Thread Dan Diephouse
On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org wrote:

 Dan Diephouse wrote:

 I recognize that Windows support is, um, limited :-) But, any ideas what
 exactly would need to be changed to support Windows (without cygwin) if
 someone such as myself were so motivated? The most immediate thing I ran
 into was the UserGroupInformation which would need a windows
 implementation.
 I see there is an issue to switch to JAAS too, which may be the proper
 fix?
 Are there lots of other things that would need to be changed?

 I think it may be worth opening a JIRA for windows support and creating
 some
 subtasks for the various issues, even if no one tackles them quite yet.

 Thanks,
 Dan


 I think a key one you need to address is motivation. Is cygwin that bad
 for a piece of server-side code?


No, I guess I was trying to get an idea of how much work it was. It seems
easy enough to supply a WindowsUserGroupInformation class (or a
platform-agnostic one). I wondered how many other things like this there were
before I put together a patch. It seems like bad Java practice to depend on
shell utilities :-). Not very platform-agnostic...
Dan

-- 
Dan Diephouse
http://netzooid.com/blog
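
As an aside on the shell-utilities point — a sketch, not Hadoop's actual UserGroupInformation code — the user name itself can be obtained without forking `whoami`, and works the same on Windows and Unix:

public class UserNameSketch {
  public static void main(String[] args) {
    // The JVM already knows the login name; no external process needed.
    String user = System.getProperty("user.name");
    System.out.println("current user: " + user);
  }
}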


Re: Windows Support

2009-01-19 Thread Chris K Wensel

Hey Dan

There is discussion/issue on this here:
https://issues.apache.org/jira/browse/HADOOP-4998

ckw

On Jan 19, 2009, at 8:55 AM, Dan Diephouse wrote:

On Mon, Jan 19, 2009 at 11:35 AM, Steve Loughran ste...@apache.org wrote:

 Dan Diephouse wrote:

 I recognize that Windows support is, um, limited :-) But, any ideas what
 exactly would need to be changed to support Windows (without cygwin) if
 someone such as myself were so motivated? The most immediate thing I ran
 into was the UserGroupInformation which would need a windows
 implementation.
 I see there is an issue to switch to JAAS too, which may be the proper
 fix?
 Are there lots of other things that would need to be changed?

 I think it may be worth opening a JIRA for windows support and creating
 some subtasks for the various issues, even if no one tackles them quite
 yet.

 Thanks,
 Dan


 I think a key one you need to address is motivation. Is cygwin that bad
 for a piece of server-side code?


No, I guess I was trying to get an idea of how much work it was. It seems
easy enough to supply a WindowsUserGroupInformation class (or a
platform-agnostic one). I wondered how many other things like this there
were before I put together a patch. It seems like bad Java practice to
depend on shell utilities :-). Not very platform-agnostic...
Dan

--
Dan Diephouse
http://netzooid.com/blog


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Java RMI and Hadoop RecordIO

2009-01-19 Thread David Alves

Hi
	I've been testing some different serialization techniques to go along
with a research project.
	I know the motivation behind the hadoop serialization mechanism (e.g.
Writable) and the enhancement of this feature through record I/O is not only
performance, but also control of the input/output.
	Still, I've been running some simple tests and I've found that plain
RMI beats Hadoop RecordIO almost every time (14-16% faster).
	In my test I have a simple java class that has 14 int fields and 1
long field, and I'm serializing around 35000 instances.
	Am I doing anything wrong? Are there ways to improve performance in
RecordIO? Have I got the use case wrong?


Regards
David Alves
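
For reference, a minimal sketch of what the hand-coded Writable version of such a record looks like — this is the Writable path David alludes to, not the generated Record I/O class, and the field names are made up:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class FourteenIntsOneLong implements Writable {
  private final int[] ints = new int[14];
  private long id;

  public void write(DataOutput out) throws IOException {
    for (int v : ints) {
      out.writeInt(v);       // 14 ints, 4 bytes each
    }
    out.writeLong(id);       // 1 long, 8 bytes
  }

  public void readFields(DataInput in) throws IOException {
    for (int i = 0; i < ints.length; i++) {
      ints[i] = in.readInt();
    }
    id = in.readLong();
  }
}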




Re: Performance testing

2009-01-19 Thread Sandeep Dhawan

Hi,

I am in the process of following your guidelines. 

I would like to know:

1. How can block size impact the performance of a mapred job?
2. Does the performance improve if I set up the NameNode and JobTracker on
different machines? At present, I am running the Namenode and JobTracker on
the same machine as the master, interconnected to 2 slave machines running
the Datanode and TaskTracker.
3. What should be the replication factor for a 3 node cluster?
4. How does io.sort.mb impact the performance of the cluster?

Thanks,
Sandeep 


Brian Bockelman wrote:
 
 Hey Sandeep,
 
 I'd do a couple of things:
 1) Run your test.  Do something which will be similar to your actual  
 workflow.
 2) Save the resulting Ganglia plots.  This will give you a hint as to  
 where things are bottlenecking (memory, CPU, wait I/O).
 3) Watch iostat and find out the I/O rates during the test.  Compare  
 this to the I/O rates of a known I/O benchmark (i.e., Bonnie++).
 4) Finally, watch the logfiles closely.  If you start to overload  
 things, you'll usually get a pretty good indication from Hadoop where  
 things go wrong.  Once something does go wrong, *then* look through  
 the parameters to see what can be done.
 
 There's about a hundred things which can go wrong between the kernel,  
 the OS, Java, and the application code.  It's difficult to make an  
 educated guess beforehand without some hint from the data.
 
 Brian
 
 On Dec 31, 2008, at 1:30 AM, Sandeep Dhawan wrote:
 

 Hi Brian,

 That's what my issue is i.e. How do I ascertain the bottleneck or  
 in other
 words if the results obtained after doing the performance testing  
 are not
 up to the mark, then how do I find the bottleneck.

 How can we confidently say that OS and hardware are the culprits. I
 understand that by using the latest OS and hardware can improve the
 performance irrespective of the application but my real worry is  
 What Next
 . How can I further increase the performance. What should I look  
 for which
 can suggest or point the areas which can be potential problems or  
 hotspot.

 Thanks for your comments.

 ~Sandeep~


 Brian Bockelman wrote:

 Hey Sandeep,

 I would warn against premature optimization: first, run your test,
 then see how far from your target you are.

 Of course, I'd wager you'd find that the hardware you are using is
 woefully underpowered and that your OS is 5 years old.

 Brian

 On Dec 30, 2008, at 5:57 AM, Sandeep Dhawan wrote:


 Hi,

 I am trying to create a hadoop cluster which can handle 2000 write
 requests
 per second.
 In each write request I would be writing a line of size 1KB in a file.

 I would be using machine having following configuration:
 Platform: Red Hat Linux 9.0
 CPU : 2.07 GHz
 RAM : 1GB

 Can anyone help in giving me some pointers/guideline as to how to go
 about
 setting up such a cluster.
 What are the configuration parameters in hadoop with which we can
 tweak to
 enhance the performance of the hadoop cluster.

 Thanks,
 Sandeep
 -- 
 View this message in context:
 http://www.nabble.com/Performance-testing-tp21216266p21216266.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




 -- 
 View this message in context:
 http://www.nabble.com/Performance-testing-tp21216266p21228264.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Performance-testing-tp21216266p21548160.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Hadoop Exceptions

2009-01-19 Thread Sandeep Dhawan

Here are a few hadoop exceptions that I am getting while running a mapred job
on 700MB of data on a 3 node cluster on the Windows platform (using cygwin):

1. 2009-01-08 17:54:10,597 INFO org.apache.hadoop.dfs.DataNode: writeBlock
blk_-4309088198093040326_1001 received exception java.io.IOException: Block
blk_-4309088198093040326_1001 is valid, and cannot be written to.
2009-01-08 17:54:10,597 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(10.120.12.91:50010,
storageID=DS-70805886-10.120.12.91-50010-1231381442699, infoPort=50075,
ipcPort=50020):DataXceiver: java.io.IOException: Block
blk_-4309088198093040326_1001 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:921)
at 
org.apache.hadoop.dfs.DataNode$BlockReceiver.init(DataNode.java:2364)
at
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1218)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1076)
at java.lang.Thread.run(Thread.java:619)

2. This particular job succeeded. Is it possible that this task was a
speculative execution and was killed before it could be started?
Exception in thread main java.lang.NullPointerException
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2195)

3. 2009-01-15 21:27:13,547 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_200901152118_0001_r_00_0 Merge of the inmemory files threw an
exception: java.io.IOException: Expecting a line not the end of stream
at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at
org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2105)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)

4. 2009-01-15 21:27:13,547 INFO org.apache.hadoop.mapred.ReduceTask:
In-memory merge complete: 47 files left.
2009-01-15 21:27:13,579 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
java.io.IOException: attempt_200901152118_0001_r_00_0The reduce copier
failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

5. Caused by: java.io.IOException: An established connection was aborted by
the software in your host machine
... 12 more

Can anyone help me by giving some pointers as to what could be the issue?

Thanks,
Sandeep


-- 
View this message in context: 
http://www.nabble.com/Hadoop-Exceptions-tp21548261p21548261.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Upgrading and patching

2009-01-19 Thread Philip
Thanks Brian,

I have just one more question:

When building my own release, where do I enter the version and "compiled by"
information?

Thanks,
Phil


On Fri, Jan 16, 2009 at 6:23 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

 Hey Philip,

 I've found it easier to download the release, apply the patches, and then
 re-build the release.  It's really pleasant to build the release.

 I suppose it's equivalent to check it out from SVN.

 Brian


 On Jan 16, 2009, at 1:46 PM, Philip wrote:

  Hello All,

 I'm currently trying to upgrade a hadoop 0.18.0 cluster to 0.19.  The
 wrinkle is that I would like to include
 https://issues.apache.org/jira/browse/HADOOP-4906 into the build as well.
 Would it be easier if I downloaded trunk and applied the patch or is there
 a
 branch that I can download with the patch already integrated and install
 onto my system?

 Thanks,
 Philip





Re: Maven repo for Hadoop

2009-01-19 Thread Owen O'Malley


On Jan 17, 2009, at 5:53 PM, Chanwit Kaewkasi wrote:


I would like to integrate Hadoop to my project using Ivy.
Is there any maven repository containing Hadoop jars that I can point
my configuration to?


Not yet, but soon. We recently introduced ivy into Hadoop, so I  
believe we'll upload the pom and jar for 0.20.0 when it is released.


-- Owen


Re: Performance testing

2009-01-19 Thread Jothi Padmanabhan
Hi, see answers inline below

HTH,
Jothi

 I would like to know:
 
 1. How can block size impact the performance of a mapred job.

From the M/R side, the fileSystem block size of the input files is treated
as an upper bound for input splits. Since each input split translates into
one map, this can affect the actual number of maps for the job.

 2. Does the performance improve if I setup NameNode and JobTracker on
 different machine. At present,
 I am running Namenode and JobTracker on the same machine as Master
 interconnected to 2 slave machines running Datanode and TaskTracker

Intuitively, it should help. Namenode is really memory intensive and the job
tracker could also be heavily loaded depending on the number of concurrent
jobs running and the number of maps and reducers of these jobs (for
scheduling).

 3. What should be the replication factor for a 3 node cluster

I think having a higher replication factor might not increase performance
for a 3 node cluster; if anything, it might degrade performance because of
the extra replication. If replication is only for performance and not for
availability/fault tolerance, you could try setting the replication factor
to a smaller number (1?).

 4. How does io.sort.mb impact the performance of the cluster

Look here
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
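
A quick sketch of where the knobs discussed above live — they are ordinary configuration properties, settable per job or cluster-wide in hadoop-site.xml; the values below are placeholders, not tuning recommendations:

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TuningSketch.class);
    conf.setInt("io.sort.mb", 200);                       // map-side sort buffer, in MB
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);   // block size for files this job writes
    conf.setInt("dfs.replication", 2);                    // replication for files this job writes
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}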

 
 Thanks,
 Sandeep 
 
 
 Brian Bockelman wrote:
 
 Hey Sandeep,
 
 I'd do a couple of things:
 1) Run your test.  Do something which will be similar to your actual
 workflow.
 2) Save the resulting Ganglia plots.  This will give you a hint as to
 where things are bottlenecking (memory, CPU, wait I/O).
 3) Watch iostat and find out the I/O rates during the test.  Compare
 this to the I/O rates of a known I/O benchmark (i.e., Bonnie++).
 4) Finally, watch the logfiles closely.  If you start to overload
 things, you'll usually get a pretty good indication from Hadoop where
 things go wrong.  Once something does go wrong, *then* look through
 the parameters to see what can be done.
 
 There's about a hundred things which can go wrong between the kernel,
 the OS, Java, and the application code.  It's difficult to make an
 educated guess beforehand without some hint from the data.
 
 Brian
 
 On Dec 31, 2008, at 1:30 AM, Sandeep Dhawan wrote:
 
 
 Hi Brian,
 
 That's what my issue is i.e. How do I ascertain the bottleneck or
 in other
 words if the results obtained after doing the performance testing
 are not
 up to the mark, then how do I find the bottleneck.
 
 How can we confidently say that OS and hardware are the culprits. I
 understand that by using the latest OS and hardware can improve the
 performance irrespective of the application but my real worry is
 What Next
 . How can I further increase the performance. What should I look
 for which
 can suggest or point the areas which can be potential problems or
 hotspot.
 
 Thanks for your comments.
 
 ~Sandeep~
 
 
 Brian Bockelman wrote:
 
 Hey Sandeep,
 
 I would warn against premature optimization: first, run your test,
 then see how far from your target you are.
 
 Of course, I'd wager you'd find that the hardware you are using is
 woefully underpowered and that your OS is 5 years old.
 
 Brian
 
 On Dec 30, 2008, at 5:57 AM, Sandeep Dhawan wrote:
 
 
 Hi,
 
 I am trying to create a hadoop cluster which can handle 2000 write
 requests
 per second.
 In each write request I would be writing a line of size 1KB in a file.
 
 I would be using machine having following configuration:
 Platform: Red Hat Linux 9.0
 CPU : 2.07 GHz
 RAM : 1GB
 
 Can anyone help in giving me some pointers/guideline as to how to go
 about
 setting up such a cluster.
 What are the configuration parameters in hadoop with which we can
 tweak to
 enhance the performance of the hadoop cluster.
 
 Thanks,
 Sandeep
 -- 
 View this message in context:
 http://www.nabble.com/Performance-testing-tp21216266p21216266.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 
 -- 
 View this message in context:
 http://www.nabble.com/Performance-testing-tp21216266p21228264.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 



Distributed Key-Value Databases

2009-01-19 Thread Philip (flip) Kromer
Hey y'all,

There've been a few questions about distributed database solutions (a
partial list: HBase, Voldemort, Memcached, ThruDB, CouchDB, Ringo, Scalaris,
Kai, Dynomite, Cassandra, Hypertable, as well as the closed Dynamo,
BigTable, SimpleDB).

For someone using Hadoop at scale, what problem aspects would recommend one
of those over another?
And in your subjective judgement, do any of these seem especially likely to
succeed?

Richard Jones of Last.fm just posted an overview with a great deal of
engineering insight:

http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
His focus is a production web server farm, and so in some ways orthogonal to
the crowd here -- but still highly recommended.  Swaroop CH of Yahoo wrote a
broad introduction to distributed DBs I also found useful:
  http://www.swaroopch.com/notes/Distributed_Storage_Systems

Both give HBase short shrift, though my impression is that it is the leader
among open projects for massive unordered dataset problems. The answer also,
though, doesn't seem to be a simple "If you're using Hadoop you should be
using HBase, dummy."

I don't have the expertise to write this kind of overview from the hadoop /
big data perspective, but would eagerly read such an article from someone who
does, or a summary of the insights of the list.

===

In lieu of such a summary, here are pointers to a few relevant threads:
*
http://www.nabble.com/Why-is-scaling-HBase-much-simpler-then-scaling-a-relational-db--tt18869660.html#a19093685

  (especially Jonathan Gray's breakdown)
* HBase Performance
http://www.mail-archive.com/hadoop-u...@lucene.apache.org/msg02540.html
  (and the paper by Stonebraker and friends:
http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf)
*
http://www.nabble.com/Serving-contents-of-large-MapFiles-SequenceFiles-from-memory-across-many-machines-tt19546012.html#a19574917
* On specific problem domains:
  http://www.nabble.com/Indexed-Hashtables-tt21470024.html#a21470848

http://www.nabble.com/Why-can%27t-Hadoop-be-used-for-online-applications---tt19461962.html#a19471894
  http://www.nabble.com/Architecture-question.-tt21100766.html#a21100766

flip

(noted in passing: a huge proportion of the development seems to be coming
out of commercial enterprises and not the academic/HPC community. I worry my
ivory tower is hung up on big iron and the top500.org list, at the expense
of solving the many interesting problems these unlock.)
-- 
http://www.infochimps.org
Connected Open Free Data


hadoop balancing data

2009-01-19 Thread Billy Pearson
Why do we not use the Remaining % in place of the Used % when we are
selecting a datanode for new data and when running the balancer?
From what I can tell we are using the Used % and we do not factor in non-DFS
Used at all.
I see a datanode with only a 60GB hard drive fill up completely, 100%, before
the other servers that have 130+GB hard drives get half full.
Seems like trying to keep the same % free on the drives in the cluster would
be more optimal in production.

I know this still may not be perfect, but it would be nice if we tried.

Billy
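
A small sketch with made-up numbers illustrating Billy's point: ranking nodes by DFS Used % ignores non-DFS usage, while Remaining % does not.

public class BalancerMetricSketch {
  static double usedPct(long dfsUsed, long capacity) {
    return 100.0 * dfsUsed / capacity;
  }
  static double remainingPct(long capacity, long dfsUsed, long nonDfsUsed) {
    return 100.0 * (capacity - dfsUsed - nonDfsUsed) / capacity;
  }
  public static void main(String[] args) {
    long GB = 1L << 30;
    // Node A: small 60GB disk with 20GB of non-DFS data already on it
    long capA = 60 * GB, dfsA = 10 * GB, nonDfsA = 20 * GB;
    // Node B: 130GB disk mostly dedicated to DFS
    long capB = 130 * GB, dfsB = 30 * GB, nonDfsB = 5 * GB;
    System.out.printf("A: used %.1f%%, remaining %.1f%%%n",
        usedPct(dfsA, capA), remainingPct(capA, dfsA, nonDfsA));
    System.out.printf("B: used %.1f%%, remaining %.1f%%%n",
        usedPct(dfsB, capB), remainingPct(capB, dfsB, nonDfsB));
    // A looks less "used" (16.7% vs 23.1%) but has far less room left
    // (50.0% vs 73.1%), so it fills up first.
  }
}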