Inputs of Mapreduce

2010-07-13 Thread Khaled BEN BAHRI

Hello to all

I'm new to working with MapReduce and I'm developing a MapReduce
function that takes XML documents as input.

How can I prepare the input files and specify them to the map function?

Thanks for your help

Best regards
Khaled



access to hdfs by web browser

2010-07-13 Thread Khaled BEN BAHRI

hi to all

I configured Hadoop and all the daemons are running, but when I try to access
HDFS through a web browser it fails.


thanks for your help.

best regards
khaled



Re: access to hdfs by web browser

2010-07-13 Thread Zhang Jianfeng
What messages are shown?

jzhang

On Tue, Jul 13, 2010 at 5:15 PM, Khaled BEN BAHRI 
khaled.ben_ba...@it-sudparis.eu wrote:

 hi to all

 I configured Hadoop and all the daemons are running, but when I try to access
 HDFS through a web browser it fails.

 thanks for your help.

 best regards
 khaled




Re: Inputs of Mapreduce

2010-07-13 Thread edward choi
Khaled,

Hadoop MapReduce's default input format reads files line by line.
XML documents usually span multiple lines.
So you will either have to pack each XML document onto a single line,
or write your own input format, which a MapReduce guide book can walk
you through.
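
As a rough illustration of the custom-input-format route, here is a minimal
driver sketch against the 0.20 mapreduce API. XmlInputFormat is a placeholder
for whatever InputFormat you write or borrow (nothing by that name ships with
Hadoop core), and the LongWritable/Text key/value types are an assumption:

// Rough sketch only: XmlInputFormat is a hypothetical custom format,
// assumed to hand each whole XML document to the mapper as a Text value.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlJobDriver {

  // Trivial mapper: emits one count per XML document it receives.
  public static class CountMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    protected void map(LongWritable key, Text xmlDoc, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(new Text("docs"), new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "xml example");
    job.setJarByClass(XmlJobDriver.class);
    job.setInputFormatClass(XmlInputFormat.class); // replaces the default line-based TextInputFormat
    job.setMapperClass(CountMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // directory of XML files in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}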

2010/7/13 Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu

 Hello to all

 I'm new to working with MapReduce and I'm developing a MapReduce
 function that takes XML documents as input.

 How can I prepare the input files and specify them to the map function?

 Thanks for help

 Best regards
 Khaled




Please help...

2010-07-13 Thread Tonci Buljan
Please help me, I can't figure out how to fix this problem.
I have a cluster of virtual machines under VMware (Windows XP is the host
OS):

Ubuntu 8.10
Intel Pentium DUAL CPU E2180 @ 2 GHZ
Memory 1024 MB

I have a namenode and 8 more datanodes.
I want to run the teragen and terasort programs and do a benchmark analysis of
the cluster running 1, 3, and all 8 datanodes.
Each datanode has only 20GB of configured HDFS capacity, so roughly 150GB in
total.
I have no problem generating the input data with 2 or 8 maps, but the problem
shows up with terasort. When it reaches the reduce phase, it produces the
following error:

10/07/13 10:59:40 INFO mapred.JobClient: Task Id :
attempt_201007131052_0002_r_00_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.


As I understand it, I have to set these parameters in mapred-site.xml to
override the default values:

<property>
  <name>mapred.map.tasks</name>
  <value>?</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>?</value>
</property>

Does anyone know how to set the number of reducers so that it works? :)

Thank you...
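
For illustration only: the reducer count can also be set per job from the
driver, rather than cluster-wide in mapred-site.xml. A sketch against the
0.20 mapreduce API, with placeholder class and job names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder job: only the reducer-count call matters for this sketch.
    Job job = new Job(new Configuration(), "benchmark job");
    job.setNumReduceTasks(8);  // e.g. roughly one reducer per datanode; tune for the cluster
    // ... set mapper, reducer, input and output as usual before submitting ...
  }
}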


Re: Inputs of Mapreduce

2010-07-13 Thread Paul Ingles
We tried using the Hadoop Streaming XML record reader a while ago and it didn't quite
go as expected. I don't remember exactly why, but it gave some weird results: dropping
some records, getting to 98% complete and then stalling, etc.

The Mahout project also has an XmlInputFormat [1] that we ended up using. I 
also posted something on my blog about it all [2], and a little about my 
understanding (so far) of input formats and record readers etc.

Hope that helps,
Paul

1. 
http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java
2. http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
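
If memory serves, that XmlInputFormat picks up its record boundaries from two
configuration keys. A sketch of the driver-side wiring, in a driver like the
one sketched earlier in this thread; the key names and tags below are recalled
from memory, so double-check them against the source linked in [1]:

// Sketch only: configure the start/end tags XmlInputFormat splits records on.
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<record>");  // assumed start-tag key; verify against [1]
conf.set("xmlinput.end", "</record>");   // assumed end-tag key; verify against [1]
Job job = new Job(conf, "mahout xml input");
job.setInputFormatClass(XmlInputFormat.class);  // org.apache.mahout.classifier.bayes.XmlInputFormat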

On 13 Jul 2010, at 12:26, Shuja Rehman wrote:

 Hi Khaled,
 XML files can be processed using Hadoop Streaming. Check out the following
 link.
 
 http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
 
 Regards
 Shuja
 
 On Tue, Jul 13, 2010 at 2:24 PM, edward choi mp2...@gmail.com wrote:
 
 Khaled,
 
 Hadoop MapReduce's default input format reads files line by line.
 XML documents usually span multiple lines.
 So you will either have to pack each XML document onto a single line,
 or write your own input format, which a MapReduce guide book can walk
 you through.
 
 2010/7/13 Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu
 
 Hello to all
 
 I'm new to working with MapReduce and I'm developing a MapReduce
 function that takes XML documents as input.

 How can I prepare the input files and specify them to the map function?
 
 Thanks for help
 
 Best regards
 Khaled
 
 
 
 
 
 
 -- 
 Regards
 Shuja-ur-Rehman Baig
 _
 MS CS - School of Science and Engineering
 Lahore University of Management Sciences (LUMS)
 Sector U, DHA, Lahore, 54792, Pakistan
 Cell: +92 3214207445



using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread Nathan Grice
We are trying to load data into HDFS from one of the slaves, and when the put
command is run from a slave (datanode), all of the blocks are written to that
datanode's HDFS storage and not distributed to all of the nodes in the cluster. It
does not seem to matter what destination format we use (/filename vs.
hdfs://master:9000/filename); it always behaves the same.
Conversely, running the same command from the namenode distributes the files
across the datanodes.

Is there something I am missing?

-Nathan


Setting different hadoop-env.sh for DataNode, TaskTracker

2010-07-13 Thread Matt Pouttu-Clarke
Can anyone suggest a way to set different hadoop-env.sh values for DataNode
and TaskTracker without having to duplicate the whole Hadoop conf directory?
For example, to set a different HADOOP_NICENESS for DataNode and
TaskTracker.

TIA
Matt Pouttu-Clarke





Re: Setting different hadoop-env.sh for DataNode, TaskTracker

2010-07-13 Thread Edward Capriolo
On Tue, Jul 13, 2010 at 10:46 AM, Matt Pouttu-Clarke
matt.pouttu-cla...@icrossing.com wrote:
 Can anyone suggest a way to set different hadoop-env.sh values for DataNode
 and TaskTracker without having to duplicate the whole Hadoop conf directory?
 For example, to set a different HADOOP_NICENESS for DataNode and
 TaskTracker.

 TIA
 Matt Pouttu-Clarke





hadoop-env.sh is a script file, so you are free to write arbitrary shell code in it.

if [ "$HOSTNAME" = "server1" ]; then
  dothis
else
  dothat
fi

This allows you to push one file to all systems, but now you are managing
scripts rather than plain config files.


Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread C.V.Krishnakumar
Hi,
I am a newbie, and I am curious: how did you discover that all the blocks are
written to that datanode's HDFS storage? I thought replication by the namenode was
transparent. Am I missing something?
Thanks,
Krishna
On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:

 We are trying to load data into hdfs from one of the slaves and when the put
 command is run from a slave(datanode) all of the blocks are written to the
 datanode's hdfs, and not distributed to all of the nodes in the cluster. It
 does not seem to matter what destination format we use ( /filename vs
 hdfs://master:9000/filename) it always behaves the same.
 Conversely, running the same command from the namenode distributes the files
 across the datanodes.
 
 Is there something I am missing?
 
 -Nathan



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread Nathan Grice
To test the block distribution, run the same put command from the NameNode
and then again from the DataNode.
Check the HDFS filesystem after both commands. In my case, a 2GB file was
distributed mostly evenly across the datanodes when the put was run on the
NameNode, but ended up only on the DataNode where I ran the put command.

On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar cvkrishnaku...@me.com wrote:

 Hi,
 I am a newbie. I am curious to know how you discovered that all the blocks
 are written to datanode's hdfs? I thought the replication by namenode was
 transparent. Am I missing something?
 Thanks,
 Krishna
 On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:

  We are trying to load data into hdfs from one of the slaves and when the put
  command is run from a slave(datanode) all of the blocks are written to the
  datanode's hdfs, and not distributed to all of the nodes in the cluster. It
  does not seem to matter what destination format we use ( /filename vs
  hdfs://master:9000/filename) it always behaves the same.
  Conversely, running the same command from the namenode distributes the files
  across the datanodes.
 
  Is there something I am missing?
 
  -Nathan




Re: WARN util.NativeCodeLoader: Unable to load native-hadoop library

2010-07-13 Thread Allen Wittenauer

On Jul 13, 2010, at 7:17 AM, Some Body wrote:
 I followed the steps from the native library guide 

We need to rewrite that guide.  It is pretty clear that we have overloaded the
term "native libraries" enough that no one understands what anyone else is
talking about.


 1. put the OS's libz libs in 
  [r...@namenode]# pwd
 /opt/hadoop/lib/native
 
  [r...@namenode]# find . -name '*libz*' 
 ./Linux-amd64-64/libz.so.1
 ./Linux-amd64-64/libz.so.1.2.1.2
 ./Linux-amd64-64/libz.so
 ./Linux-i386-32/libz.so.1
 ./Linux-i386-32/libz.so.1.2.1.2
 ./Linux-i386-32/libz.so 


You should have libhadoop there and it should be linked to libz.  Run ldd 
against libhadoop and see what comes out.



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread Allen Wittenauer

When you write on a machine running a datanode process, the data is *always* 
written locally first.  This is to provide an optimization to the MapReduce 
framework.   The lesson here is that you should *never* use a datanode machine 
to load your data.  Always do it outside the grid.

Additionally, you can use 'hadoop fsck <filename> -files -locations -blocks' to see
where those blocks have been written.

On Jul 13, 2010, at 9:45 AM, Nathan Grice wrote:

 To test the block distribution, run the same put command from the NameNode
 and then again from the DataNode.
 Check the HDFS filesystem after both commands. In my case, a 2GB file was
 distributed mostly evenly across the datanodes when put was run on the
 NameNode, and then put only on the DataNode where I ran the put command
 
 On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar 
 cvkrishnaku...@me.com wrote:
 
 Hi,
 I am a newbie. I am curious to know how you discovered that all the blocks
 are written to datanode's hdfs? I thought the replication by namenode was
 transparent. Am I missing something?
 Thanks,
 Krishna
 On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:
 
  We are trying to load data into hdfs from one of the slaves and when the put
  command is run from a slave(datanode) all of the blocks are written to the
  datanode's hdfs, and not distributed to all of the nodes in the cluster. It
  does not seem to matter what destination format we use ( /filename vs
  hdfs://master:9000/filename) it always behaves the same.
  Conversely, running the same command from the namenode distributes the files
  across the datanodes.
 
 Is there something I am missing?
 
 -Nathan
 
 



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-13 Thread C.V.Krishnakumar
Oh. Thanks for the reply.
Regards,
Krishna
On Jul 13, 2010, at 9:51 AM, Allen Wittenauer wrote:

 
 When you write on a machine running a datanode process, the data is *always* 
 written locally first.  This is to provide an optimization to the MapReduce 
 framework.   The lesson here is that you should *never* use a datanode 
 machine to load your data.  Always do it outside the grid.
 
 Additionally, you can use fsck (filename) -files -locations -blocks to see 
 where those blocks have been written.  
 
 On Jul 13, 2010, at 9:45 AM, Nathan Grice wrote:
 
 To test the block distribution, run the same put command from the NameNode
 and then again from the DataNode.
 Check the HDFS filesystem after both commands. In my case, a 2GB file was
 distributed mostly evenly across the datanodes when put was run on the
 NameNode, and then put only on the DataNode where I ran the put command
 
 On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar 
 cvkrishnaku...@me.com wrote:
 
 Hi,
 I am a newbie. I am curious to know how you discovered that all the blocks
 are written to datanode's hdfs? I thought the replication by namenode was
 transparent. Am I missing something?
 Thanks,
 Krishna
 On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:
 
  We are trying to load data into hdfs from one of the slaves and when the put
  command is run from a slave(datanode) all of the blocks are written to the
  datanode's hdfs, and not distributed to all of the nodes in the cluster. It
  does not seem to matter what destination format we use ( /filename vs
  hdfs://master:9000/filename) it always behaves the same.
  Conversely, running the same command from the namenode distributes the files
  across the datanodes.
 
 Is there something I am missing?
 
 -Nathan
 
 
 



Why Hadoop's release process has slowed down???

2010-07-13 Thread ruslan usifov
Hello

As mentioned in http://wiki.apache.org/hadoop/Hbase/HBaseVersions, Hadoop's
development has slowed down. I am interested in why this has happened, if it's true.


Debugging hadoop core

2010-07-13 Thread Pramy Bhats
Hi,

I am trying to debug the newly built hadoop-core-dev.jar in Eclipse. To
simplify the debugging process, I first set up Hadoop in single-node mode
on my localhost.


a) Configure a debug configuration in Eclipse:

under the Main tab:
  project: hadoop-all
  main class: org.apache.hadoop.util.RunJar

under the Arguments tab:
  program arguments: <absolute path to the wordcount jar>/wordcount.jar
    org.wordcount.WordCount  <input text file already in HDFS> (text)
    <desired output file> (output)
  VM arguments: -Xmx256M

under the Classpath tab:
  user entries: add an external jar (hadoop-0.20.3-core-dev.jar), so that I
  can debug my newly built hadoop core jar.

under the Source tab:
  I add the source folder for the wordcount example (so the debugger can
  look up sources while stepping).


I apply this configuration and start the debug process.


b) The debugging works fine, and I can perform all debug operations.
However, I get the following problem:

2010-07-14 00:02:15,816 WARN  conf.Configuration (Configuration.java:<clinit>(176)) -
DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is
deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override
properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
2010-07-14 00:02:16,535 INFO  jvm.JvmMetrics (JvmMetrics.java:<init>(71)) -
Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
Input path does not exist: file:/home/hadoop/code/hadoop-0.20.2/text
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
  at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
  at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
  at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
  at org.selfadjust.wordcount.WordCount.run(WordCount.java:32)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.selfadjust.wordcount.WordCount.main(WordCount.java:43)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


However, the file named 'text' is already stored in HDFS.


Could you please help me with the debugging process here? Any pointers on the
debugging environment would be very helpful.


thanks,

--PB


Re: Debugging hadoop core

2010-07-13 Thread Pramy Bhats
Hello,

When I copy the input file 'text' to /home/hadoop/code/hadoop-0.20.2/text,
the debugging works fine, except that Hadoop reads and writes in the local
file system. This is because the parameters I specified don't say that the
file is in HDFS; they simply give the input and output filenames, so while
debugging, Hadoop reads and writes from the local file system.


How can I specify the input and output filenames as absolute HDFS paths
for debugging purposes?


thanks,
--PB
On Wed, Jul 14, 2010 at 12:07 AM, Pramy Bhats pramybh...@googlemail.com wrote:

 Hi,

 I am trying to debug the new built hadoop-core-dev.jar in Eclipse. To
 simplify the debug process, firstly I setup the Hadoop in single-node mode
 on my localhost.


 [...]





Thoughts about Hadoop cluster hardware

2010-07-13 Thread u235sentinel
So we're talking to Dell about their new PowerEdge C2100 servers for a
Hadoop cluster, but I'm wondering: isn't this still a little overboard for
nodes in a cluster? I'm wondering what would happen if we bought, say, 100
PowerEdge 2750s instead of just 50 C2100s. The price would be about the same
for the configuration we're discussing, and we would get twice as many
nodes.

I'm curious whether any others are running Dell PowerEdge servers with Hadoop.

We've also been kicking around the idea of going with blade servers
(Dell and/or HP).


Just curious

Thanks!!


Re: Debugging hadoop core

2010-07-13 Thread Ted Yu
Find the hadoop-site.xml which Eclipse claimed was in your classpath.
In the same directory, look for core-site.xml and add the following:

<property>
  <name>fs.default.name</name>
  <value>hdfs://sjc9-flash-grid04.ciq.com:9000</value>
</property>
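
For illustration, roughly the same effect can be had from the driver code
itself. A sketch that assumes a single-node HDFS on the default port (the
namenode address and paths are placeholders), with the usual
org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.Path imports:

// Sketch only: make bare paths resolve against HDFS rather than file:///.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");            // (a) point at the namenode, or
Path in  = new Path("hdfs://localhost:9000/user/hadoop/text");   // (b) pass fully-qualified
Path out = new Path("hdfs://localhost:9000/user/hadoop/output"); //     hdfs:// URIs as arguments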

On Tue, Jul 13, 2010 at 3:07 PM, Pramy Bhats pramybh...@googlemail.com wrote:

 Hi,

 I am trying to debug the new built hadoop-core-dev.jar in Eclipse. To
 simplify the debug process, firstly I setup the Hadoop in single-node mode
 on my localhost.


 [...]



Re: Thoughts about Hadoop cluster hardware

2010-07-13 Thread Allen Wittenauer

On Jul 13, 2010, at 5:00 PM, u235sentinel wrote:

 So we're talking to Dell about their new PowerEdge c2100 servers for a Hadoop 
 cluster but I'm wondering.  Isn't this still a little overboard for nodes in 
 a cluster?  I'm wondering if we bought say 100 poweredge 2750's instead of 
 just 50 c2100's.  The price would be about the same for the configuration 
 we're talking about and we would get twice as many nodes.

Ultimately, it depends upon your job flow and how much data you have.  

FWIW we're currently using a Sun equivalent of the C2100s with 8 of the 12 drive
slots filled.  You need a *LOT* of iops to make it worthwhile.  [From what
I've seen, even people who think they have a lot of iops generally have other
problems with their code/tuning that are causing the iops.  So even if you
think you have a lot, you may not.]

 I'm curious if any other's are running Dell PowerEdge servers with Hadoop.
 
 We've also been kicking the idea around of going with blade servers (Dell 
 and/or HP).

If you are thinking of a traditional blade setup where storage comes mainly from NAS or
SAN, you are going to be very, very unhappy unless your data set is very, very
tiny.

Check out the PoweredBy page on the wiki.  Quite a few folks list their gear. 
FWIW, we're currently evaluating HP SLs and should be getting some Dell C6100s 
in soon, assuming Dell can deliver the eval unit on time.