RE: Stop MR jobs after N records have been produced ?

2008-09-05 Thread Arash Rajaiyan
Well, well!
Hello, dear Amir.
I see that you are also interested in Hadoop :D

--- On Fri, 9/5/08, Amir Youssefi [EMAIL PROTECTED] wrote:
From: Amir Youssefi [EMAIL PROTECTED]
Subject: RE: Stop MR jobs after N records have been produced ?
To: core-user@hadoop.apache.org
Date: Friday, September 5, 2008, 3:48 AM

Also see the following JIRA for more discussion:

http://issues.apache.org/jira/browse/HADOOP-3973

I will close this, as the new interface will address the issue.

- Amir 

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] On Behalf Of Owen
O'Malley
Sent: Thursday, September 04, 2008 2:13 PM
To: core-user@hadoop.apache.org
Subject: Re: Stop MR jobs after N records have been produced ?


On Sep 4, 2008, at 12:48 PM, Tarandeep Singh wrote:

 Can I stop Map-Reduce jobs after mappers (or reducers) have produced N records?

You could do this pretty easily by implementing a custom MapRunnable.  
There is no equivalent for reduces. The interface proposed in
HADOOP-1230 would support that kind of application. See:

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/
hadoop/mapreduce/

Look at the new Mapper and Reducer interfaces.

-- Owen
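For the archive, a minimal sketch of the MapRunnable approach Owen describes, written against the old org.apache.hadoop.mapred API of that era. The class name and the my.max.output.records property are invented for this example; it illustrates the technique, not the HADOOP-1230 interface.

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

// A MapRunnable that stops the map loop once the wrapped mapper has emitted
// roughly N records (it can overshoot by whatever the last map() call emits).
public class LimitedMapRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {

  private Mapper<K1, V1, K2, V2> mapper;
  private long maxRecords;
  private long emitted = 0;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    mapper = (Mapper<K1, V1, K2, V2>) ReflectionUtils.newInstance(job.getMapperClass(), job);
    maxRecords = job.getLong("my.max.output.records", Long.MAX_VALUE);
  }

  public void run(RecordReader<K1, V1> input, final OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    // Wrap the real collector so we can count what the mapper emits.
    OutputCollector<K2, V2> counting = new OutputCollector<K2, V2>() {
      public void collect(K2 key, V2 value) throws IOException {
        output.collect(key, value);
        emitted++;
      }
    };
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (emitted < maxRecords && input.next(key, value)) {
        mapper.map(key, value, counting, reporter);
      }
    } finally {
      mapper.close();
    }
  }
}

A job would plug it in with conf.setMapRunnerClass(LimitedMapRunner.class) and set the limit with conf.setLong("my.max.output.records", N); as Owen says, there is no equivalent hook on the reduce side.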



  

Sharing Memory across Map tasks [multiple cores] running on the same machine

2008-09-05 Thread Amit Kumar Singh
Can we use something like RAM FS to share static data across map tasks?

Scenario:
1) Quad-core machine
2) 2 x 1-TB disks
3) 8 GB RAM

Now I need ~2.7 GB of RAM per map process to load some static data into memory,
using which I would be processing data (CPU-intensive jobs).

Can I share memory across mappers on the same machine so that the memory
footprint is smaller and I can run more than 4 mappers simultaneously,
utilizing all 4 cores?

Can we use stuff like RamFS?



Re: Sharing Memory across Map tasks [multiple cores] running on the same machine

2008-09-05 Thread Devaraj Das

Hadoop doesn't support this natively, so if you need this kind of functionality
you'd need to code your application that way. But I am worried about the race
conditions in determining which task should first create the ramfs and load the data.
If you can atomically determine whether the ramfs has been created and the data
loaded, and do the creation/load if not, then things should work (a sketch of that
approach follows below).
If atomicity cannot be guaranteed, you might consider this:
1) Run a map-only job that creates the ramfs and loads the data (if your cluster
is small you can do this manually). You can use the distributed cache to store
the data you want to load.
2) Run your job that processes the data.
3) Run a third job to delete the ramfs.
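A rough sketch of the atomic create-then-load idea, assuming a ramfs/tmpfs is already mounted at /dev/shm and the data file has been shipped via the distributed cache; the paths and class name are made up, and Hadoop itself provides none of this:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Put one copy of a distributed-cache file on a node-local ramfs so concurrent
// map tasks on the same machine share it, instead of each loading 2.7 GB on-heap.
// File.mkdir() is atomic on the local filesystem, so the first task to create the
// marker directory does the copy; the rest wait for a done flag. (A crashed winner
// would leave the others waiting -- good enough for a sketch, not for production.)
public abstract class SharedDataMapperBase extends MapReduceBase {

  protected File sharedData;

  public void configure(JobConf job) {
    try {
      File ramDir = new File("/dev/shm/static-data");   // assumed mount point
      File data   = new File(ramDir, "data");
      File done   = new File(ramDir, "_done");
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      if (ramDir.mkdir()) {                             // atomic: only one winner
        copy(new File(cached[0].toString()), data);
        done.createNewFile();
      } else {
        while (!done.exists()) {                        // someone else is copying
          Thread.sleep(1000);
        }
      }
      sharedData = data;
    } catch (Exception e) {
      throw new RuntimeException("could not set up shared static data", e);
    }
  }

  private static void copy(File src, File dst) throws IOException {
    FileInputStream in = new FileInputStream(src);
    FileOutputStream out = new FileOutputStream(dst);
    byte[] buf = new byte[64 * 1024];
    int n;
    while ((n = in.read(buf)) > 0) {
      out.write(buf, 0, n);
    }
    in.close();
    out.close();
  }
}

The maps would then mmap or stream sharedData instead of deserializing it onto each heap; the saving comes from the single copy sitting in the ramfs.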


On 9/5/08 1:29 PM, Amit Kumar Singh [EMAIL PROTECTED] wrote:

 Can we use something like RAM FS to share static data across map tasks?
 
 Scenario:
 1) Quad-core machine
 2) 2 x 1-TB disks
 3) 8 GB RAM
 
 Now I need ~2.7 GB of RAM per map process to load some static data into memory,
 using which I would be processing data (CPU-intensive jobs).
 
 Can I share memory across mappers on the same machine so that the memory
 footprint is smaller and I can run more than 4 mappers simultaneously,
 utilizing all 4 cores?
 
 Can we use stuff like RamFS?
 




Re: Integrate HADOOP and Map/Reduce paradigm into HPC environment

2008-09-05 Thread Hemanth Yamijala

Filippo Spiga wrote:

This procedure allows me to
- use persistent HDFS on the whole cluster, placing the namenode on the frontend
(always up and running) and datanodes on the other nodes
- submit a lot of jobs to the resource manager transparently without any problem
and manage job priority/reservation with MAUI as simply as other classical
HPC jobs
- execute the jobtracker and tasktracker services on the nodes chosen by TORQUE
(in particular, the first node selected becomes the jobtracker)
- store logs for different users in separate directories
- run only one job at a time (but probably multiple map/reduce jobs can run
together, because different jobs use different subsets of nodes)

Probably HOD does what I can do with my raw script... it's possible that I
don't understand the user guide well...

  
Filippo, HOD indeed allows you to do all these things, and a little bit
more. On the other hand, your script always executes the jobtracker on the
first node, which also seems useful to me. It would be nice if you could
still try HOD and see if it makes your life simpler in any way. :-)


Sorry for my english :-P

Regards

2008/9/2 Hemanth Yamijala [EMAIL PROTECTED]

  

Allen Wittenauer wrote:



On 8/18/08 11:33 AM, Filippo Spiga [EMAIL PROTECTED] wrote:


  

Well, but I haven't understood how I should configure HOD to work in this
manner.

For HDFS I follow this sequence of steps:
- conf/masters contains only the master node of my cluster
- conf/slaves contains all nodes
- I start HDFS using bin/start-dfs.sh




   Right, fine...



  

Potentially I would like to allow all nodes to be used for MapReduce.
Which parameter should I set in contrib/hod/conf/hodrc for HOD? Should I
change only the gridservice-hdfs section?




   I was hoping the HOD folks would answer this question for you, but they
are apparently sleeping. :)



  

Whoops! Sorry, I missed this.



   Anyway, yes, if you point gridservice-hdfs to a static HDFS,  it should
use that as the -default- HDFS. That doesn't prevent a user from using HOD
to create a custom HDFS as part of their job submission.



  

Allen's answer is perfect. Please refer to
http://hadoop.apache.org/core/docs/current/hod_user_guide.html#Using+an+external+HDFS
for more information about how to set up the gridservice-hdfs section to
use a static or
external HDFS.







  





Re: Hadoop for computationally intensive tasks (no data)

2008-09-05 Thread Steve Loughran

Tenaali Ram wrote:

Hi,

I am new to Hadoop. What I have understood so far is: Hadoop is used to
process huge amounts of data using the map-reduce paradigm.

I am working on a problem where I need to perform a large number of
computations; most computations can be done independently of each other (so
I think each mapper can handle one or more such computations). However, there
is no data involved. It's just a number-crunching job. Is it suited for Hadoop?



Well, you can have the MR jobs stick data out into the filesystem. So
even though they don't start off co-located with any data, they end up
running where the output needs to go.



Has anyone used hadoop for merely number crunching? If yes, how should I
define input for the job and ensure that computations are distributed to all
nodes in the grid?


The current scheduler moves work to near where the data sources are,
going for the same machine or same rack, looking for a task tracker with
a spare slot. There isn't yet any scheduler that worries more about
pure computation, where you need to consider current CPU load, memory
consumption and power budget - whether your rack is running so hot it's at
risk of being shut down, or at least throttled back. That's the kind of
scheduling where the e-science and grid toolkit people have the edge.


But now that the version of hadoop in SVN has support for plug-in 
scheduling, someone has the opportunity to write a new scheduler, one 
that focuses on pure computation...
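On Tenaali's question of how to define input when there is none: one common trick is to fabricate it, one tiny input file per computation, so the default splitting gives you one map per file. A sketch, with the path and parameter format invented for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Writes numTasks one-line files; each file is its own split, hence its own map.
public class WorkGenerator {
  public static Path generateInputs(JobConf conf, int numTasks) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path("/tmp/crunch-input");   // invented location
    fs.mkdirs(inputDir);
    for (int i = 0; i < numTasks; i++) {
      FSDataOutputStream out = fs.create(new Path(inputDir, "task-" + i));
      out.writeBytes("seed=" + i + "\n");            // one parameter line per map
      out.close();
    }
    return inputDir;
  }
}

The mapper reads its single parameter line, does the number crunching, and emits the result through the OutputCollector; since no task has any locality to honour, the job tracker simply spreads the map tasks across the cluster.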





critical name node problem

2008-09-05 Thread Andreas Kostyrka
Hi!

My namenode has run out of space, and now I'm getting the following:

08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: 
failed to 
remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because 
it does not exist
08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at 
org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
at 
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at 
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at 
ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196

hadoop-0.17.1 btw.

What do I do now?

Andreas




contrib join package

2008-09-05 Thread Rahul Sood
Hi,

 

Is there any detailed documentation on the
org.apache.hadoop.contrib.utils.join package ? I have a simple Join task
consisting of 2 input datasets. Each contains tab-separated records.

 

Set1: Record format = field1\tfield2\tfield3\tfield4\tfield5

Set2: Record format = field1\tfield2\tfield3

 

Join criterion: Set1.field1 = Set2.field1

 

Output: Set2.field2\tSet1.field2\tSet1.field3\tSet1.field4

 

The org.apache.hadoop.contrib.utils.join package contains DataJoinMapperBase
and DataJoinReducerBase abstract classes, and a TaggedMapOutput class which
should be the base class for the mapper output values. But there aren't any
examples showing how these classes should be used to implement inner or
outer joins in a generic manner.

 

If anybody has used this package and would like to share their experience,
please let me know.

 

Thanks,

 

Rahul Sood

[EMAIL PROTECTED]

 



Re: JVM Spawning

2008-09-05 Thread Doug Cutting
LocalJobRunner allows you to test your code with everything running in a 
single JVM.  Just set mapred.job.tracker=local.


Doug
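A minimal sketch of that configuration, for anyone new to LocalJobRunner:

import org.apache.hadoop.mapred.JobConf;

// With mapred.job.tracker=local the whole job runs inside the submitting JVM
// (LocalJobRunner): no daemons to start and no per-task JVM spawning.
public class LocalRunExample {
  public static JobConf localConf() {
    JobConf conf = new JobConf();
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");   // usually paired with the local filesystem
    // ...set mapper, reducer, input/output paths, then JobClient.runJob(conf)
    return conf;
  }
}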

Ryan LeCompte wrote:

I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
upping/downing JVMs... My input is broken up into 5 text files; is
there a way I could start the job such that it only uses 1 map to
process the whole thing? I guess I'd have to concatenate the files
into 1 file and somehow turn off splitting?

Ryan


On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote:


Beginner's question:

If I have a cluster with a single node that has a max of 1 map/1
reduce, and the job submitted has 50 maps... Then it will process only
1 map at a time. Does that mean that it's spawning 1 new JVM for each
map processed? Or re-using the same JVM when a new map can be
processed?

It creates a new JVM for each task. Devaraj is working on
https://issues.apache.org/jira/browse/HADOOP-249
which will allow the jvms to run multiple tasks sequentially.

-- Owen



Re: critical name node problem

2008-09-05 Thread Andreas Kostyrka
OK, googling around a little bit, the solution seems to be to either delete the
edits file, which in my case would be non-cool (24 MB worth of edits in
there), or truncate it correctly.

So I used the following script to figure out how much data needs to be
dropped:

# Starting length in bytes; keep truncating one byte at a time until the
# namenode can replay the edits log without hitting the EOF.
LEN=25497570

while true
do
    # Copy the first $LEN bytes of the saved log (edits.org is the backup copy).
    dd if=edits.org of=edits bs=$LEN count=1
    time hadoop namenode
    # The namenode exits with 255 when it fails to load the edits.
    if [[ $? -ne 255 ]]
    then
        echo $LEN seems to have worked.
        exit 0
    fi
    LEN=$(expr $LEN - 1)
done

Guess something like this might make sense to add to
http://wiki.apache.org/hadoop/TroubleShooting -
not everyone will be able to figure out how to get rid of the last
incomplete record.

Another idea would be a tool or a namenode startup mode that would make it
ignore EOFExceptions, to recover as much of the edits as possible.

Andreas

On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote:
 Hi!

 My namenode has run out of space, and now I'm getting the following:

 08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete:
 failed to
 remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz
 because it does not exist
 08/09/05 09:23:22 INFO ipc.Server: Stopping server on 9000
 08/09/05 09:23:22 ERROR dfs.NameNode: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
 at
 org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
 at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:441)
 at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)
 at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)
 at
 org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
 at
 org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at
 org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
 at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
 at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
 at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
 at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

 08/09/05 09:23:22 INFO dfs.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at
 ec2-67-202-42-251.compute-1.amazonaws.com/10.251.39.196

 hadoop-0.17.1 btw.

 What do I do now?

 Andreas






Re: runtime change of hadoop.tmp.dir

2008-09-05 Thread 叶双明
Did you succeed by changing it in the config file?

2008/9/4, Deepak Diwakar [EMAIL PROTECTED]:

 Hi

  I am using hadoop in standalone mode. I want to change hadoop.tmp.dir at
  runtime.
  I found that there is a -D option on the command line to set a configuration
  parameter, but I did not succeed.

  It would be really helpful if somebody could give the exact syntax for
  that.

  Thanks & regards,
 --
 - Deepak Diwakar,
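For the record, -D is only honoured when the driver goes through ToolRunner/GenericOptionsParser. A minimal sketch of such a driver (the class name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// ToolRunner parses the generic options (-D, -conf, ...) and puts them
// into the Configuration handed to run().
public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), MyDriver.class);
    System.out.println("hadoop.tmp.dir = " + job.get("hadoop.tmp.dir"));
    // ...set mapper, reducer, input/output paths, then JobClient.runJob(job)
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

With that in place, something like bin/hadoop jar myjob.jar MyDriver -D hadoop.tmp.dir=/some/other/tmp in out should pick the value up; editing it in hadoop-site.xml remains the non-runtime alternative.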



Re: Timeouts at reduce stage

2008-09-05 Thread 叶双明
:)

2008/9/5, Doug Cutting [EMAIL PROTECTED]:

 Jason Venner wrote:

 We have modified the /main/ that launches the children of the task tracker
 to explicitly exit, in its finally block. That helps substantially.


 Have you submitted this as a patch?

 Doug
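For context, the pattern Jason describes amounts to something like this in the child's main (a sketch of the idea, not the actual patch):

// Force the child JVM down even if a non-daemon thread (user code, a
// library's thread pool) would otherwise keep it alive past the task timeout.
public class ChildMain {
  public static void main(String[] args) {
    try {
      runTask(args);            // placeholder for the real task body
    } catch (Throwable t) {
      t.printStackTrace();
      System.exit(1);           // the finally below is skipped once exit() is called
    } finally {
      System.exit(0);           // explicit exit on the normal path
    }
  }

  private static void runTask(String[] args) { /* ... */ }
}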



Re: critical name node problem

2008-09-05 Thread Allen Wittenauer



On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:

 Another idea would be a tool or namenode startup mode that would make it
 ignore EOFExceptions to recover as much of the edits as possible.

We clearly need to change the how-to-configure docs to make sure
people put at least two directories on two different storage systems in
dfs.name.dir. This problem seems to happen quite often, and having two or
more dirs helps protect against it.

We recently had one of the disks on one of our copies go bad.  The
system kept going just fine until we had a chance to reconfig the name node.

That said, I've just filed HADOOP-4080 to help alert admins in these
situations.



Re: contrib join package

2008-09-05 Thread Owen O'Malley
Please look at the examples in the source directory:
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/data_join/src/examples/org/apache/hadoop/contrib/utils/join/
Unfortunately, I don't know of any other documentation on it.

-- Owen
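In the absence of documentation, here is a plain tagged reduce-side join for Rahul's Set1/Set2 layout that skips the contrib classes altogether; the set1/set2 file-naming convention and the map.input.file lookup are assumptions about how the job would be set up:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TaggedJoin {

  // Tag each record with its source and key it on field1.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      String file = job.get("map.input.file", "");
      tag = file.contains("set1") ? "S1" : "S2";     // naming convention assumed
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] f = line.toString().split("\t", 2);   // f[0] = field1, f[1] = the rest
      out.collect(new Text(f[0]), new Text(tag + "\t" + f[1]));
    }
  }

  // Inner join: pair every Set1 record with every Set2 record sharing field1.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      List<String[]> set1 = new ArrayList<String[]>();
      List<String[]> set2 = new ArrayList<String[]>();
      while (values.hasNext()) {
        String[] v = values.next().toString().split("\t");
        ("S1".equals(v[0]) ? set1 : set2).add(v);
      }
      for (String[] s2 : set2) {
        for (String[] s1 : set1) {
          // Output: Set2.field2 \t Set1.field2 \t Set1.field3 \t Set1.field4
          out.collect(new Text(s2[1]), new Text(s1[1] + "\t" + s1[2] + "\t" + s1[3]));
        }
      }
    }
  }
}

This is an inner join; an outer join would also emit records whose opposite-side list comes up empty. The contrib DataJoin classes generalize the same tagging idea.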


Custom Writables

2008-09-05 Thread Ryan LeCompte
Hello,

Can a custom Writable object used as a key/value contain other
Writables, like MapWritable?

Thanks,
Ryan


Re: Custom Writables

2008-09-05 Thread Owen O'Malley
Yes, it is pretty easy to compose Writables. Just have write and
readFields call the nested types. Look at the way that JobStatus
(http://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/JobStatus.java)
calls jobid.write in its write method.

-- Owen
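A minimal sketch of such a composite Writable holding a MapWritable (the class and field names are invented):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A composite Writable: write() and readFields() simply delegate to the
// nested Writables, in the same order on both sides.
public class UserRecord implements Writable {
  private Text name = new Text();
  private MapWritable attributes = new MapWritable();

  public void write(DataOutput out) throws IOException {
    name.write(out);
    attributes.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    name.readFields(in);
    attributes.readFields(in);
  }
}

If it is to be used as a key it needs to implement WritableComparable and add a compareTo(); MapWritable is not itself comparable, so the comparison would usually go on the simpler fields.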


Re: Custom Writables

2008-09-05 Thread Owen O'Malley
On Fri, Sep 5, 2008 at 10:18 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 Thanks! Quick question on that particular class: why are the methods
 synchronized? I didn't think that key/value objects needed to be thread
 safe?


I picked a simple example out of map/reduce framework to demonstrate the
approach. The framework has a lot of threads and so needs synchronization on
its own data types. Your maps and reduces are only called by a single thread
and so probably don't need synchronization. (The framework won't call
readFields/write on your types from multiple threads either...)

-- Owen


Re: Compare data on HDFS side

2008-09-05 Thread Karl Anderson


On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:


Hello,

Does anyone know if it is possible to compare data on HDFS while avoiding
copying the data to a local box? I mean, if I'd like to find differences
between local text files I can use the diff command. If the files are on HDFS
then I have to get them from HDFS to the local box and only then do the diff.
Copying files to the local fs is a bit annoying and could be problematic
when the files are huge, say 2-5 GB.


You could always do this as a mapreduce task. diff --brief is
trivial; actually finding the diffs is left as an exercise for the
reader :) I'm currently doing a line-oriented diff of two files where
the order of the lines is unimportant, so I just have my reducer flag
lines that show up an odd number of times.



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
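For anyone digging this out of the archives later, a sketch of that odd-count approach (it assumes neither input file contains duplicate lines of its own):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OddLineDiff {

  // Key each record on the whole line.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      out.collect(line, ONE);
    }
  }

  // Flag lines that occur an odd number of times across the two inputs.
  public static class OddReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text line, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter r)
        throws IOException {
      int total = 0;
      while (counts.hasNext()) {
        total += counts.next().get();
      }
      if (total % 2 == 1) {       // present in only one of the two inputs
        out.collect(line, new IntWritable(total));
      }
    }
  }
}

Give both files (or directories) to the same job as input; every line the reducer emits is present in one input but not the other.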





Hadoop + Elastic Block Stores

2008-09-05 Thread Ryan LeCompte
Hello,

I was wondering if anyone has gotten far at all with getting Hadoop up
and running with EC2 + EBS? Any luck getting this to work in a way
that the HDFS runs on the EBS so that it isn't blown away every time
you bring up/down the EC2 Hadoop cluster? I'd like to experiment with
this next, and was curious if anyone had any luck. :)

Thanks!

Ryan


Re: rsync on 2 HDFS

2008-09-05 Thread Tsz Wo (Nicholas), Sze
Hi Deepika,

We have a utility called distcp - distributed copy.  Note that distcp itself is 
different from rsync.  However, distcp -delete is similar to rsync --delete.

distcp -delete is a new feature in 0.19.  See HADOOP-3939.  For more details 
about distcp, see http://hadoop.apache.org/core/docs/r0.18.0/distcp.html
(the doc is for 0.18, so it won't mention distcp -delete.  The 0.19 doc will 
be updated in HADOOP-3942.)

Nicholas Sze




- Original Message 
 From: Deepika Khera [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Friday, September 5, 2008 2:42:09 PM
 Subject: rsync on 2 HDFS
 
 Hi,
 
 
 
 I wanted to do an rsync --delete between data in 2 HDFS system
 directories. Do we have a utility that could do this?
 
 
 
 I am aware that HDFS does not allow partial writes. An alternative would
 be to write a program to generate the list of differences in paths and
 then use distcp to copy the files and delete the appropriate files.
 
 
 
 Any pointers to implementations (or partial implementations)?
 
 
 
 Thanks,
 
 Deepika



How do specify certain IP to be used by datanode/namenode

2008-09-05 Thread Kevin
Hi,

The machines I am using each have multiple network cards, and hence
multiple IP addresses. Some of them look like a default IP. However, I
wish to use some other IPs, and have tried modifying the hadoop-site.xml,
masters, and slaves files. But it looks like the IP still falls back to the
default one when Hadoop runs. Does anyone have an idea how I could
possibly make it work? Thank you!

-Kevin


Re: Master Recommended Hardware

2008-09-05 Thread Jean-Daniel Cryans
Camilo,

See http://wiki.apache.org/hadoop/NameNode

And see the discussion NameNode Hardware specs started here:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg04109.html

This should give you the basics.

Regards,

J-D

On Fri, Sep 5, 2008 at 10:31 PM, Camilo Gonzalez [EMAIL PROTECTED] wrote:

 Hi!

 Just wondering, there have been discussions/articles about Recommended
 Hardware for the Cluster Nodes (ex:
 http://wiki.apache.org/hadoop/MachineScaling). Have there been any
 discussions/articles about Recommended Hardware Configurations for the
 Master (Namenode) machine? (Any links to previous discussions are
 appreciated).

 It is my guess that this machine should have much more memory than the
 data nodes, but I don't know if this is necessarily true.

 Thanks in advance,

 --
 Camilo A. Gonzalez
 Ing. de Sistemas
 Tel: 300 657 96 96



Re: Hadoop + Elastic Block Stores

2008-09-05 Thread Jean-Daniel Cryans
Ryan,

I currently have a Hadoop/HBase setup that uses EBS. It works, but using EBS
implies an additional overhead of configuration (too bad you can't spawn
instances with volumes already attached to them, though I'm sure that'll come).
Shutting down instances and bringing others up also requires more
micro-management, but I think Tom White wrote about it and there was a link
to it in another discussion you were part of.

Hope this helps,

J-D

On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 Hello,

 I was wondering if anyone has gotten far at all with getting Hadoop up
 and running with EC2 + EBS? Any luck getting this to work in a way
 that the HDFS runs on the EBS so that it isn't blown away every time
 you bring up/down the EC2 Hadoop cluster? I'd like to experiment with
 this next, and was curious if anyone had any luck. :)

 Thanks!

 Ryan



Re: How do specify certain IP to be used by datanode/namenode

2008-09-05 Thread Jean-Daniel Cryans
Kevin,

Did you try changing the
dfs.datanode.dns.interface/dfs.datanode.dns.nameserver/mapred.tasktracker.dns.interface/mapred.tasktracker.dns.nameserver
parameters?

J-D

On Fri, Sep 5, 2008 at 8:14 PM, Kevin [EMAIL PROTECTED] wrote:

 Hi,

 The machines I am using each have multiple network cards, and hence
 multiple IP addresses. Some of them look like a default IP. However, I
 wish to use some other IPs, and have tried modifying the hadoop-site.xml,
 masters, and slaves files. But it looks like the IP still falls back to the
 default one when Hadoop runs. Does anyone have an idea how I could
 possibly make it work? Thank you!

 -Kevin



Re: help! how can i control special data to specific datanode?

2008-09-05 Thread Jean-Daniel Cryans
Hi,

I suggest that you read how data is stored in HDFS, see
http://hadoop.apache.org/core/docs/r0.18.0/hdfs_design.html

J-D

On Sat, Sep 6, 2008 at 12:11 AM, ZhiHong Fu [EMAIL PROTECTED] wrote:

 Hello. I'm a new user of Hadoop, and now I have a problem understanding
 HDFS.
 Here is the scenario: I have several databases and want to index them. So when
 I run the indexing map job, I have to control which database's index is stored
 on which datanode, so that when a database is updated, its index can be updated
 correspondingly.
 So how can I do this? Thanks.



Re: Hadoop + Elastic Block Stores

2008-09-05 Thread Ryan LeCompte
Good to know that you got it up and running. I'd really love to one day
see some scripts under src/contrib/ec2/bin that can set up/mount
the EBS volumes automatically. :-)




On Sep 5, 2008, at 11:38 PM, Jean-Daniel Cryans  
[EMAIL PROTECTED] wrote:



Ryan,

I currently have a Hadoop/HBase setup that uses EBS. It works, but using EBS
implies an additional overhead of configuration (too bad you can't spawn
instances with volumes already attached to them, though I'm sure that'll come).
Shutting down instances and bringing others up also requires more
micro-management, but I think Tom White wrote about it and there was a link
to it in another discussion you were part of.

Hope this helps,

J-D

On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED]  
wrote:



Hello,

I was wondering if anyone has gotten far at all with getting Hadoop up

and running with EC2 + EBS? Any luck getting this to work in a way
that the HDFS runs on the EBS so that it isn't blown away every time
you bring up/down the EC2 Hadoop cluster? I'd like to experiment with
this next, and was curious if anyone had any luck. :)

Thanks!

Ryan