When will hadoop 0.19.2 be released?

2009-04-24 Thread Zhou, Yunqing
Currently I'm managing a 64-node Hadoop 0.19.1 cluster with 100 TB of data.
I've found 0.19.1 to be buggy and have already applied some patches from the
Hadoop JIRA to work around problems.
But I'm looking forward to a more stable release of Hadoop.
Do you know when 0.19.2 will be released?

Thanks.


Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread jason hadoop
You could try the Cloudera release, which is based on 0.18.3 with many
backported features.
http://www.cloudera.com/distribution

On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com wrote:

 currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
 and I found 0.19.1 is buggy and I have already applied some patches on
 hadoop jira to solve problems.
 But I'm looking forward to a more stable release of hadoop.
 Do you know when will 0.19.2 be released?

 Thanks.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread Zhou, Yunqing
But there are already 100 TB of data stored on DFS.
Is there a safe way to do such a downgrade?

On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop jason.had...@gmail.com wrote:
 You could try the cloudera release based on 18.3, with many backported
 features.
 http://www.cloudera.com/distribution

 On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com wrote:

 currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
 and I found 0.19.1 is buggy and I have already applied some patches on
 hadoop jira to solve problems.
 But I'm looking forward to a more stable release of hadoop.
 Do you know when will 0.19.2 be released?

 Thanks.




 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422



Re: Num map task?

2009-04-24 Thread jason hadoop
Unless the argument (args[0]) to your job is a comma-separated set of paths,
you are only adding a single input path. It may be that you want to pass args
and not args[0].
 FileInputFormat.setInputPaths(c, args[0]);
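If the goal is to feed every command-line argument in as an input path, a rough
sketch (assuming the JobConf-based org.apache.hadoop.mapred API of this era;
the loop below is illustrative, not the original code) would be:

  // Sketch only: add each argument as its own input path instead of
  // relying on args[0] being a comma-separated list.
  JobConf c = new JobConf(getConf(), TestImport.class);
  for (String arg : args) {
    FileInputFormat.addInputPath(c, new Path(arg));
  }
  // Alternatively, if the paths really do arrive as one comma-separated
  // string, FileInputFormat.setInputPaths(c, args[0]) handles the split.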

On Thu, Apr 23, 2009 at 7:10 PM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote:

 Edward J. Yoon wrote:

  As far as I know, FileInputFormat.getSplits() will return the number
  of splits, automatically computed from the number of files and blocks. BTW,
  what version of Hadoop/HBase?
 
  I tried to test that code
  (http://wiki.apache.org/hadoop/Hbase/MapReduce) on my cluster (Hadoop
  0.19.1 and Hbase 0.19.0). The number of input paths was 2, map tasks
  were 274.
 
  Below is my changed code for v0.19.0.
  ---
public JobConf createSubmittableJob(String[] args) {
  JobConf c = new JobConf(getConf(), TestImport.class);
  c.setJobName(NAME);
  FileInputFormat.setInputPaths(c, args[0]);
 
  c.set("input.table", args[1]);
  c.setMapperClass(InnerMap.class);
  c.setNumReduceTasks(0);
  c.setOutputFormat(NullOutputFormat.class);
  return c;
}
 
 
 
  On Thu, Apr 23, 2009 at 6:19 PM, nguyenhuynh.mr
  nguyenhuynh...@gmail.com wrote:
 
  Edward J. Yoon wrote:
 
 
  How do you add input paths?
 
  On Wed, Apr 22, 2009 at 5:09 PM, nguyenhuynh.mr
  nguyenhuynh...@gmail.com wrote:
 
 
  Edward J. Yoon wrote:
 
 
 
  Hi,
 
  In that case, the atomic unit of a split is a file, so you need to
  increase the number of files, or use TextInputFormat as below.
 
  jobConf.setInputFormat(TextInputFormat.class);
 
  On Wed, Apr 22, 2009 at 4:35 PM, nguyenhuynh.mr
  nguyenhuynh...@gmail.com wrote:
 
 
 
   Hi all!


   I have a MR job used to import content into HBase.

   The content is a text file in HDFS. I use "map" files to store the
   local paths of the content.

   Each piece of content has its own map file (the map file is a text
   file in HDFS containing one line of info).


   I created a "maps" directory to hold the map files, and this maps
   directory is used as the input path for the job.

   When I run the job, the number of map tasks is the same as the
   number of map files. E.g. 5 map files -> 5 map tasks.

   Therefore, the map phase is slow :(

   Why is the map phase slow when the number of map tasks is large and
   equal to the number of files?

   P.S.: The jobs run on 3 nodes: 1 master and 2 slaves.

   Please help me!
   Thanks.

   Best,
   Nguyen.
 
 
 
 
 
 
 
 
  Currently, I use TextInputFormat as the InputFormat for the map phase.
 
 
 
 
  Thanks for your help!
 
  I use FileInputFormat to add input paths.
  Something like:
     FileInputFormat.setInputPaths(jobConf, new Path(dir));

  Here dir is a directory containing the input files.
 
  Best,
  Nguyen
 
 
 
 
 Thanks!

 I am using Hadoop version 0.18.2

 Cheers,
 Nguyen.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread Aaron Kimball
In general, there is no way to do an automated downgrade of HDFS metadata :\
Once you've gone up an HDFS version, I'm afraid you're stuck there. The only
real way to downgrade requires that you have enough free space to distcp from
one cluster to the other. If you have 100 TB of free space (!!) then that's
easy, if time-consuming. If not, then you'll have to be a bit more clever,
e.g. by down-replicating all the files first and storing them in
down-replicated fashion on the destination HDFS instance until you're done,
then upping the replication on the destination cluster after the source
cluster has been drained.
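As a rough sketch of that approach (host names, ports, and paths below are made
up; copying between clusters on different HDFS versions is typically done by
reading the source over HFTP):

  # From a client of the new cluster, copy one batch at a time:
  hadoop distcp hftp://old-namenode:50070/data/batch1 /data/batch1

  # Keep the copied files down-replicated until the old cluster is drained:
  hadoop dfs -setrep -R 2 /data/batch1

  # ...free the batch on the old cluster, repeat, then raise replication
  # on the destination at the end:
  hadoop dfs -setrep -R 3 /data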

Waiting for 0.19.2 might be the better call here.

- Aaron

On Fri, Apr 24, 2009 at 3:12 PM, Zhou, Yunqing azure...@gmail.com wrote:

 But there are already 100TB data stored on DFS.
 Is there a safe solution to do such a downgrade?

 On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop jason.had...@gmail.com
 wrote:
  You could try the cloudera release based on 18.3, with many backported
  features.
  http://www.cloudera.com/distribution
 
  On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com
 wrote:
 
  currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
  and I found 0.19.1 is buggy and I have already applied some patches on
  hadoop jira to solve problems.
  But I'm looking forward to a more stable release of hadoop.
  Do you know when will 0.19.2 be released?
 
  Thanks.
 
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
 



Re: could only be replicated to 0 nodes, instead of 1

2009-04-24 Thread Piotr
Hi

I ran into a very similar problem when trying to configure HDFS.
The solution was to configure a smaller block size.
I wanted to install HDFS for testing purposes only, so I decided to have ~300
MB of storage space on each machine, while the block size was set to 128 MB (I
used the Cloudera configuration tool).
After changing the block size to 1 MB (it could be bigger, but this is not a
production environment), everything started to work fine!
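For reference, the block size is controlled by the dfs.block.size property (in
bytes); a sketch of what the hadoop-site.xml entry might look like for the 1 MB
setting described above:

  <!-- Sketch only: 1048576 bytes = 1 MB. The default is 64 MB, and
       production clusters normally keep the default or go larger. -->
  <property>
    <name>dfs.block.size</name>
    <value>1048576</value>
  </property>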

regards
Piotr Praczyk


Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread jason hadoop
You could set up a parallel 0.18.3 HDFS on your cluster and move the files in
steps between the two.
You would only need extra storage for the files in transition.
Your jobs would have to stop for the duration of the copy.
Manual labor, and time-consuming :(

On Fri, Apr 24, 2009 at 12:11 AM, Aaron Kimball aa...@cloudera.com wrote:

 In general, there is no way to do an automated downgrade of HDFS metadata
 :\
 If you're up an HDFS version, I'm afraid you're stuck there. The only real
 way to downgrade requires that you have enough free space to distcp from
 one
 cluster to the other. If you have 100 TB of free space (!!) then that's
 easy, if time-consuming. If not, then you'll have to be a bit more clever.
 e.g., by downreplicating all the files first and storing them in
 downreplicated fashion on the destination hdfs instance until you're done,
 then upping the replication on the destination cluster after the source
 cluster has been drained.

 Waiting for 0.19.2 might be the better call here.

 - Aaron

 On Fri, Apr 24, 2009 at 3:12 PM, Zhou, Yunqing azure...@gmail.com wrote:

  But there are already 100TB data stored on DFS.
  Is there a safe solution to do such a downgrade?
 
  On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop jason.had...@gmail.com
  wrote:
   You could try the cloudera release based on 18.3, with many backported
   features.
   http://www.cloudera.com/distribution
  
   On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com
  wrote:
  
   currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB
 data.
   and I found 0.19.1 is buggy and I have already applied some patches on
   hadoop jira to solve problems.
   But I'm looking forward to a more stable release of hadoop.
   Do you know when will 0.19.2 be released?
  
   Thanks.
  
  
  
  
   --
   Alpha Chapters of my book on Hadoop are available
   http://www.apress.com/book/view/9781430219422
  
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Using the Stanford NLP with hadoop

2009-04-24 Thread Stuart Sierra
On Tue, Apr 21, 2009 at 4:58 PM, Kevin Peterson kpeter...@biz360.com wrote:
 I'm interested to know if you have found any other open source parsers in
 Java or at least have java bindings.

Stanford is one of the best, although it is slow.  LingPipe
http://alias-i.com/lingpipe/ is free for non-commercial use, and
they link to most of the open-source toolkits here:
http://alias-i.com/lingpipe/web/competition.html  It seems like most
NLP toolkits don't attempt full sentence parsing, but instead focus on
tagging, chunking, or entity recognition.

-Stuart


Advice on restarting HDFS in a cron

2009-04-24 Thread Marc Limotte
Hi.

I've heard that HDFS starts to slow down after it's been running for a long
time, and I believe I've experienced this. So, I was thinking of setting up a
cron job that runs every week to shut down HDFS and start it up again.

In concept, it would be something like:

0 0 * * 0 $HADOOP_HOME/bin/stop-dfs.sh; $HADOOP_HOME/bin/start-dfs.sh

But I'm wondering if there is a safer way to do this.  In particular:

* What if a map/reduce job is running when this cron job fires?  Is there a
way to suspend jobs while the HDFS restart happens?

* Should I also restart the mapred daemons?

* Should I wait some time after stop-dfs.sh for things to settle 
down, before executing start-dfs.sh?  Or maybe I should run a command to 
verify that it is stopped before I run the start?

Thanks for any help.
Marc




Re: Writing a New Aggregate Function

2009-04-24 Thread Runping Qi
A couple of general goals behind the aggregate package:

1. If you are an application developer using the aggregate package, you only
need to develop your own (user-defined) value aggregator descriptor classes,
which are typically subclasses of ValueAggregatorDescriptor. You can use
the existing aggregator types (such as LongValueSum, ValueHistogram, etc.)

2. If you want to contribute new types of aggregator (for example, a
ValueAverage class that keeps track of the average of values would be a much
needed one), then you need to implement a class that implements the
ValueAggregator interface, and to update the generateValueAggregator method of
ValueAggregatorBaseDescriptor to handle your new aggregators.

3. If you want to contribute to the aggregate framework itself, you may
need to touch every bit of the code in the package.
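
To make item 2 concrete, here is a rough, untested sketch of such an aggregator
(the class name and the sampling logic are invented for illustration; the
ValueAggregator methods shown -- addNextValue, getCombinerOutput, getReport,
reset -- are, as far as I can tell, the ones the built-in aggregators such as
LongValueSum implement):

  import java.util.ArrayList;
  import java.util.Random;
  import org.apache.hadoop.mapred.lib.aggregate.ValueAggregator;

  // Hypothetical example: keep a bounded random sample of the values
  // seen for a key (simple reservoir sampling).
  public class SampleValues implements ValueAggregator {
    private static final int MAX = 100;              // sample size, arbitrary
    private final ArrayList<String> sample = new ArrayList<String>();
    private final Random rand = new Random();
    private long seen = 0;

    public void addNextValue(Object val) {           // one call per value
      seen++;
      if (sample.size() < MAX) {
        sample.add(val.toString());
      } else if (rand.nextDouble() < (double) MAX / seen) {
        sample.set(rand.nextInt(MAX), val.toString());
      }
    }

    public ArrayList getCombinerOutput() {           // partials sent on to the reducer
      return new ArrayList<String>(sample);
    }

    public String getReport() {                      // final value for the key
      return sample.toString();
    }

    public void reset() {
      sample.clear();
      seen = 0;
    }
  }

You would then, as described in item 2, teach
ValueAggregatorBaseDescriptor.generateValueAggregator to return a SampleValues
instance when it sees the SampleValues: prefix in the streaming output.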

Runping



On Thu, Apr 23, 2009 at 1:44 PM, Dan Milstein dmilst...@hubteam.com wrote:

 Hello all,

 I've been using streaming + the aggregate package (available via -reducer
 aggregate), and have been very happy with what it gives me.

 I'm interested in writing my own new aggregate functions (in Java) which I
 could then access from my streaming code.

 Can anyone give me pointers towards how to make that happen?  I've read
 through the aggregate package source, but I'm not seeing how to define my
 own, and get access to it from streaming.

 To be specific, here's the sort of thing I'd like to be able to do:

  - In Java, define a SampleValues aggregator, which chooses a sample of the
 input given to it

  - From my streaming program, in say python, output:

 SampleValues:some_key \t some_value

  - Have the aggregate framework somehow call my new aggregator for the
 combiner and reducer steps

 Thanks,
 -Dan Milstein



Re: Advice on restarting HDFS in a cron

2009-04-24 Thread Allen Wittenauer



On 4/24/09 9:31 AM, Marc Limotte mlimo...@feeva.com wrote:
 I've heard that HDFS starts to slow down after it's been running for a long
 time.  And I believe I've experienced this.

We did an upgrade (== complete restart) of a 2000 node instance in ~20
minutes on Wednesday. I wouldn't really consider that 'slow', but YMMV.

I suspect people aren't running the secondary namenode and therefore have a
massively large edits file.  The namenode appears slow on restart because
it has to apply the edits to the fsimage itself, rather than having the
secondary keep it up to date.
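
If the secondary namenode isn't running, a quick sketch of starting one by hand
(start-dfs.sh normally does this for the hosts listed in conf/masters; the
checkpoint interval is controlled by fs.checkpoint.period):

  # On the machine that should host the secondary namenode:
  bin/hadoop-daemon.sh start secondarynamenode

  # A steadily growing edits file under dfs.name.dir/current is the usual
  # sign that no checkpointing is happening.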




Re: Writing a New Aggregate Function

2009-04-24 Thread Dan Milstein

Runping,

Thanks for the response.  A question about case (2) below (which is, in fact,
what I want to do):

 - Is there any way to do this without patching the code within the
aggregate package?


It sure doesn't look like it, but just to make sure.

Thanks again,
-Dan M

On Apr 24, 2009, at 12:56 PM, Runping Qi wrote:


A couple of general goals behind the aggregate package:

1. If you are an application developer using the aggregate package, you only
need to develop your own (user-defined) value aggregator descriptor classes,
which are typically subclasses of ValueAggregatorDescriptor. You can use
the existing aggregator types (such as LongValueSum, ValueHistogram, etc.)

2. If you want to contribute new types of aggregator (for example, a
ValueAverage class that keeps track of the average of values would be a much
needed one), then you need to implement a class that implements the
ValueAggregator interface, and to update the generateValueAggregator method
of ValueAggregatorBaseDescriptor to handle your new aggregators.

3. If you want to contribute to the aggregate framework itself, you may
need to touch every bit of the code in the package.

Runping



On Thu, Apr 23, 2009 at 1:44 PM, Dan Milstein dmilst...@hubteam.com wrote:


Hello all,

I've been using streaming + the aggregate package (available via -reducer
aggregate), and have been very happy with what it gives me.

I'm interested in writing my own new aggregate functions (in Java) which I
could then access from my streaming code.

Can anyone give me pointers towards how to make that happen?  I've read
through the aggregate package source, but I'm not seeing how to define my
own, and get access to it from streaming.

To be specific, here's the sort of thing I'd like to be able to do:

- In Java, define a SampleValues aggregator, which chooses a sample of the
input given to it

- From my streaming program, in say python, output:

SampleValues:some_key \t some_value

- Have the aggregate framework somehow call my new aggregator for the
combiner and reducer steps

Thanks,
-Dan Milstein





RE: Advice on restarting HDFS in a cron

2009-04-24 Thread Marc Limotte
Actually, I'm concerned about the performance of map/reduce jobs on a
long-running cluster.  I.e. it seems to get slower the longer it's running.
After a restart of HDFS, the jobs seem to run faster.  I'm not concerned about
the start-up time of HDFS.

Of course, as you suggest, this could be poor configuration of the cluster on 
my part; but I'd still like to hear best practices around doing a scheduled 
restart.

Marc

-Original Message-
From: Allen Wittenauer [mailto:a...@yahoo-inc.com]
Sent: Friday, April 24, 2009 10:17 AM
To: core-user@hadoop.apache.org
Subject: Re: Advice on restarting HDFS in a cron




On 4/24/09 9:31 AM, Marc Limotte mlimo...@feeva.com wrote:
 I've heard that HDFS starts to slow down after it's been running for a long
 time.  And I believe I've experienced this.

We did an upgrade (== complete restart) of a 2000 node instance in ~20
minutes on Wednesday. I wouldn't really consider that 'slow', but YMMV.

I suspect people aren't running the secondary name node and therefore have
massively large edits file.  The name node appears slow on restart because
it has to apply the edits to the fsimage rather than having the secondary
keep it up to date.


-Original Message-
From: Marc Limotte

Hi.

I've heard that HDFS starts to slow down after it's been running for a long 
time.  And I believe I've experienced this.   So, I was thinking to set up a 
cron job to execute every week to shutdown HDFS and start it up again.

In concept, it would be something like:

0 0 0 0 0 $HADOOP_HOME/bin/stop-dfs.sh; $HADOOP_HOME/bin/start-dfs.sh

But I'm wondering if there is a safer way to do this.  In particular:

* What if a map/reduce job is running when this cron hits.  Is there a 
way to suspend jobs while the HDFS restart happens?

* Should I also restart the mapred daemons?

* Should I wait some time after stop-dfs.sh for things to settle 
down, before executing start-dfs.sh?  Or maybe I should run a command to 
verify that it is stopped before I run the start?

Thanks for any help.
Marc




Re: Advice on restarting HDFS in a cron

2009-04-24 Thread Todd Lipcon
On Fri, Apr 24, 2009 at 11:18 AM, Marc Limotte mlimo...@feeva.com wrote:

 Actually, I'm concerned about performance of map/reduce jobs for a
 long-running cluster.  I.e. it seems to get slower the longer it's running.
  After a restart of HDFS, the jobs seems to run faster.  Not concerned about
 the start-up time of HDFS.


Hi Marc,

Does it sound like this JIRA describes your problem?

https://issues.apache.org/jira/browse/HADOOP-4766

If so, restarting just the JT should help with the symptoms. (I say symptoms
because this is clearly a problem! Hadoop should be stable and performant
for months without a cluster restart!)
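
A minimal sketch of that, using the standard scripts: this bounces the
MapReduce daemons (JobTracker plus TaskTrackers) while leaving HDFS up, so let
running jobs finish first, since anything in flight is lost:

  # On the JobTracker node:
  $HADOOP_HOME/bin/stop-mapred.sh
  $HADOOP_HOME/bin/start-mapred.sh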

-Todd



 Of course, as you suggest, this could be poor configuration of the cluster
 on my part; but I'd still like to hear best practices around doing a
 scheduled restart.

 Marc

 -Original Message-
 From: Allen Wittenauer [mailto:a...@yahoo-inc.com]
 Sent: Friday, April 24, 2009 10:17 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Advice on restarting HDFS in a cron




 On 4/24/09 9:31 AM, Marc Limotte mlimo...@feeva.com wrote:
  I've heard that HDFS starts to slow down after it's been running for a
 long
  time.  And I believe I've experienced this.

 We did an upgrade (== complete restart) of a 2000 node instance in ~20
 minutes on Wednesday. I wouldn't really consider that 'slow', but YMMV.

 I suspect people aren't running the secondary name node and therefore have
 massively large edits file.  The name node appears slow on restart because
 it has to apply the edits to the fsimage rather than having the secondary
 keep it up to date.


 -Original Message-
 From: Marc Limotte

 Hi.

 I've heard that HDFS starts to slow down after it's been running for a long
 time.  And I believe I've experienced this.   So, I was thinking to set up a
 cron job to execute every week to shutdown HDFS and start it up again.

 In concept, it would be something like:

 0 0 0 0 0 $HADOOP_HOME/bin/stop-dfs.sh; $HADOOP_HOME/bin/start-dfs.sh

 But I'm wondering if there is a safer way to do this.  In particular:

 * What if a map/reduce job is running when this cron hits.  Is
 there a way to suspend jobs while the HDFS restart happens?

 * Should I also restart the mapred daemons?

 * Should I wait some time after stop-dfs.sh for things to settle
 down, before executing start-dfs.sh?  Or maybe I should run a command to
 verify that it is stopped before I run the start?

 Thanks for any help.
 Marc





Re: Writing a New Aggregate Function

2009-04-24 Thread Runping Qi
You are right; you have to patch the code  in the aggregate package.


On Fri, Apr 24, 2009 at 10:24 AM, Dan Milstein dmilst...@hubteam.com wrote:

 Runping,

 Thanks for the response.  A question about case (2) below, (which is, in
 fact, what I want to do):

  - Is there any way to do this without patching the code within the
 aggregator package?

 It sure doesn't look like it, but just to make sure.

 Thanks again,
 -Dan M


 On Apr 24, 2009, at 12:56 PM, Runping Qi wrote:

  A couple of general goals behind the aggregate package:

 1. If you are an application developer using the aggregate package, you only
 need to develop your own (user-defined) value aggregator descriptor classes,
 which are typically subclasses of ValueAggregatorDescriptor. You can use
 the existing aggregator types (such as LongValueSum, ValueHistogram, etc.)

 2. If you want to contribute new types of aggregator (for example, a
 ValueAverage class that keeps track of the average of values would be a much
 needed one), then you need to implement a class that implements the
 ValueAggregator interface, and to update the generateValueAggregator method
 of ValueAggregatorBaseDescriptor to handle your new aggregators.

 3. If you want to contribute to the aggregate framework itself, you may
 need to touch every bit of the code in the package.

 Runping



 On Thu, Apr 23, 2009 at 1:44 PM, Dan Milstein dmilst...@hubteam.com
 wrote:

  Hello all,

 I've been using streaming + the aggregate package (available via -reducer
 aggregate), and have been very happy with what it gives me.

 I'm interested in writing my own new aggregate functions (in Java) which
 I
 could then access from my streaming code.

 Can anyone give me pointers towards how to make that happen?  I've read
 through the aggregate package source, but I'm not seeing how to define my
 own, and get access to it from streaming.

 To be specific, here's the sort of thing I'd like to be able to do:

 - In Java, define a SampleValues aggregator, which chooses a sample of
 the
 input given to it

 - From my streaming program, in say python, output:

 SampleValues:some_key \t some_value

 - Have the aggregate framework somehow call my new aggregator for the
 combiner and reducer steps

 Thanks,
 -Dan Milstein





HDFS files naming convention

2009-04-24 Thread Parul Kudtarkar

The files generated by a MapReduce run are stored in HDFS as
part-0, and so on up to part-n.

Is it possible to name these output files stored in HDFS as per my own
convention, i.e. I would like to name these files my_file_1 and so
on up to my_file_n?

Please advise how this can be achieved.

Thanks,
Parul V. Kudtarkar



Re: Copying files from HDFS to remote database

2009-04-24 Thread Parul Kudtarkar

Thanks Mr. Borthakur.

Parul V. Kudtarkar

Dhruba Borthakur-2 wrote:
 
 You can use any of these:
 
 1. bin/hadoop dfs -get hdfsfile remote filename
 2. Thrift API : http://wiki.apache.org/hadoop/HDFS-APIs
 3. use fuse-mount to mount HDFS as a regular file system on a remote
 machine:
 http://wiki.apache.org/hadoop/MountableHDFS
 
 thanks,
 dhruba
 
 
 
 On Mon, Apr 20, 2009 at 9:40 PM, Parul Kudtarkar 
 parul_kudtar...@hms.harvard.edu wrote:
 

 Our application is using hadoop to parallelize jobs across ec2 cluster.
 HDFS
 is used to store output files. How would you ideally copy output files
 from
 HDFS to remote databases?

 Thanks,
 Parul V. Kudtarkar


 
 




Processing High CPU Memory intensive tasks on Hadoop - Architecture question

2009-04-24 Thread amit handa
Hi,

We are planning to use Hadoop for some very expensive and long-running
processing tasks.
The processes we plan to run are very heavy in terms of CPU and memory
requirements, e.g. one process instance takes almost 100% of a CPU (1 core)
and around 300-400 MB of RAM.
The first time the process loads it can take around 1 to 1.5 minutes, but
after that we can feed it data and it takes only a few seconds to process.
Can I model this on Hadoop?
Can I have my processes pre-loaded on the task-processing machines, with the
data provided by Hadoop? This would save the 1 to 1.5 minutes of initial load
time that each task would otherwise incur.
I want to run a number of these processes in parallel based on each machine's
capacity (e.g. 6 instances on an 8-CPU box) or using the capacity scheduler.

Please let me know if this is possible, or point me to how it can be done.

Thanks,
Amit