Re: why does my mapper class read my input file twice?

2012-03-05 Thread Harsh J
It's your use of the mapred.input.dir property, which is a reserved
name in the framework (it's what FileInputFormat uses).

You have a config you extract the path from:
Path input = new Path(conf.get("mapred.input.dir"));

Then you do:
FileInputFormat.addInputPath(job, input);

This internally simply appends the path to a config property called
"mapred.input.dir". Hence your job gets launched with two input paths
(pointing at the very same file): one added by the Tool-provided
configuration (because of your -Dmapred.input.dir) and the other added by you.

Fix the input path line to use a different config property:
Path input = new Path(conf.get("input.path"));

And run job as:
hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
-Dmapred.output.dir=result
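
For reference, here is a minimal sketch of the corrected run() method, assuming
the non-reserved property name "input.path" from above (the rest of the job
setup stays as in your original code):

@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // Read the input from a property name the framework does not also use.
    Path input = new Path(conf.get("input.path"));
    Path output = new Path(conf.get("mapred.output.dir"));

    Job job = new Job(conf, "dummy job");
    job.setJarByClass(MyJob.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // addInputPath() appends to "mapred.input.dir"; since that property is no
    // longer pre-populated via -D, the path now gets added exactly once.
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);

    return job.waitForCompletion(true) ? 0 : 1;
}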

On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne  wrote:
> i have code that reads in a text file. i notice that each line in the text
> file is somehow being read twice. why is this happening?
>
> my mapper class looks like the following:
>
> public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
>
> private static final Log _log = LogFactory.getLog(MyMapper.class);
>  @Override
> public void map(LongWritable key, Text value, Context context) throws
> IOException, InterruptedException {
> String s = (new
> StringBuilder()).append(value.toString()).append("m").toString();
> context.write(key, new Text(s));
> _log.debug(key.toString() + " => " + s);
> }
> }
>
> my reducer class looks like the following:
>
> public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
>
> private static final Log _log = LogFactory.getLog(MyReducer.class);
>  @Override
> public void reduce(LongWritable key, Iterable<Text> values, Context
> context) throws IOException, InterruptedException {
> for(Iterator<Text> it = values.iterator(); it.hasNext();) {
> Text txt = it.next();
> String s = (new
> StringBuilder()).append(txt.toString()).append("r").toString();
> context.write(key, new Text(s));
> _log.debug(key.toString() + " => " + s);
> }
> }
> }
>
> my job class looks like the following:
>
> public class MyJob extends Configured implements Tool {
>
> public static void main(String[] args) throws Exception {
> ToolRunner.run(new Configuration(), new MyJob(), args);
> }
>
> @Override
> public int run(String[] args) throws Exception {
> Configuration conf = getConf();
> Path input = new Path(conf.get("mapred.input.dir"));
>    Path output = new Path(conf.get("mapred.output.dir"));
>
>    Job job = new Job(conf, "dummy job");
>    job.setMapOutputKeyClass(LongWritable.class);
>    job.setMapOutputValueClass(Text.class);
>    job.setOutputKeyClass(LongWritable.class);
>    job.setOutputValueClass(Text.class);
>
>    job.setMapperClass(MyMapper.class);
>    job.setReducerClass(MyReducer.class);
>
>    FileInputFormat.addInputPath(job, input);
>    FileOutputFormat.setOutputPath(job, output);
>
>    job.setJarByClass(MyJob.class);
>
>    return job.waitForCompletion(true) ? 0 : 1;
> }
> }
>
> the text file that i am trying to read in looks like the following. as you
> can see, there are 9 lines.
>
> T, T
> T, T
> T, T
> F, F
> F, F
> F, F
> F, F
> T, F
> F, T
>
> the output file that i get after my Job runs looks like the following. as
> you can see, there are 18 lines. each key is emitted twice from the mapper
> to the reducer.
>
> 0   T, Tmr
> 0   T, Tmr
> 6   T, Tmr
> 6   T, Tmr
> 12  T, Tmr
> 12  T, Tmr
> 18  F, Fmr
> 18  F, Fmr
> 24  F, Fmr
> 24  F, Fmr
> 30  F, Fmr
> 30  F, Fmr
> 36  F, Fmr
> 36  F, Fmr
> 42  T, Fmr
> 42  T, Fmr
> 48  F, Tmr
> 48  F, Tmr
>
> the way i execute my Job is as follows (cygwin + hadoop 0.20.2).
>
> hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt
> -Dmapred.output.dir=result
>
> originally, this happened when i read in a sequence file, but even for a
> text file, this problem is still happening. is it the way i have setup my
> Job?



-- 
Harsh J


hadoop 1.0 / HOD or CloneZilla?

2012-03-05 Thread Masoud

Hi all,

I have experience with Hadoop 0.20.204 on a 3-machine cluster as a pilot;
now I'm trying to set up a real cluster on 32 Linux machines.

I have some questions:

1. Is Hadoop 1.0 stable? On the Hadoop site this version is indicated as
   a beta release.

2. As you know, installing and setting up Hadoop on all 32 machines
   separately is not a good idea, so what can I do?
   1. Use Hadoop on Demand (HOD)?
   2. Or use an OS image replication tool such as CloneZilla? I think
      this method is better because, in addition to Hadoop, I can clone
      some other settings such as SSH or Samba on all machines.

Let me know your ideas,

B.S,
Masoud.



Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
Sorry for multiple emails. I did find:


2012-03-05 17:26:35,636 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
Usage threshold init = 715849728(699072K) used = 575921696(562423K)
committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:35,719 INFO
org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
7816154 bytes from 1 objects. init = 715849728(699072K) used =
575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:36,881 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
- Collection threshold init = 715849728(699072K) used = 358720384(350312K)
committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running
child : java.lang.OutOfMemoryError: Java heap space

at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39)

at java.nio.CharBuffer.allocate(CharBuffer.java:312)

at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)

at org.apache.hadoop.io.Text.decode(Text.java:350)

at org.apache.hadoop.io.Text.decode(Text.java:327)

at org.apache.hadoop.io.Text.toString(Text.java:254)

at
org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)

at
org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)

at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)

at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)

at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:396)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

at org.apache.hadoop.mapred.Child.main(Child.java:264)


On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia wrote:

> All I see in the logs is:
>
>
> 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
> attempt_201203051722_0001_m_30_1 - Killed : Java heap space
>
> Looks like the TaskTracker is killing the tasks. Not sure why. I increased
> the heap from 512 MB to 1 GB and it still fails.
>
>
> On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia wrote:
>
>> I currently have java.opts.mapred set to 512MB and I am getting heap
>> space errors. How should I go about debugging heap space issues?
>>
>
>


Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
All I see in the logs is:


2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
attempt_201203051722_0001_m_30_1 - Killed : Java heap space

Looks like the TaskTracker is killing the tasks. Not sure why. I increased the
heap from 512 MB to 1 GB and it still fails.
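
For reference, a minimal sketch of how the per-task child heap is normally
raised on this Hadoop generation, assuming the job is configured from a plain
Java driver (the same property can equally be set in mapred-site.xml or passed
with -D on the command line):

// Each map/reduce task runs in its own child JVM, whose heap is controlled by
// mapred.child.java.opts, not by the TaskTracker daemon's own heap setting.
Configuration conf = new Configuration();
conf.set("mapred.child.java.opts", "-Xmx1024m");
Job job = new Job(conf, "my job");

Note that the FATAL line above says "Error running child", so it is the child
task JVM that is out of heap; raising the TaskTracker daemon's heap does not
affect it.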


On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia wrote:

> I currently have java.opts.mapred set to 512MB and I am getting heap space
> errors. How should I go about debugging heap space issues?
>


Re: OutOfMemoryError: unable to create new native thread

2012-03-05 Thread Clay Chiang
Hi Rohini,

   I ran into a similar problem just yesterday. In my case, the max process
count (ulimit -u) was set to 1024, which is too small, and when I increased
it to 100, the problem was gone. But you said "Ulimit on the machine is set
to unlimited", so I'm not sure whether this will help or not :)

   Also check `cat /proc/sys/kernel/threads-max'; this seems to be a
system-wide setting for the total number of threads.


On Tue, Mar 6, 2012 at 4:30 AM, Rohini U  wrote:

> Hi All,
>
> I am running a map reduce job that uses around 120 MB of data and I get
> this out of memory error.  Ulimit on the machine is set to unlimited.  Any
> ideas on how to fix this?
> The stack trace is as given below:
>
>
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: java.lang.OutOfMemoryError: unable to create new
> native thread
>at java.lang.Thread.start0(Native Method)
>at java.lang.Thread.start(Thread.java:597)
>at
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
>at
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
>at
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
>at
> org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
>at
> org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
>at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:782)
>at
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2938)
>at
> org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:2910)
>at
> org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:1974)
>at
> org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3327)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
>
>at org.apache.hadoop.ipc.Client.call(Client.java:1107)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>at $Proxy0.fatalError(Unknown Source)
>at org.apache.hadoop.mapred.Child.main(Child.java:325)
>
>
>
> Thanks
> -Rohini
>



-- 
Kindest Regards,
Clay Chiang


Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Unfortunately, "public" didn't change my error ... Any other ideas? Has
anyone run Hadoop in Eclipse with custom sequence inputs?

Thank you,
Mark
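
For reference, a minimal sketch of a custom Writable with the public no-argument
constructor that Hadoop's reflection-based deserialization needs (the class and
field names here are illustrative, not the actual TermDocFreqArrayWritable):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyCustomWritable implements Writable {

    private int count;

    // SequenceFile.Reader instantiates value objects reflectively, so a
    // public no-argument constructor is required.
    public MyCustomWritable() {}

    public MyCustomWritable(int count) {
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
    }
}

If the constructor alone does not help, the other usual cause of
SequenceFile.Reader.getValueClass() failing with "can't load class" is that the
custom class is not on the classpath of the process reading the file (e.g.
missing from the jar or from the Eclipse run configuration).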

On Mon, Mar 5, 2012 at 9:58 AM, Mark question  wrote:

> Hi Madhu, it has the following line:
>
> TermDocFreqArrayWritable () {}
>
> but I'll try it with "public" access in case it's been called outside of
> my package.
>
> Thank you,
> Mark
>
>
> On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak  wrote:
>
>> Hi,
>>  Please make sure that your CustomWritable has a default constructor.
>>
>> On Sat, Mar 3, 2012 at 4:56 AM, Mark question 
>> wrote:
>>
>> > Hello,
>> >
>> >   I'm trying to debug my code through eclipse, which worked fine with
>> > given Hadoop applications (eg. wordcount), but as soon as I run it on my
>> > application with my custom sequence input file/types, I get:
>> > java.lang.RuntimeException: java.io.IOException (Writable name can't load
>> > class)
>> > SequenceFile$Reader.getValueClass(SequenceFile.class)
>> >
>> > because my valueClass is custom. In other words, how can I add/build my
>> > CustomWritable class to work alongside Hadoop's LongWritable, IntWritable,
>> > etc.?
>> >
>> > Did anyone use Eclipse?
>> >
>> > Mark
>> >
>>
>>
>>
>> --
>> Join me at http://hadoopworkshop.eventbrite.com/
>>
>
>


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-05 Thread Russell Jurney
Streaming is good for simulation: long-running, map-only processes where Pig
doesn't really help and it is simple to fire off a streaming process. You do
have to set some options so the tasks can take a long time to return or report counters.

Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn  wrote:

> I'm really interested in this as well. I have trouble seeing a really good 
> use case for streaming map-reduce. Is there something I can do in streaming 
> that I can't do in Pig? If I want to re-use previously made Python functions 
> from my code base, I can do that in Pig as much as Streaming, and from what 
> I've experienced thus far, Python streaming seems to go slower than or at the 
> same speed as Pig, so why would I want to write a whole lot of 
> more-difficult-to-read mappers and reducers when I can do equally fast 
> performance-wise, shorter, and clearer code in Pig? Maybe it's obvious, but 
> currently I just can't think of the right use case.
> 
> Eli
> 
> On 3/2/12 9:21 AM, Subir S wrote:
>> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J  wrote:
>> 
>>> On Fri, Mar 2, 2012 at 10:18 AM, Subir S
>>> wrote:
 Hello Folks,
 
 Are there any pointers to such comparisons between Apache Pig and Hadoop
 Streaming Map Reduce jobs?
>>> I do not see why you seek to compare these two. Pig offers a language
>>> that lets you write data-flow operations and runs these statements as
>>> a series of MR jobs for you automatically (Making it a great tool to
>>> use to get data processing done really quick, without bothering with
>>> code), while streaming is something you use to write non-Java, simple
>>> MR jobs. Both have their own purposes.
>>> 
>> Basically we are comparing these two to see the benefits and how much they
>> help in improving the productive coding time, without jeopardizing the
>> performance of MR jobs.
>> 
>> 
 Also there was a claim in our company that Pig performs better than Map
 Reduce jobs? Is this true? Are there any such benchmarks available
>>> Pig _runs_ MR jobs. It does do job design (and some data)
>>> optimizations based on your queries, which is what may give it an edge
>>> over designing elaborate flows of plain MR jobs with tools like
>>> Oozie/JobControl (Which takes more time to do). But regardless, Pig
>>> only makes it easy doing the same thing with Pig Latin statements for
>>> you.
>>> 
>> I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
>> pretty slow with lot of joins, which we can achieve faster with writing raw
>> MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
>> for example what kind of projects should consider Pig. Say when we have a
>> lot of Joins, which writing with plain MR jobs takes time. Thoughts?
>> 
>> Thank you Harsh for your comments. They are helpful!
>> 
>> 
>>> --
>>> Harsh J
>>> 
> 


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-05 Thread Eli Finkelshteyn
I'm really interested in this as well. I have trouble seeing a really 
good use case for streaming map-reduce. Is there something I can do in 
streaming that I can't do in Pig? If I want to re-use previously made 
Python functions from my code base, I can do that in Pig as much as 
Streaming, and from what I've experienced thus far, Python streaming 
seems to go slower than or at the same speed as Pig, so why would I want 
to write a whole lot of more-difficult-to-read mappers and reducers when 
I can do equally fast performance-wise, shorter, and clearer code in 
Pig? Maybe it's obvious, but currently I just can't think of the right 
use case.


Eli

On 3/2/12 9:21 AM, Subir S wrote:

On Fri, Mar 2, 2012 at 12:38 PM, Harsh J  wrote:


On Fri, Mar 2, 2012 at 10:18 AM, Subir S
wrote:

Hello Folks,

Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (Making it a great tool to
use to get data processing done really quick, without bothering with
code), while streaming is something you use to write non-Java, simple
MR jobs. Both have their own purposes.


Basically we are comparing these two to see the benefits and how much they
help in improving the productive coding time, without jeopardizing the
performance of MR jobs.



Also there was a claim in our company that Pig performs better than Map
Reduce jobs? Is this true? Are there any such benchmarks available

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (Which takes more time to do). But regardless, Pig
only makes it easy doing the same thing with Pig Latin statements for
you.


I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
pretty slow with lot of joins, which we can achieve faster with writing raw
MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
for example what kind of projects should consider Pig. Say when we have a
lot of Joins, which writing with plain MR jobs takes time. Thoughts?

Thank you Harsh for your comments. They are helpful!



--
Harsh J





Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Hi Madhu, it has the following line:

TermDocFreqArrayWritable () {}

but I'll try it with "public" access in case it's been called outside of my
package.

Thank you,
Mark

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak  wrote:

> Hi,
>  Please make sure that your CustomWritable has a default constructor.
>
> On Sat, Mar 3, 2012 at 4:56 AM, Mark question  wrote:
>
> > Hello,
> >
> >   I'm trying to debug my code through eclipse, which worked fine with
> > given Hadoop applications (eg. wordcount), but as soon as I run it on my
> > application with my custom sequence input file/types, I get:
> > java.lang.RuntimeException: java.io.IOException (Writable name can't load
> > class)
> > SequenceFile$Reader.getValueClass(SequenceFile.class)
> >
> > because my valueClass is custom. In other words, how can I add/build my
> > CustomWritable class to work alongside Hadoop's LongWritable, IntWritable,
> > etc.?
> >
> > Did anyone use Eclipse?
> >
> > Mark
> >
>
>
>
> --
> Join me at http://hadoopworkshop.eventbrite.com/
>


Re: AWS MapReduce

2012-03-05 Thread Mohit Anchlia
On Mon, Mar 5, 2012 at 7:40 AM, John Conwell  wrote:

> AWS MapReduce (EMR) does not use S3 for its HDFS persistence.  If it did
> your S3 billing would be massive :)  EMR reads all input jar files and
> input data from S3, but it copies these files down to its local disk.  It
> then starts the MR process, doing all HDFS reads and writes to the
> local disks.  At the end of the MR job, it copies the MR job output and all
> process logs to S3, and then tears down the VM instances.
>
> You can see this for yourself if you spin up a small EMR cluster, but turn
> off the configuration flag that kills the VMs at the end of the MR job.
>  Then look at the hadoop configuration files to see how hadoop is
> configured.
>
> I really like EMR.  Amazon  has done a lot of work to optimize the hadoop
> configurations and VM instance AMIs to execute MR jobs fairly efficiently
> on a VM cluster.  I had to do a lot of (expensive) trial and error work to
> figure out an optimal hadoop / VM configuration to run our MR jobs without
> crashing / timing out the jobs.  The only reason we didn't standardize on
> EMR was that it strongly bound your code base / process to using EMR for
> hadoop processing, vs a flexible infrastructure that could use a local
> cluster or cluster on a different cloud provider.
>
> Thanks for your input. I am assuming HDFS is created on ephemeral disks
and not EBS. Also, is it possible to share some of your findings?

>
> On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia  >wrote:
>
> > As far as I see in the docs it looks like you could also use hdfs instead
> > of s3. But what I am not sure is if these are local disks or EBS.
> >
> > On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer <
> > hannesc...@googlemail.com
> > > wrote:
> >
> > > Hi,
> > >
> > > yes, it's loaded from S3. IMHO Amazon AWS Map-Reduce is pretty slow.
> > > The setup is done pretty fast and there are some configuration
> parameters
> > > you can bypass - for example blocksizes etc. - but in the end imho
> > setting
> > > up ec2 instances by copying images is the better alternative.
> > >
> > > Kind Regards
> > >
> > > Hannes
> > >
> > > On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia  > > >wrote:
> > >
> > > > I think I found the answer to this question. However, it's still not clear
> if
> > > > HDFS is on local disk or EBS volumes. Does anyone know?
> > > >
> > > > On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia <
> mohitanch...@gmail.com
> > > > >wrote:
> > > >
> > > > > Just want to check  how many are using AWS mapreduce and understand
> > the
> > > > > pros and cons of Amazon's MapReduce machines? Is it true that these
> > map
> > > > > reduce machines are really reading and writing from S3 instead of
> > local
> > > > > disks? Has anyone found issues with Amazon MapReduce and how does
> it
> > > > > compare with using MapReduce on local attached disks compared to
> > using
> > > > S3.
> > > >
> > >
> > > ---
> > > www.informera.de
> > > Hadoop & Big Data Services
> > >
> >
>
>
>
> --
>
> Thanks,
> John C
>


Re: AWS MapReduce

2012-03-05 Thread John Conwell
AWS MapReduce (EMR) does not use S3 for its HDFS persistence.  If it did
your S3 billing would be massive :)  EMR reads all input jar files and
input data from S3, but it copies these files down to its local disk.  It
then starts the MR process, doing all HDFS reads and writes to the
local disks.  At the end of the MR job, it copies the MR job output and all
process logs to S3, and then tears down the VM instances.

You can see this for yourself if you spin up a small EMR cluster, but turn
off the configuration flag that kills the VMs at the end of the MR job.
 Then look at the hadoop configuration files to see how hadoop is
configured.

I really like EMR.  Amazon  has done a lot of work to optimize the hadoop
configurations and VM instance AMIs to execute MR jobs fairly efficiently
on a VM cluster.  I had to do a lot of (expensive) trial and error work to
figure out an optimal hadoop / VM configuration to run our MR jobs without
crashing / timing out the jobs.  The only reason we didn't standardize on
EMR was that it strongly bound your code base / process to using EMR for
hadoop processing, vs a flexible infrastructure that could use a local
cluster or cluster on a different cloud provider.


On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia wrote:

> As far as I see in the docs it looks like you could also use hdfs instead
> of s3. But what I am not sure is if these are local disks or EBS.
>
> On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer <
> hannesc...@googlemail.com
> > wrote:
>
> > Hi,
> >
> > yes, it's loaded from S3. IMHO Amazon AWS Map-Reduce is pretty slow.
> > The setup is done pretty fast and there are some configuration parameters
> > you can bypass - for example blocksizes etc. - but in the end imho
> setting
> > up ec2 instances by copying images is the better alternative.
> >
> > Kind Regards
> >
> > Hannes
> >
> > On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia  > >wrote:
> >
> > > I think I found the answer to this question. However, it's still not clear if
> > > HDFS is on local disk or EBS volumes. Does anyone know?
> > >
> > > On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia  > > >wrote:
> > >
> > > > Just want to check  how many are using AWS mapreduce and understand
> the
> > > > pros and cons of Amazon's MapReduce machines? Is it true that these
> map
> > > > reduce machines are really reading and writing from S3 instead of
> local
> > > > disks? Has anyone found issues with Amazon MapReduce and how does it
> > > > compare with using MapReduce on local attached disks compared to
> using
> > > S3.
> > >
> >
> > ---
> > www.informera.de
> > Hadoop & Big Data Services
> >
>



-- 

Thanks,
John C


Re: Setting up Hadoop single node setup on Mac OS X

2012-03-05 Thread John Armstrong

On 02/27/2012 11:53 AM, W.P. McNeill wrote:

You don't need any virtualization. Mac OS X is Linux and runs Hadoop as is.



Nitpick: OS X is NEXTSTEP based on Mach, which is a different 
POSIX-compliant system from Linux.


fairscheduler : group.name | Please edit patch to work for 0.20.205

2012-03-05 Thread Austin Chungath
Can someone have a look at the patch MAPREDUCE-2457 and see if it can be
modified to work for 0.20.205?
I am very new to Java and have no idea what's going on in that patch. If
you have any pointers for me, I will see if I can do it on my own.

Thanks,
Austin
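
For context, a minimal sketch of the setting under discussion, as it would
appear in the JobTracker's mapred-site.xml; this only tells the fair scheduler
which job property to read the pool name from, and whether group.name actually
gets populated there is exactly what the patch above is about:

<!-- Fair scheduler: derive the pool name from the submitter's group
     instead of the default user.name. -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>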

On Fri, Mar 2, 2012 at 7:15 PM, Austin Chungath  wrote:

> I tried the patch MAPREDUCE-2457 but it didn't work for my hadoop 0.20.205.
> Are you sure this patch will work for 0.20.205?
> According to the description it says that the patch works for 0.21 and
> 0.22 and it says that 0.20 supports group.name without this patch...
>
> So does this patch also apply to 0.20.205?
>
> Thanks,
> Austin
>
>  On Thu, Mar 1, 2012 at 11:24 PM, Harsh J  wrote:
>
>> The group.name scheduler support was introduced in
>> https://issues.apache.org/jira/browse/HADOOP-3892 but may have been
>> broken by the security changes present in 0.20.205. You'll need the
>> fix presented in  https://issues.apache.org/jira/browse/MAPREDUCE-2457
>> to have group.name support.
>>
>> On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath 
>> wrote:
>> >  I am running fair scheduler on hadoop 0.20.205.0
>> >
>> > http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
>> > The above page talks about the following property
>> >
>> > *mapred.fairscheduler.poolnameproperty*
>> > **
>> > which I can set to *group.name*
>> > The default is user.name and when a user submits a job the fair
>> scheduler
>> > assigns each user's job to a pool which has the name of the user.
>> > I am trying to change it to group.name so that the job is submitted to
>> a
>> > pool which has the name of the user's linux group. Thus all jobs from
>> any
>> > user from a specific group go to the same pool instead of an individual
>> > pool for every user.
>> > But *group.name* doesn't seem to work, has anyone tried this before?
>> >
>> > *user.name* and *mapred.job.queue.name* works. Is group.name supported
>> in
>>  > 0.20.205.0 because I don't see it mentioned in the docs?
>> >
>> > Thanks,
>> > Austin
>>
>>
>>
>> --
>> Harsh J
>>
>
>