Re: different input/output formats

2012-05-29 Thread Mark question
Hi Samir, can you email me your main class.. or if you can check mine, it
is as follows:

public class SortByNorm1 extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.printf("Usage: bin/hadoop jar norm1.jar <input> <output>\n");
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        JobConf conf = new JobConf(new Configuration(), SortByNorm1.class);
        conf.setJobName("SortDocByNorm1");
        conf.setMapperClass(Norm1Mapper.class);
        conf.setMapOutputKeyClass(FloatWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);
        conf.setReducerClass(Norm1Reducer.class);
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        TextInputFormat.addInputPath(conf, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SortByNorm1(), args);
        System.exit(exitCode);
    }
}
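
For reference, a minimal map-only sketch consistent with the driver above (old mapred API). With setNumReduceTasks(0) the reducer settings are unused and whatever the map collects goes straight to the SequenceFile writer, which enforces the classes set via setOutputKeyClass/setOutputValueClass. The class name and the "norm" computation below are placeholders, since the real Norm1Mapper is not shown in the thread:

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Norm1MapperSketch extends MapReduceBase
        implements Mapper<LongWritable, Text, FloatWritable, Text> {

    public void map(LongWritable offset, Text doc,
                    OutputCollector<FloatWritable, Text> output, Reporter reporter)
            throws IOException {
        // Placeholder "norm": the token count of the line. The key passed to
        // collect() must be a FloatWritable; emitting the LongWritable offset
        // instead is exactly what produces the "wrong key class" IOException
        // quoted below.
        float norm1 = doc.toString().split("\\s+").length;
        output.collect(new FloatWritable(norm1), doc);
    }
}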


On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> Hi Mark
>See the out put for that same  Application .
>I am  not getting any error.
>
>
> On Wed, May 30, 2012 at 1:27 AM, Mark question wrote:
>
>> Hi guys, this is a very simple program, trying to use TextInputFormat and
>> SequenceFileOutputFormat. Should be easy, but I get the same error.
>>
>> Here is my configurations:
>>
>>conf.setMapperClass(myMapper.class);
>>conf.setMapOutputKeyClass(FloatWritable.class);
>>conf.setMapOutputValueClass(Text.class);
>>conf.setNumReduceTasks(0);
>>conf.setOutputKeyClass(FloatWritable.class);
>>conf.setOutputValueClass(Text.class);
>>
>>conf.setInputFormat(TextInputFormat.class);
>>conf.setOutputFormat(SequenceFileOutputFormat.class);
>>
>>TextInputFormat.addInputPath(conf, new Path(args[0]));
>>SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>
>>
>> myMapper class is:
>>
>> public class myMapper extends MapReduceBase implements
>> Mapper<LongWritable, Text, FloatWritable, Text> {
>>
>>public void map(LongWritable offset, Text val,
>> OutputCollector<FloatWritable, Text> output, Reporter reporter)
>>throws IOException {
>>output.collect(new FloatWritable(1), val);
>> }
>> }
>>
>> But I get the following error:
>>
>> 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
>> attempt_201205260045_0032_m_00_0, Status : FAILED
>> java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
>> not class org.apache.hadoop.io.FloatWritable
>>at
>> org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
>>at
>>
>> org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
>>at
>>
>> org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
>>at
>>
>> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
>>at
>>
>> filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
>>at
>>
>> filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
>>at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
>>at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
>>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>>at java.security.AccessController.doPrivileged(Native Method)
>>at javax.security.auth.Subject.doAs(Subject.java:396)
>>at org.apache.hadoop.security.Use
>>
>> Where is the writing of LongWritable coming from ??
>>
>> Thank you,
>> Mark
>>
>
>


Re: different input/output formats

2012-05-29 Thread Mark question
Thanks for the reply, but I already tried this option, and here is the error:

java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
not class org.apache.hadoop.io.FloatWritable
at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at
org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at
org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

Mark

On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> Hi  Mark
>
>  public void map(LongWritable offset, Text
> val,OutputCollector<
> FloatWritable,Text> output, Reporter reporter)
>   throws IOException {
>output.collect(new FloatWritable(1), val); // change 1 to 1.0f,
> then it will work.
>}
>
> let me know the status after the change
>
>
> On Wed, May 30, 2012 at 1:27 AM, Mark question 
> wrote:
>
> > Hi guys, this is a very simple program, trying to use TextInputFormat and
> > SequenceFileOutputFormat. Should be easy, but I get the same error.
> >
> > Here is my configurations:
> >
> >conf.setMapperClass(myMapper.class);
> >conf.setMapOutputKeyClass(FloatWritable.class);
> >conf.setMapOutputValueClass(Text.class);
> >conf.setNumReduceTasks(0);
> >conf.setOutputKeyClass(FloatWritable.class);
> >conf.setOutputValueClass(Text.class);
> >
> >conf.setInputFormat(TextInputFormat.class);
> >conf.setOutputFormat(SequenceFileOutputFormat.class);
> >
> >TextInputFormat.addInputPath(conf, new Path(args[0]));
> >SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
> >
> >
> > myMapper class is:
> >
> > public class myMapper extends MapReduceBase implements
> > Mapper<LongWritable, Text, FloatWritable, Text> {
> >
> >public void map(LongWritable offset, Text val,
> > OutputCollector<FloatWritable, Text> output, Reporter reporter)
> >throws IOException {
> >output.collect(new FloatWritable(1), val);
> > }
> > }
> >
> > But I get the following error:
> >
> > 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
> > attempt_201205260045_0032_m_00_0, Status : FAILED
> > java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable
> is
> > not class org.apache.hadoop.io.FloatWritable
> >at
> > org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
> >at
> >
> >
> org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
> >at
> >
> >
> org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
> >at
> >
> >
> org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
> >at
> >
> >
> filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
> >at
> >
> >
> filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
> >at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
> >at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
> >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> >at java.security.AccessController.doPrivileged(Native Method)
> >at javax.security.auth.Subject.doAs(Subject.java:396)
> >at org.apache.hadoop.security.Use
> >
> > Where is the writing of LongWritable coming from ??
> >
> > Thank you,
> > Mark
> >
>


Re: How to add debugging to map- red code

2012-04-20 Thread Mark question
I'm interested in this too, but could you tell me where to apply the patch,
and is the following the right command to apply it:

 
patch < MAPREDUCE-336_0_20090818.patch

Thank you,
Mark
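
For what it's worth, Apache patches of that era are usually applied from the top of the source tree with -p0; a dry run first is a safe way to check the paths (a sketch, not a verified recipe for this particular patch):

cd hadoop-0.20.2                                       # top of the source tree
patch -p0 --dry-run < MAPREDUCE-336_0_20090818.patch   # check that the paths line up
patch -p0 < MAPREDUCE-336_0_20090818.patch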

On Fri, Apr 20, 2012 at 8:28 AM, Harsh J  wrote:

> Yes this is possible, and there's two ways to do this.
>
> 1. Use a distro/release that carries the
> https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let
> you avoid work (see 2, which is same as your idea)
>
> 2. Configure your implementation's logger object's level in the
> setup/setConf methods of the task, by looking at some conf prop to
> decide the level. This will work just as well - and will also avoid
> changing Hadoop's own Child log levels, unlike the (1) method.
>
> On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn 
> wrote:
> > Hi,
> > I m trying to find out best way to add debugging in map- red code.
> > I have System.out.println() statements that I keep on commenting and
> uncommenting so as not to increase stdout size
> >
> > But problem is anytime I need debug, I Hv to re-compile.
> >
> > If there a way, I can define log levels using log4j in map-red code and
> define log level as conf option ?
> >
> > Thanks,
> > JJ
> >
> > Sent from my iPhone
>
>
>
> --
> Harsh J
>
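
A minimal sketch of option (2) that Harsh describes above: read the desired level from a job property in configure() instead of recompiling. The property name "my.job.log.level" is illustrative, not a standard Hadoop setting:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DebuggableMapperBase extends MapReduceBase {
    private static final Logger LOG = Logger.getLogger(DebuggableMapperBase.class);

    @Override
    public void configure(JobConf conf) {
        // e.g. pass -Dmy.job.log.level=DEBUG on the command line via ToolRunner
        String level = conf.get("my.job.log.level", "INFO");
        LOG.setLevel(Level.toLevel(level, Level.INFO));
        LOG.debug("debug logging enabled for this task");
    }
}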


Has anyone installed HCE and built it successfully?

2012-04-18 Thread Mark question
Hey guys, I've been stuck with HCE installation for two days now and can't
figure out the problem.

The error I get from running (sh build.sh) is "cannot execute binary file".
I tried setting my JAVA_HOME and ANT_HOME manually and using the build.sh
script, with no luck. So please, if you've used HCE, could you share your
experience with me.

Thank you,
Mark


Re: Hadoop streaming or pipes ..

2012-04-06 Thread Mark question
Thanks all, and Charles, you pointed me to the Baidu slides titled
"Introduction to Hadoop C++
Extension" <http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5>,
which describe their experience; the sixth slide shows exactly what I was
looking for. It is still hard to manage memory with pipes, on top of the
lack of performance gains, hence the development of HCE.

Thanks,
Mark
On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl wrote:

> Also bear in mind that there is a kind of detour involved, in the sense
> that a pipes map must send key,value data back to the Java process and then
> to reduce (more or less).
> I think that the Hadoop C Extension (HCE, there is a patch) is supposed to
> be faster.
> Would be interested to know if the community has any experience with HCE
> performance.
> C
>
> On Apr 5, 2012, at 3:49 PM, Robert Evans  wrote:
>
> > Both streaming and pipes do very similar things.  They will fork/exec a
> separate process that is running whatever you want it to run.  The JVM that
> is running hadoop then communicates with this process to send the data over
> and get the processing results back.  The difference between streaming and
> pipes is that streaming uses stdin/stdout for this communication so
> preexisting processing like grep, sed and awk can be used here.  Pipes uses
> a custom protocol with a C++ library to communicate.  The C++ library is
> tagged with SWIG compatible data so that it can be wrapped to have APIs in
> other languages like python or perl.
> >
> > I am not sure what the performance difference is between the two, but in
> my own work I have seen a significant performance penalty from using either
> of them, because there is a somewhat large overhead of sending all of the
> data out to a separate process just to read it back in again.
> >
> > --Bobby Evans
> >
> >
> > On 4/5/12 1:54 PM, "Mark question"  wrote:
> >
> > Hi guys,
> >  quick question:
> >   Are there any performance gains from hadoop streaming or pipes over
> > Java? From what I've read, it's only to ease testing by using your
> favorite
> > language. So I guess it is eventually translated to bytecode then
> executed.
> > Is that true?
> >
> > Thank you,
> > Mark
> >
>
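
As a concrete illustration of the streaming mode Bobby describes above, any stdin/stdout program can serve as mapper or reducer. The paths and the streaming jar name below are illustrative and depend on the installation:

bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input  /user/mark/input \
    -output /user/mark/grep-out \
    -mapper  'grep -i hadoop' \
    -reducer 'wc -l'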


Re: Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Thanks for the response Robert ..  so the overhead will be in read/write
and communication. But is the new process spawned a JVM or a regular
process?

Thanks,
Mark

On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a
> separate process that is running whatever you want it to run.  The JVM that
> is running hadoop then communicates with this process to send the data over
> and get the processing results back.  The difference between streaming and
> pipes is that streaming uses stdin/stdout for this communication so
> preexisting processing like grep, sed and awk can be used here.  Pipes uses
> a custom protocol with a C++ library to communicate.  The C++ library is
> tagged with SWIG compatible data so that it can be wrapped to have APIs in
> other languages like python or perl.
>
> I am not sure what the performance difference is between the two, but in
> my own work I have seen a significant performance penalty from using either
> of them, because there is a somewhat large overhead of sending all of the
> data out to a separate process just to read it back in again.
>
> --Bobby Evans
>
>
> On 4/5/12 1:54 PM, "Mark question"  wrote:
>
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
>
> Thank you,
> Mark
>
>


Hadoop streaming or pipes ..

2012-04-05 Thread Mark question
Hi guys,
  quick question:
   Are there any performance gains from hadoop streaming or pipes over
Java? From what I've read, it's only to ease testing by using your favorite
language. So I guess it is eventually translated to bytecode then executed.
Is that true?

Thank you,
Mark


Hadoop pipes and streaming ..

2012-04-05 Thread Mark question
Hi guys,

   Two quick questions:
   1. Are there any performance gains from hadoop streaming or pipes? As
far as I have read, they exist to ease testing using your favorite language,
which I think implies that everything is eventually translated to bytecode
and executed.


Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Unfortunately, "public" didn't change my error ... Any other ideas? Has
anyone ran Hadoop on eclipse with custom sequence inputs ?

Thank you,
Mark

On Mon, Mar 5, 2012 at 9:58 AM, Mark question  wrote:

> Hi Madhu, it has the following line:
>
> TermDocFreqArrayWritable () {}
>
> but I'll try it with "public" access in case it's been called outside of
> my package.
>
> Thank you,
> Mark
>
>
> On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak  wrote:
>
>> Hi,
>>  Please make sure that your CustomWritable has a default constructor.
>>
>> On Sat, Mar 3, 2012 at 4:56 AM, Mark question 
>> wrote:
>>
>> > Hello,
>> >
>> >   I'm trying to debug my code through eclipse, which worked fine with
>> > given Hadoop applications (eg. wordcount), but as soon as I run it on my
>> > application with my custom sequence input file/types, I get:
>> > java.lang.RuntimeException: java.io.IOException: WritableName can't load
>> > class, at
>> > SequenceFile$Reader.getValueClass(SequenceFile.java)
>> >
>> > because my value class is custom. In other words, how can I add/build my
>> > CustomWritable class so that it is recognized alongside Hadoop's
>> > LongWritable, IntWritable, etc.?
>> >
>> > Has anyone used Eclipse?
>> >
>> > Mark
>> >
>>
>>
>>
>> --
>> Join me at http://hadoopworkshop.eventbrite.com/
>>
>
>
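
For anyone hitting the same thing, a minimal custom Writable sketch: the reader instantiates the value class reflectively, so both the class and its no-arg constructor must be public, and the class must be on the classpath of the Eclipse run configuration. The name follows the thread; the payload field is illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TermDocFreqArrayWritable implements Writable {
    private int[] freqs = new int[0];     // illustrative payload

    public TermDocFreqArrayWritable() {}  // public no-arg constructor is required

    public void write(DataOutput out) throws IOException {
        out.writeInt(freqs.length);
        for (int f : freqs) out.writeInt(f);
    }

    public void readFields(DataInput in) throws IOException {
        freqs = new int[in.readInt()];
        for (int i = 0; i < freqs.length; i++) freqs[i] = in.readInt();
    }
}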


Re: Custom Seq File Loader: ClassNotFoundException

2012-03-05 Thread Mark question
Hi Madhu, it has the following line:

TermDocFreqArrayWritable () {}

but I'll try it with "public" access in case it's been called outside of my
package.

Thank you,
Mark

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak  wrote:

> Hi,
>  Please make sure that your CustomWritable has a default constructor.
>
> On Sat, Mar 3, 2012 at 4:56 AM, Mark question  wrote:
>
> > Hello,
> >
> >   I'm trying to debug my code through eclipse, which worked fine with
> > given Hadoop applications (eg. wordcount), but as soon as I run it on my
> > application with my custom sequence input file/types, I get:
> > java.lang.RuntimeException: java.io.IOException: WritableName can't load
> > class, at
> > SequenceFile$Reader.getValueClass(SequenceFile.java)
> >
> > because my value class is custom. In other words, how can I add/build my
> > CustomWritable class so that it is recognized alongside Hadoop's
> > LongWritable, IntWritable, etc.?
> >
> > Has anyone used Eclipse?
> >
> > Mark
> >
>
>
>
> --
> Join me at http://hadoopworkshop.eventbrite.com/
>


Re: Streaming Hadoop using C

2012-03-01 Thread Mark question
Starfish worked great for wordcount .. I didn't run it on my application
because I have only map tasks.

Mark

On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl wrote:

> How was your experience of starfish?
> C
> On Mar 1, 2012, at 12:35 AM, Mark question wrote:
>
> > Thank you for your time and suggestions, I've already tried starfish, but
> > not jmap. I'll check it out.
> > Thanks again,
> > Mark
> >
> > On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl  >wrote:
> >
> >> I assume you have also just tried running locally and using the jdk
> >> performance tools (e.g. jmap) to gain insight by configuring hadoop to
> run
> >> absolute minimum number of tasks?
> >> Perhaps the discussion
> >>
> >>
> http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
> >> might be relevant?
> >> On Feb 29, 2012, at 3:53 PM, Mark question wrote:
> >>
> >>> I've used hadoop profiling (.prof) to show the stack trace but it was
> >> hard
> >>> to follow. jConsole locally since I couldn't find a way to set a port
> >>> number to child processes when running them remotely. Linux commands
> >>> (top,/proc), showed me that the virtual memory is almost twice as my
> >>> physical which means swapping is happening which is what I'm trying to
> >>> avoid.
> >>>
> >>> So basically, is there a way to assign a port to child processes to
> >> monitor
> >>> them remotely (asked before by Xun) or would you recommend another
> >>> monitoring tool?
> >>>
> >>> Thank you,
> >>> Mark
> >>>
> >>>
> >>> On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl <
> charles.ce...@gmail.com
> >>> wrote:
> >>>
> >>>> Mark,
> >>>> So if I understand, it is more the memory management that you are
> >>>> interested in, rather than a need to run an existing C or C++
> >> application
> >>>> in MapReduce platform?
> >>>> Have you done profiling of the application?
> >>>> C
> >>>> On Feb 29, 2012, at 2:19 PM, Mark question wrote:
> >>>>
> >>>>> Thanks Charles .. I'm running Hadoop for research to perform
> duplicate
> >>>>> detection methods. To go deeper, I need to understand what's slowing
> my
> >>>>> program, which usually starts with analyzing memory to predict best
> >> input
> >>>>> size for map task. So you're saying piping can help me control memory
> >>>> even
> >>>>> though it's running on VM eventually?
> >>>>>
> >>>>> Thanks,
> >>>>> Mark
> >>>>>
> >>>>> On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl <
> >> charles.ce...@gmail.com
> >>>>> wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>> Both streaming and pipes allow this, perhaps more so pipes at the
> >> level
> >>>> of
> >>>>>> the mapreduce task. Can you provide more details on the application?
> >>>>>> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
> >>>>>>
> >>>>>>> Hi guys, thought I should ask this before I use it ... will using C
> >>>> over
> >>>>>>> Hadoop give me the usual C memory management? For example,
> malloc() ,
> >>>>>>> sizeof() ? My guess is no since this all will eventually be turned
> >> into
> >>>>>>> bytecode, but I need more control on memory which obviously is hard
> >> for
> >>>>>> me
> >>>>>>> to do with Java.
> >>>>>>>
> >>>>>>> Let me know of any advantages you know about streaming in C over
> >>>> hadoop.
> >>>>>>> Thank you,
> >>>>>>> Mark
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>


Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thank you for your time and suggestions, I've already tried starfish, but
not jmap. I'll check it out.
Thanks again,
Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl wrote:

> I assume you have also just tried running locally and using the jdk
> performance tools (e.g. jmap) to gain insight by configuring hadoop to run
> absolute minimum number of tasks?
> Perhaps the discussion
>
> http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
> might be relevant?
> On Feb 29, 2012, at 3:53 PM, Mark question wrote:
>
> > I've used hadoop profiling (.prof) to show the stack trace but it was
> hard
> > to follow. jConsole locally since I couldn't find a way to set a port
> > number to child processes when running them remotely. Linux commands
> > (top,/proc), showed me that the virtual memory is almost twice as my
> > physical which means swapping is happening which is what I'm trying to
> > avoid.
> >
> > So basically, is there a way to assign a port to child processes to
> monitor
> > them remotely (asked before by Xun) or would you recommend another
> > monitoring tool?
> >
> > Thank you,
> > Mark
> >
> >
> > On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl  >wrote:
> >
> >> Mark,
> >> So if I understand, it is more the memory management that you are
> >> interested in, rather than a need to run an existing C or C++
> application
> >> in MapReduce platform?
> >> Have you done profiling of the application?
> >> C
> >> On Feb 29, 2012, at 2:19 PM, Mark question wrote:
> >>
> >>> Thanks Charles .. I'm running Hadoop for research to perform duplicate
> >>> detection methods. To go deeper, I need to understand what's slowing my
> >>> program, which usually starts with analyzing memory to predict best
> input
> >>> size for map task. So you're saying piping can help me control memory
> >> even
> >>> though it's running on VM eventually?
> >>>
> >>> Thanks,
> >>> Mark
> >>>
> >>> On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl <
> charles.ce...@gmail.com
> >>> wrote:
> >>>
> >>>> Mark,
> >>>> Both streaming and pipes allow this, perhaps more so pipes at the
> level
> >> of
> >>>> the mapreduce task. Can you provide more details on the application?
> >>>> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
> >>>>
> >>>>> Hi guys, thought I should ask this before I use it ... will using C
> >> over
> >>>>> Hadoop give me the usual C memory management? For example, malloc() ,
> >>>>> sizeof() ? My guess is no since this all will eventually be turned
> into
> >>>>> bytecode, but I need more control on memory which obviously is hard
> for
> >>>> me
> >>>>> to do with Java.
> >>>>>
> >>>>> Let me know of any advantages you know about streaming in C over
> >> hadoop.
> >>>>> Thank you,
> >>>>> Mark
> >>>>
> >>>>
> >>
> >>
>
>


Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
to follow. I used jConsole locally, since I couldn't find a way to set a port
number for child processes when running them remotely. Linux commands
(top, /proc) showed me that the virtual memory is almost twice my physical
memory, which means swapping is happening, which is what I'm trying to
avoid.

So basically, is there a way to assign a port to child processes to monitor
them remotely (asked before by Xun) or would you recommend another
monitoring tool?

Thank you,
Mark
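
One possible way to get a JMX port on child tasks (a sketch, not a tested recipe): add the standard JMX system properties to mapred.child.java.opts. With a fixed port only one child per node can bind it, so this is mainly workable when running a single local task; the port and heap size are illustrative:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -Dcom.sun.management.jmxremote.port=8008 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false</value>
</property>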


On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl wrote:

> Mark,
> So if I understand, it is more the memory management that you are
> interested in, rather than a need to run an existing C or C++ application
> in MapReduce platform?
> Have you done profiling of the application?
> C
> On Feb 29, 2012, at 2:19 PM, Mark question wrote:
>
> > Thanks Charles .. I'm running Hadoop for research to perform duplicate
> > detection methods. To go deeper, I need to understand what's slowing my
> > program, which usually starts with analyzing memory to predict best input
> > size for map task. So you're saying piping can help me control memory
> even
> > though it's running on VM eventually?
> >
> > Thanks,
> > Mark
> >
> > On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl  >wrote:
> >
> >> Mark,
> >> Both streaming and pipes allow this, perhaps more so pipes at the level
> of
> >> the mapreduce task. Can you provide more details on the application?
> >> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
> >>
> >>> Hi guys, thought I should ask this before I use it ... will using C
> over
> >>> Hadoop give me the usual C memory management? For example, malloc() ,
> >>> sizeof() ? My guess is no since this all will eventually be turned into
> >>> bytecode, but I need more control on memory which obviously is hard for
> >> me
> >>> to do with Java.
> >>>
> >>> Let me know of any advantages you know about streaming in C over
> hadoop.
> >>> Thank you,
> >>> Mark
> >>
> >>
>
>


Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thanks Charles .. I'm running Hadoop for research to perform duplicate
detection methods. To go deeper, I need to understand what's slowing my
program, which usually starts with analyzing memory to predict best input
size for map task. So you're saying piping can help me control memory even
though it's running on VM eventually?

Thanks,
Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl wrote:

> Mark,
> Both streaming and pipes allow this, perhaps more so pipes at the level of
> the mapreduce task. Can you provide more details on the application?
> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
>
> > Hi guys, thought I should ask this before I use it ... will using C over
> > Hadoop give me the usual C memory management? For example, malloc() ,
> > sizeof() ? My guess is no since this all will eventually be turned into
> > bytecode, but I need more control on memory which obviously is hard for
> me
> > to do with Java.
> >
> > Let me know of any advantages you know about streaming in C over hadoop.
> > Thank you,
> > Mark
>
>


Streaming Hadoop using C

2012-02-29 Thread Mark question
Hi guys, thought I should ask this before I use it ... will using C over
Hadoop give me the usual C memory management? For example, malloc() ,
sizeof() ? My guess is no since this all will eventually be turned into
bytecode, but I need more control on memory which obviously is hard for me
to do with Java.

Let me know of any advantages you know about streaming in C over hadoop.
Thank you,
Mark


Re: memory of mappers and reducers

2012-02-16 Thread Mark question
Great! thanks a lot Srinivas !
Mark

On Thu, Feb 16, 2012 at 7:02 AM, Srinivas Surasani  wrote:

> 1) Yes option 2 is enough.
> 2) Configuration variable "mapred.child.ulimit" can be used to control
> the maximum virtual memory of the child (map/reduce) processes.
>
> ** value of mapred.child.ulimit > value of mapred.child.java.opts
>
> On Thu, Feb 16, 2012 at 12:38 AM, Mark question 
> wrote:
> > Thanks for the reply Srinivas, so option 2 will be enough, however, when
> I
> > tried setting it to 512MB, I see through the system monitor that the map
> > task is given 275MB of real memory!!
> > Is that normal in hadoop to go over the upper bound of memory given by
> the
> > property mapred.child.java.opts.
> >
> > Mark
> >
> > On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani 
> wrote:
> >
> >> Hey Mark,
> >>
> >> Yes, you can limit the memory for each task with
> >> "mapred.child.java.opts" property. Set this to final if no developer
> >> has to change it .
> >>
> >> Little intro to "mapred.task.default.maxvmem"
> >>
> >> This property has to be set on both the JobTracker  for making
> >> scheduling decisions and on the TaskTracker nodes for the sake of
> >> memory management. If a job doesn't specify its virtual memory
> >> requirement by setting mapred.task.maxvmem to -1, tasks are assured a
> >> memory limit set to this property. This property is set to -1 by
> >> default. This value should in general be less than the cluster-wide
> >> configuration mapred.task.limit.maxvmem. If not or if it is not set,
> >> TaskTracker's memory management will be disabled and a scheduler's
> >> memory based scheduling decisions may be affected.
> >>
> >> On Wed, Feb 15, 2012 at 5:57 PM, Mark question 
> >> wrote:
> >> > Hi,
> >> >
> >> >  My question is what's the difference between the following two
> settings:
> >> >
> >> > 1. mapred.task.default.maxvmem
> >> > 2. mapred.child.java.opts
> >> >
> >> > The first one is used by the TT to monitor the memory usage of tasks,
> >> while
> >> > the second one is the maximum heap space assigned for each task. I
> want
> >> to
> >> > limit each task to use upto say 100MB of memory. Can I use only #2 ??
> >> >
> >> > Thank you,
> >> > Mark
> >>
> >>
> >>
> >> --
> >> -- Srinivas
> >> srini...@cloudwick.com
> >>
>
>
>
> --
> -- Srinivas
> srini...@cloudwick.com
>
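
A sketch of the two settings discussed above, with illustrative values; mapred.child.ulimit is expressed in kilobytes and should be larger than the heap given in mapred.child.java.opts:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <final>true</final>
</property>
<property>
  <!-- virtual memory limit for the child process, in KB (~1 GB here) -->
  <name>mapred.child.ulimit</name>
  <value>1048576</value>
</property>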


Re: memory of mappers and reducers

2012-02-15 Thread Mark question
Thanks for the reply Srinivas. So option 2 will be enough; however, when I
tried setting it to 512MB, I see through the system monitor that the map
task is given 275MB of real memory!!
Is it normal in Hadoop to go over the upper bound of memory given by the
property mapred.child.java.opts?

Mark

On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani  wrote:

> Hey Mark,
>
> Yes, you can limit the memory for each task with
> "mapred.child.java.opts" property. Set this to final if no developer
> has to change it .
>
> Little intro to "mapred.task.default.maxvmem"
>
> This property has to be set on both the JobTracker  for making
> scheduling decisions and on the TaskTracker nodes for the sake of
> memory management. If a job doesn't specify its virtual memory
> requirement by setting mapred.task.maxvmem to -1, tasks are assured a
> memory limit set to this property. This property is set to -1 by
> default. This value should in general be less than the cluster-wide
> configuration mapred.task.limit.maxvmem. If not or if it is not set,
> TaskTracker's memory management will be disabled and a scheduler's
> memory based scheduling decisions may be affected.
>
> On Wed, Feb 15, 2012 at 5:57 PM, Mark question 
> wrote:
> > Hi,
> >
> >  My question is what's the difference between the following two settings:
> >
> > 1. mapred.task.default.maxvmem
> > 2. mapred.child.java.opts
> >
> > The first one is used by the TT to monitor the memory usage of tasks,
> while
> > the second one is the maximum heap space assigned for each task. I want
> to
> > limit each task to use upto say 100MB of memory. Can I use only #2 ??
> >
> > Thank you,
> > Mark
>
>
>
> --
> -- Srinivas
> srini...@cloudwick.com
>


memory of mappers and reducers

2012-02-15 Thread Mark question
Hi,

  My question is what's the difference between the following two settings:

1. mapred.task.default.maxvmem
2. mapred.child.java.opts

The first one is used by the TT to monitor the memory usage of tasks, while
the second one is the maximum heap space assigned for each task. I want to
limit each task to use up to, say, 100MB of memory. Can I use only #2?

Thank you,
Mark


Namenode no lease exception ... what does it mean?

2012-02-09 Thread Mark question
Hi guys,

Even though there is enough space on HDFS as shown by -report ... I get the
following two errors, the first shown in
the log of a datanode and the second in the NameNode log:

1)2012-02-09 10:18:37,519 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_8448117986822173955 is added to invalidSet
of 10.0.40.33:50010

2) 2012-02-09 10:18:41,788 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: addStoredBlock request received for
blk_132544693472320409_2778 on 10.0.40.12:50010 size 67108864 But it does
not belong to any file.
2012-02-09 10:18:41,789 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 12123, call
addBlock(/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247,
DFSClient_attempt_201202090811_0005_m_000247_0) from 10.0.40.12:34103:
error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No
lease on
/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247
File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0
does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247
File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0
does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)

Any other ways to debug this?

Thanks,
Mark


Re: Too many open files Error

2012-01-27 Thread Mark question
Hi Harsh and Idris ... so the only drawback of increasing the xcievers value
is the memory issue? In that case I'll set it to a value that doesn't
exhaust memory ...
Thanks,
Mark
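
A bounded setting along the lines Harsh recommends below (2048-4096 rather than 1M), in hdfs-site.xml on the datanodes; note the property keeps the historical misspelling "xcievers":

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>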

On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali  wrote:

> Hi Mark,
>
> As Harsh pointed out it is not good idea to increase the Xceiver count to
> arbitrarily higher value, I suggested to increase the xceiver count just to
> unblock execution of your program temporarily.
>
> Thanks,
> -Idris
>
> On Fri, Jan 27, 2012 at 10:39 AM, Harsh J  wrote:
>
> > You are technically allowing DN to run 1 million block transfer
> > (in/out) threads by doing that. It does not take up resources by
> > default sure, but now it can be abused with requests to make your DN
> > run out of memory and crash cause its not bound to proper limits now.
> >
> > On Fri, Jan 27, 2012 at 5:49 AM, Mark question 
> > wrote:
> > > Harsh, could you explain briefly why is 1M setting for xceiver is bad?
> > the
> > > job is working now ...
> > > about the ulimit -u it shows  200703, so is that why connection is
> reset
> > by
> > > peer? How come it's working with the xceiver modification?
> > >
> > > Thanks,
> > > Mark
> > >
> > >
> > > On Thu, Jan 26, 2012 at 12:21 PM, Harsh J  wrote:
> > >
> > >> Agree with Raj V here - Your problem should not be the # of transfer
> > >> threads nor the number of open files given that stacktrace.
> > >>
> > >> And the values you've set for the transfer threads are far beyond
> > >> recommendations of 4k/8k - I would not recommend doing that. Default
> > >> in 1.0.0 is 256 but set it to 2048/4096, which are good value to have
> > >> when noticing increased HDFS load, or when running services like
> > >> HBase.
> > >>
> > >> You should instead focus on why its this particular job (or even
> > >> particular task, which is important to notice) that fails, and not
> > >> other jobs (or other task attempts).
> > >>
> > >> On Fri, Jan 27, 2012 at 1:10 AM, Raj V  wrote:
> > >> > Mark
> > >> >
> > >> > You have this "Connection reset by peer". Why do you think this
> > problem
> > >> is related to too many open files?
> > >> >
> > >> > Raj
> > >> >
> > >> >
> > >> >
> > >> >>____
> > >> >> From: Mark question 
> > >> >>To: common-user@hadoop.apache.org
> > >> >>Sent: Thursday, January 26, 2012 11:10 AM
> > >> >>Subject: Re: Too many open files Error
> > >> >>
> > >> >>Hi again,
> > >> >>I've tried :
> > >> >><property>
> > >> >>  <name>dfs.datanode.max.xcievers</name>
> > >> >>  <value>1048576</value>
> > >> >></property>
> > >> >>but I'm still getting the same error ... how high can I go??
> > >> >>
> > >> >>Thanks,
> > >> >>Mark
> > >> >>
> > >> >>
> > >> >>
> > >> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question  >
> > >> wrote:
> > >> >>
> > >> >>> Thanks for the reply I have nothing about
> > >> dfs.datanode.max.xceivers on
> > >> >>> my hdfs-site.xml so hopefully this would solve the problem and
> about
> > >> the
> > >> >>> ulimit -n , I'm running on an NFS cluster, so usually I just start
> > >> Hadoop
> > >> >>> with a single bin/start-all.sh ... Do you think I can add it by
> > >> >>> bin/Datanode -ulimit n ?
> > >> >>>
> > >> >>> Mark
> > >> >>>
> > >> >>>
> > >> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn <
> > mapred.le...@gmail.com
> > >> >wrote:
> > >> >>>
> > >> >>>> U need to set ulimit -n  on datanode and restart
> > >> datanodes.
> > >> >>>>
> > >> >>>> Sent from my iPhone
> > >> >>>>
> > >> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali 
> > wrote:
> > >> >>>>
> > >> >>>> > Hi Mark,
> >>>> >>

Re: Too many open files Error

2012-01-26 Thread Mark question
Harsh, could you explain briefly why a 1M setting for xceivers is bad? The
job is working now ...
About ulimit -u: it shows 200703, so is that why the connection is reset by
peer? How come it's working with the xceiver modification?

Thanks,
Mark


On Thu, Jan 26, 2012 at 12:21 PM, Harsh J  wrote:

> Agree with Raj V here - Your problem should not be the # of transfer
> threads nor the number of open files given that stacktrace.
>
> And the values you've set for the transfer threads are far beyond
> recommendations of 4k/8k - I would not recommend doing that. Default
> in 1.0.0 is 256 but set it to 2048/4096, which are good value to have
> when noticing increased HDFS load, or when running services like
> HBase.
>
> You should instead focus on why its this particular job (or even
> particular task, which is important to notice) that fails, and not
> other jobs (or other task attempts).
>
> On Fri, Jan 27, 2012 at 1:10 AM, Raj V  wrote:
> > Mark
> >
> > You have this "Connection reset by peer". Why do you think this problem
> is related to too many open files?
> >
> > Raj
> >
> >
> >
> >>
> >> From: Mark question 
> >>To: common-user@hadoop.apache.org
> >>Sent: Thursday, January 26, 2012 11:10 AM
> >>Subject: Re: Too many open files Error
> >>
> >>Hi again,
> >>I've tried :
> >><property>
> >>  <name>dfs.datanode.max.xcievers</name>
> >>  <value>1048576</value>
> >></property>
> >>but I'm still getting the same error ... how high can I go??
> >>
> >>Thanks,
> >>Mark
> >>
> >>
> >>
> >>On Thu, Jan 26, 2012 at 9:29 AM, Mark question 
> wrote:
> >>
> >>> Thanks for the reply I have nothing about
> dfs.datanode.max.xceivers on
> >>> my hdfs-site.xml so hopefully this would solve the problem and about
> the
> >>> ulimit -n , I'm running on an NFS cluster, so usually I just start
> Hadoop
> >>> with a single bin/start-all.sh ... Do you think I can add it by
> >>> bin/Datanode -ulimit n ?
> >>>
> >>> Mark
> >>>
> >>>
> >>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn  >wrote:
> >>>
> >>>> U need to set ulimit -n  on datanode and restart
> datanodes.
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>> On Jan 26, 2012, at 6:06 AM, Idris Ali  wrote:
> >>>>
> >>>> > Hi Mark,
> >>>> >
> >>>> > On a lighter note what is the count of xceivers?
> >>>> dfs.datanode.max.xceivers
> >>>> > property in hdfs-site.xml?
> >>>> >
> >>>> > Thanks,
> >>>> > -idris
> >>>> >
> >>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <
> >>>> michael_se...@hotmail.com>wrote:
> >>>> >
> >>>> >> Sorry going from memory...
> >>>> >> As user Hadoop or mapred or hdfs what do you see when you do a
> ulimit
> >>>> -a?
> >>>> >> That should give you the number of open files allowed by a single
> >>>> user...
> >>>> >>
> >>>> >>
> >>>> >> Sent from a remote device. Please excuse any typos...
> >>>> >>
> >>>> >> Mike Segel
> >>>> >>
> >>>> >> On Jan 26, 2012, at 5:13 AM, Mark question 
> >>>> wrote:
> >>>> >>
> >>>> >>> Hi guys,
> >>>> >>>
> >>>> >>>  I get this error from a job trying to process 3Million records.
> >>>> >>>
> >>>> >>> java.io.IOException: Bad connect ack with firstBadLink
> >>>> >> 192.168.1.20:50010
> >>>> >>>   at
> >>>> >>>
> >>>> >>
> >>>>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> >>>> >>>   at
> >>>> >>>
> >>>> >>
> >>>>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> >>>> >>>   at
> >>>> >>>
> >>>> >>
> >>>>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2

Re: Too many open files Error

2012-01-26 Thread Mark question
No worries ... thanks guys .. I set it to 100M  and it worked :)
Thanks,
Mark

On Thu, Jan 26, 2012 at 11:10 AM, Mark question  wrote:

> Hi again,
> I've tried :
> <property>
>   <name>dfs.datanode.max.xcievers</name>
>   <value>1048576</value>
> </property>
> but I'm still getting the same error ... how high can I go??
>
> Thanks,
> Mark
>
>
>
>
> On Thu, Jan 26, 2012 at 9:29 AM, Mark question wrote:
>
>> Thanks for the reply I have nothing about dfs.datanode.max.xceivers
>> on my hdfs-site.xml so hopefully this would solve the problem and about the
>> ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop
>> with a single bin/start-all.sh ... Do you think I can add it by
>> bin/Datanode -ulimit n ?
>>
>> Mark
>>
>>
>> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn wrote:
>>
>>> U need to set ulimit -n  on datanode and restart datanodes.
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 26, 2012, at 6:06 AM, Idris Ali  wrote:
>>>
>>> > Hi Mark,
>>> >
>>> > On a lighter note what is the count of xceivers?
>>> dfs.datanode.max.xceivers
>>> > property in hdfs-site.xml?
>>> >
>>> > Thanks,
>>> > -idris
>>> >
>>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <
>>> michael_se...@hotmail.com>wrote:
>>> >
>>> >> Sorry going from memory...
>>> >> As user Hadoop or mapred or hdfs what do you see when you do a ulimit
>>> -a?
>>> >> That should give you the number of open files allowed by a single
>>> user...
>>> >>
>>> >>
>>> >> Sent from a remote device. Please excuse any typos...
>>> >>
>>> >> Mike Segel
>>> >>
>>> >> On Jan 26, 2012, at 5:13 AM, Mark question 
>>> wrote:
>>> >>
>>> >>> Hi guys,
>>> >>>
>>> >>>  I get this error from a job trying to process 3Million records.
>>> >>>
>>> >>> java.io.IOException: Bad connect ack with firstBadLink
>>> >> 192.168.1.20:50010
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>>> >>>
>>> >>> When I checked the logfile of the datanode-20, I see :
>>> >>>
>>> >>> 2012-01-26 03:00:11,827 ERROR
>>> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> DatanodeRegistration(
>>> >>> 192.168.1.20:50010,
>>> >> storageID=DS-97608578-192.168.1.20-50010-1327575205369,
>>> >>> infoPort=50075, ipcPort=50020):DataXceiver
>>> >>> java.io.IOException: Connection reset by peer
>>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>> >>>   at
>>> >>>
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>> >>>   at
>>> >>>
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>>> >>>   at
>>> >>>
>>> >>
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>>> >>>   at java.lang.Thread.run(Thread.java:662)
>>> >>>
>>> >>>
>>> >>> Which is because I'm running 10 maps per taskTracker on a 20 node
>>> >> cluster,
>>> >>> each map opens about 300 files so that should give 6000 opened files
>>> at
>>> >> the
>>> >>> same time ... why is this a problem? the maximum # of files per
>>> process
>>> >> on
>>> >>> one machine is:
>>> >>>
>>> >>> cat /proc/sys/fs/file-max   ---> 2403545
>>> >>>
>>> >>>
>>> >>> Any suggestions?
>>> >>>
>>> >>> Thanks,
>>> >>> Mark
>>> >>
>>>
>>
>>
>


Re: Too many open files Error

2012-01-26 Thread Mark question
Hi again,
I've tried:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1048576</value>
</property>
but I'm still getting the same error ... how high can I go??

Thanks,
Mark



On Thu, Jan 26, 2012 at 9:29 AM, Mark question  wrote:

> Thanks for the reply I have nothing about dfs.datanode.max.xceivers on
> my hdfs-site.xml so hopefully this would solve the problem and about the
> ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop
> with a single bin/start-all.sh ... Do you think I can add it by
> bin/Datanode -ulimit n ?
>
> Mark
>
>
> On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn wrote:
>
>> U need to set ulimit -n  on datanode and restart datanodes.
>>
>> Sent from my iPhone
>>
>> On Jan 26, 2012, at 6:06 AM, Idris Ali  wrote:
>>
>> > Hi Mark,
>> >
>> > On a lighter note what is the count of xceivers?
>> dfs.datanode.max.xceivers
>> > property in hdfs-site.xml?
>> >
>> > Thanks,
>> > -idris
>> >
>> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel <
>> michael_se...@hotmail.com>wrote:
>> >
>> >> Sorry going from memory...
>> >> As user Hadoop or mapred or hdfs what do you see when you do a ulimit
>> -a?
>> >> That should give you the number of open files allowed by a single
>> user...
>> >>
>> >>
>> >> Sent from a remote device. Please excuse any typos...
>> >>
>> >> Mike Segel
>> >>
>> >> On Jan 26, 2012, at 5:13 AM, Mark question 
>> wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>>  I get this error from a job trying to process 3Million records.
>> >>>
>> >>> java.io.IOException: Bad connect ack with firstBadLink
>> >> 192.168.1.20:50010
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>> >>>
>> >>> When I checked the logfile of the datanode-20, I see :
>> >>>
>> >>> 2012-01-26 03:00:11,827 ERROR
>> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> >>> 192.168.1.20:50010,
>> >> storageID=DS-97608578-192.168.1.20-50010-1327575205369,
>> >>> infoPort=50075, ipcPort=50020):DataXceiver
>> >>> java.io.IOException: Connection reset by peer
>> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>> >>>   at
>> >>>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>> >>>   at
>> >>>
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>> >>>   at
>> >>>
>> >>
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>> >>>   at java.lang.Thread.run(Thread.java:662)
>> >>>
>> >>>
>> >>> Which is because I'm running 10 maps per taskTracker on a 20 node
>> >> cluster,
>> >>> each map opens about 300 files so that should give 6000 opened files
>> at
>> >> the
>> >>> same time ... why is this a problem? the maximum # of files per
>> process
>> >> on
>> >>> one machine is:
>> >>>
>> >>> cat /proc/sys/fs/file-max   ---> 2403545
>> >>>
>> >>>
>> >>> Any suggestions?
>> >>>
>> >>> Thanks,
>> >>> Mark
>> >>
>>
>
>


Re: Too many open files Error

2012-01-26 Thread Mark question
Thanks for the reply. I have nothing about dfs.datanode.max.xceivers in
my hdfs-site.xml, so hopefully this will solve the problem. About
ulimit -n: I'm running on an NFS cluster, so usually I just start Hadoop
with a single bin/start-all.sh ... Do you think I can add it via
bin/Datanode -ulimit n ?

Mark
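
For the ulimit part: as far as I know the datanode script itself has no ulimit flag; the limit comes from the shell environment the daemons start in, so the usual route is to raise the per-user nofile limit on every node (requires root; the exact file varies by distro) and log in again before bin/start-all.sh:

# /etc/security/limits.conf on each datanode (illustrative values)
#   hadoop  soft  nofile  16384
#   hadoop  hard  nofile  16384
# then, in the shell that launches the daemons, verify:
ulimit -n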

On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn wrote:

> U need to set ulimit -n  on datanode and restart datanodes.
>
> Sent from my iPhone
>
> On Jan 26, 2012, at 6:06 AM, Idris Ali  wrote:
>
> > Hi Mark,
> >
> > On a lighter note what is the count of xceivers?
> dfs.datanode.max.xceivers
> > property in hdfs-site.xml?
> >
> > Thanks,
> > -idris
> >
> > On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel  >wrote:
> >
> >> Sorry going from memory...
> >> As user Hadoop or mapred or hdfs what do you see when you do a ulimit
> -a?
> >> That should give you the number of open files allowed by a single
> user...
> >>
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On Jan 26, 2012, at 5:13 AM, Mark question  wrote:
> >>
> >>> Hi guys,
> >>>
> >>>  I get this error from a job trying to process 3Million records.
> >>>
> >>> java.io.IOException: Bad connect ack with firstBadLink
> >> 192.168.1.20:50010
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
> >>>
> >>> When I checked the logfile of the datanode-20, I see :
> >>>
> >>> 2012-01-26 03:00:11,827 ERROR
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> >>> 192.168.1.20:50010,
> >> storageID=DS-97608578-192.168.1.20-50010-1327575205369,
> >>> infoPort=50075, ipcPort=50020):DataXceiver
> >>> java.io.IOException: Connection reset by peer
> >>>   at sun.nio.ch.FileDispatcher.read0(Native Method)
> >>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
> >>>   at sun.nio.ch.IOUtil.read(IOUtil.java:175)
> >>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> >>>   at
> >>>
> >>
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
> >>>   at
> >>>
> >>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
> >>>   at
> >>>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> >>>   at
> >>>
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> >>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>>   at java.io.DataInputStream.read(DataInputStream.java:132)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
> >>>   at
> >>>
> >>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >>>   at java.lang.Thread.run(Thread.java:662)
> >>>
> >>>
> >>> Which is because I'm running 10 maps per taskTracker on a 20 node
> >> cluster,
> >>> each map opens about 300 files so that should give 6000 opened files at
> >> the
> >>> same time ... why is this a problem? the maximum # of files per process
> >> on
> >>> one machine is:
> >>>
> >>> cat /proc/sys/fs/file-max   ---> 2403545
> >>>
> >>>
> >>> Any suggestions?
> >>>
> >>> Thanks,
> >>> Mark
> >>
>


Re: connection between slaves and master

2012-01-11 Thread Mark question
exactly right. Thanks Praveen.
Mark

On Tue, Jan 10, 2012 at 1:54 AM, Praveen Sripati
wrote:

> Mark,
>
> > [mark@node67 ~]$ telnet node77
>
> You need to specify the port number along with the server name like `telnet
> node77 1234`.
>
> > 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
>
> Slaves are not able to connect to the master. The configurations `
> fs.default.name` and `mapred.job.tracker` should point to the master and
> not to localhost when the master and slaves are on different machines.
>
> Praveen
>
> On Mon, Jan 9, 2012 at 11:41 PM, Mark question 
> wrote:
>
> > Hello guys,
> >
> >  I'm requesting from a PBS scheduler a number of  machines to run Hadoop
> > and even though all hadoop daemons start normally on the master and
> slaves,
> > the slaves don't have worker tasks in them. Digging into that, there
> seems
> > to be some blocking between nodes (?) don't know how to describe it
> except
> > that on slave if I "telnet master-node"  it should be able to connect,
> but
> > I get this error:
> >
> > [mark@node67 ~]$ telnet node77
> >
> > Trying 192.168.1.77...
> > telnet: connect to address 192.168.1.77: Connection refused
> > telnet: Unable to connect to remote host: Connection refused
> >
> > The log at the slave nodes shows the same thing, even though it has
> > datanode and tasktracker started from the maste (?):
> >
> > 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
> > 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
> > 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
> > 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
> > 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
> > 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
> > 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
> > 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
> > 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
> > 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying
> > connect
> > to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
> > 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at
> > localhost/
> > 127.0.0.1:12123 not available yet, Z...
> >
> >  Any suggestions of what I can do?
> >
> > Thanks,
> > Mark
> >
>


connection between slaves and master

2012-01-09 Thread Mark question
Hello guys,

 I'm requesting a number of machines from a PBS scheduler to run Hadoop,
and even though all Hadoop daemons start normally on the master and slaves,
the slaves never get any worker tasks. Digging into that, there seems to be
some blocking between nodes (?); I don't know how to describe it except
that on a slave, "telnet master-node" should be able to connect, but
I get this error:

[mark@node67 ~]$ telnet node77

Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: Connection refused
telnet: Unable to connect to remote host: Connection refused

The log at the slave nodes shows the same thing, even though the datanode
and tasktracker were started from the master (?):

2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/
127.0.0.1:12123 not available yet, Z...

 Any suggestions of what I can do?

Thanks,
Mark


Re: Expected file://// error

2012-01-08 Thread Mark question
It's already in there ... don't worry about it, I'm submitting the first
job then the second job manually for now.

export CLASSPATH=/home/mark/hadoop-0.20.2/conf:$CLASSPATH
export CLASSPATH=/home/mark/hadoop-0.20.2/lib:$CLASSPATH
export
CLASSPATH=/home/mark/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/mark/hadoop-0.20.2/lib/commons-cli-1.2.jar:$CLASSPATH

Thank you for your time,
Mark
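
A hedged alternative to relying on the shell CLASSPATH for the configs (Joey's point,
quoted below) is to add the config files to the Configuration explicitly. The paths are
the ones listed above and the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithExplicitConf {
    public static JobConf buildConf() {
        Configuration base = new Configuration();
        // `bin/hadoop jar ...` normally puts conf/ on the classpath for you; adding the
        // files by hand is a fallback when the job is submitted from a plain java command.
        base.addResource(new Path("/home/mark/hadoop-0.20.2/conf/core-site.xml"));
        base.addResource(new Path("/home/mark/hadoop-0.20.2/conf/hdfs-site.xml"));
        base.addResource(new Path("/home/mark/hadoop-0.20.2/conf/mapred-site.xml"));
        return new JobConf(base, SubmitWithExplicitConf.class);
    }
}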

On Sun, Jan 8, 2012 at 12:22 PM, Joey Echeverria  wrote:

> What's the classpath of the java program submitting the job? It has to
> have the configuration directory (e.g. /opt/hadoop/conf) in there or
> it won't pick up the correct configs.
>
> -Joey
>
> On Sun, Jan 8, 2012 at 12:59 PM, Mark question 
> wrote:
> > mapred-site.xml:
> > 
> >  
> >mapred.job.tracker
> >localhost:10001
> >  
> >  
> > mapred.child.java.opts
> > -Xmx1024m
> >  
> >  
> > mapred.tasktracker.map.tasks.maximum
> > 10
> >  
> > 
> >
> >
> > Command is running a script which runs a java program that submit two
> jobs
> > consecutively insuring waiting for the first job ( is working on my
> laptop
> > but on the cluster).
> >
> > On the cluster I get:
> >
> >>
> >>
> hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
> >> > expected: file:///
> >> >at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
> >> >at
> >> >
> >>
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
> >> >at
> >> >
> >>
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
> >> >at
> >> >
> >>
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
> >> >at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
> >> >at
> >> >
> org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
> >> >at
> >> >
> org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
> >> >at
> >> >
> org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
> >> >at
> >> >
> >>
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
> >> >at
> >> >
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
> >> >at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >> >at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >> >at Main.run(Main.java:304)
> >> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> >at Main.main(Main.java:53)
> >> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >at
> >> >
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> >at
> >> >
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >at java.lang.reflect.Method.invoke(Method.java:597)
> >> >at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>
> >
> >
> > The first job output is :
> > folder>_logs 
> > folder>part-0
> >
> > I'm set "folder" as input path to the next job, could it be from the
> "_logs
> > ..." ? but again it worked on my laptop under hadoop-0.21.0. The cluster
> > has hadoop-0.20.2.
> >
> > Thanks,
> > Mark
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


Re: Expected file://// error

2012-01-08 Thread Mark question
mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
</configuration>


The command runs a script which runs a Java program that submits two jobs
consecutively, waiting for the first job to finish (this works on my laptop
but not on the cluster).

On the cluster I get:

>
> hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
> > expected: file:///
> >at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
> >at
> >
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
> >at
> >
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
> >at
> >
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
> >at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
> >at
> >
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
> >at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
> >at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >at Main.run(Main.java:304)
> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >at Main.main(Main.java:53)
> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >at java.lang.reflect.Method.invoke(Method.java:597)
> >at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>


The first job output is :
folder>_logs 
folder>part-0

I'm set "folder" as input path to the next job, could it be from the "_logs
..." ? but again it worked on my laptop under hadoop-0.21.0. The cluster
has hadoop-0.20.2.

Thanks,
Mark


Re: Expected file://// error

2012-01-06 Thread Mark question
Hi Harsh, thanks for the reply, you were right, I didn't have hdfs://, but
even after inserting it I still get the error.

java.lang.IllegalArgumentException: Wrong FS:
hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar,
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
at
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at Main.run(Main.java:304)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at Main.main(Main.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Mark
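
As a small follow-up sketch of what "expected: file:///" usually indicates: the submitting
JVM resolved its default filesystem to the local one. This only prints what that JVM
actually sees; the localhost:12123 value is just the one from the logs above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must carry the scheme: "hdfs://localhost:12123", not "localhost:12123".
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        FileSystem fs = FileSystem.get(conf);
        System.out.println("default FS      = " + fs.getUri()); // expect hdfs://..., not file:///
        // Qualifying a path shows which filesystem it will actually hit.
        System.out.println(fs.makeQualified(new Path("/tmp/hadoop-mark/mapred/system")));
    }
}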

On Fri, Jan 6, 2012 at 6:02 AM, Harsh J  wrote:

> What is your fs.default.name set to? It should be set to hdfs://host:port
> and not just host:port. Can you ensure this and retry?
>
> On 06-Jan-2012, at 5:45 PM, Mark question wrote:
>
> > Hello,
> >
> >  I'm running two jobs on Hadoop-0.20.2 consecutively, such that the
> second
> > one reads the output of the first which would look like:
> >
> > outputPath/part-0
> > outputPath/_logs 
> >
> > But I get the error:
> >
> > 12/01/06 03:29:34 WARN fs.FileSystem: "localhost:12123" is a deprecated
> > filesystem name. Use "hdfs://localhost:12123/" instead.
> > java.lang.IllegalArgumentException: Wrong FS:
> >
> hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar,
> > expected: file:///
> >at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
> >at
> >
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
> >at
> >
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
> >at
> >
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
> >at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
> >at
> > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
> >at
> >
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
> >at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
> >at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >at Main.run(Main.java:301)
> >at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >at Main.main(Main.java:53)
> >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >at java.lang.reflect.Method.invoke(Method.java:597)
> >at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> >
> > This looks similar to the problem described here but for older versions
> > than mine:  https://issues.apache.org/jira/browse/HADOOP-5259
> >
> > I tried applying that patch, but probably due to different versions
> didn't
> > work. Can anyone help?
> > Thank you,
> > Mark
>
>


Expected file://// error

2012-01-06 Thread Mark question
Hello,

  I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second
one reads the output of the first which would look like:

outputPath/part-0
outputPath/_logs 

But I get the error:

12/01/06 03:29:34 WARN fs.FileSystem: "localhost:12123" is a deprecated
filesystem name. Use "hdfs://localhost:12123/" instead.
java.lang.IllegalArgumentException: Wrong FS:
hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar,
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
at
org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
at
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at Main.run(Main.java:301)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at Main.main(Main.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


This looks similar to the problem described here but for older versions
than mine:  https://issues.apache.org/jira/browse/HADOOP-5259

I tried applying that patch, but probably due to different versions didn't
work. Can anyone help?
Thank you,
Mark


Connection reset by peer Error

2011-11-20 Thread Mark question
Hi,

I've been getting this error multiple times now. The namenode mentions
something about the peer resetting the connection, but I don't know why this
is happening, since I'm running on a single machine with 12 cores. Any
ideas?

The job starts running normally; it contains about 200 mappers, each of which
opens 200 files (one file at a time inside the mapper code), then:
..
.
...
11/11/20 06:27:52 INFO mapred.JobClient:  map 55% reduce 0%
11/11/20 06:28:38 INFO mapred.JobClient:  map 56% reduce 0%
11/11/20 06:29:18 INFO mapred.JobClient: Task Id :
attempt_20200450_0001_m_000219_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/mark/output/_temporary/_attempt_20200450_0001_m_000219_0/part-00219
could only be replicated to 0 nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

   ...
   ...

 Namenode Log:

2011-11-20 06:29:51,964 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_al dst=null
perm=null
2011-11-20 06:29:52,039 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1 cmd=open src=/user/mark/input/G13_12_aq dst=null
perm=null
2011-11-20 06:29:52,178 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_an dst=null
perm=null
2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to
blk_-2308051162058662821_1643 size 20024660
2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/user/mark/output/_temporary/_attempt_20200450_0001_m_000222_0/part-00222
is closed by DFSClient_attempt_20200450_0001_m_000222_0
2011-11-20 06:29:52,351 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to
blk_9206172750679206987_1639 size 51330092
2011-11-20 06:29:52,352 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/user/mark/output/_temporary/_attempt_20200450_0001_m_000226_0/part-00226
is closed by DFSClient_attempt_20200450_0001_m_000226_0
2011-11-20 06:29:52,416 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb
ip=/127.0.0.1 cmd=create
src=/user/mark/output/_temporary/_attempt_20200450_0001_m_000223_2/part-00223
dst=null perm=mark:supergroup:rw-r--r--
2011-11-20 06:29:52,430 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 12123: readAndProcess threw exception
java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211)
at org.apache.hadoop.ipc.Server

reading Hadoop output messages

2011-11-16 Thread Mark question
Hi all,

   I'm wondering if there is a way to get output messages that are printed
from the main class of a Hadoop job.

Usually "2>&1>> out.log"  would wok, but in this case it only saves the
output messages printed in the main class before  starting the job.
What I want is the output messages that are printed also in the main class
but after the job is done.

For example: in my main class:

try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } // submit job to JT
sLogger.info("\n Job Finished in " + (System.currentTimeMillis() -
startTime) / 60000.0 + " Minutes.");

I can't see the last message unless I see the screen. Any ideas?

Thank you,
Mark
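
Two notes that may help here. First, with the redirection written as "2>&1 >> out.log",
stderr still goes to the terminal because shell redirections apply left to right; ">> out.log 2>&1"
is usually the form that captures both streams. Second, independently of the shell, the logger
itself can be pointed at a file so the post-job message survives. A minimal log4j sketch
(logger name, pattern, and file name are placeholders):

import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class DriverLogging {
    private static final Logger sLogger = Logger.getLogger(DriverLogging.class);

    public static void main(String[] args) throws Exception {
        // Send this driver's log records to a file, regardless of shell redirection.
        sLogger.addAppender(new FileAppender(
                new PatternLayout("%d{ISO8601} %-5p %c - %m%n"), "driver-out.log", true));
        long startTime = System.currentTimeMillis();
        // ... configure and run the job here, e.g. JobClient.runJob(conf); ...
        sLogger.info("Job finished in "
                + (System.currentTimeMillis() - startTime) / 60000.0 + " minutes.");
    }
}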


Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2

2011-10-24 Thread Mark question
Thank you, I'll try it.
Mark

On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui wrote:

> Mark,
>
> We figured it out. It's an issue with RedHat's IPTables. You have to open
> up
> those ports:
>
>
> vim /etc/sysconfig/iptables
>
> Make the file look like this
>
> # Firewall configuration written by system-config-firewall
> # Manual customization of this file is not recommended.
> *filter
> :INPUT ACCEPT [0:0]
> :FORWARD ACCEPT [0:0]
> :OUTPUT ACCEPT [0:0]
> -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> -A INPUT -p icmp -j ACCEPT
> -A INPUT -i lo -j ACCEPT
> -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
> -A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
> -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT
> -A INPUT -m state --state NEW -m tcp -p tcp --dport 50060 -j ACCEPT
> -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
> -A INPUT -j REJECT --reject-with icmp-host-prohibited
> -A FORWARD -j REJECT --reject-with icmp-host-prohibited
> COMMIT
>
> Restart the web services
> /etc/init.d/iptables restart
> iptables: Flushing firewall rules: [  OK  ]
> iptables: Setting chains to policy ACCEPT: filter  [  OK  ]
> iptables: Unloading modules:   [  OK  ]
> iptables: Applying firewall rules: [  OK  ]
>
>
> On Mon, Oct 24, 2011 at 1:37 PM, Mark question 
> wrote:
>
> > I have the same issue and the output of "curl localhost:50030" is like
> > yours, and I'm running on a remote cluster in pseudo-distributed mode.
> > Can anyone help?
> >
> > Thanks,
> > Mark
> >
> > On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui
> > wrote:
> >
> > > Hi guys,
> > >
> > > I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat
> > 6.1
> > > on Amazon EC2 and while my node is healthy, I can't seem to get to the
> > > JobTracker GUI working. Running 'curl localhost:50030' from the CMD
> line
> > > returns a valid HTML file. Ports 50030, 50060, 50070 are open in the
> > Amazon
> > > Security Group. MapReduce jobs are starting and completing
> successfully,
> > so
> > > my Hadoop install is working fine. But when I try to access the web GUI
> > > from
> > > a Chrome browser on my local computer, I get nothing.
> > >
> > > Any thoughts? I tried some Google searches and even did a hail-mary
> Bing
> > > search, but none of them were fruitful.
> > >
> > > Some troubleshooting I did is below:
> > > [root@ip-10-86-x-x ~]# jps
> > > 1337 QuorumPeerMain
> > > 1494 JobTracker
> > > 1410 DataNode
> > > 1629 SecondaryNameNode
> > > 1556 NameNode
> > > 1694 TaskTracker
> > > 1181 HRegionServer
> > > 1107 HMaster
> > > 11363 Jps
> > >
> > >
> > > [root@ip-10-86-x-x ~]# curl localhost:50030
> > > (HTML output elided by the archive: a page titled "Hadoop Administration"
> > > with a link to the JobTracker)
> > >
> >
>


Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2

2011-10-24 Thread Mark question
I have the same issue and the output of "curl localhost:50030" is like
yours, and I'm running on a remote cluster in pseudo-distributed mode.
Can anyone help?

Thanks,
Mark

On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui
wrote:

> Hi guys,
>
> I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1
> on Amazon EC2 and while my node is healthy, I can't seem to get to the
> JobTracker GUI working. Running 'curl localhost:50030' from the CMD line
> returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon
> Security Group. MapReduce jobs are starting and completing successfully, so
> my Hadoop install is working fine. But when I try to access the web GUI
> from
> a Chrome browser on my local computer, I get nothing.
>
> Any thoughts? I tried some Google searches and even did a hail-mary Bing
> search, but none of them were fruitful.
>
> Some troubleshooting I did is below:
> [root@ip-10-86-x-x ~]# jps
> 1337 QuorumPeerMain
> 1494 JobTracker
> 1410 DataNode
> 1629 SecondaryNameNode
> 1556 NameNode
> 1694 TaskTracker
> 1181 HRegionServer
> 1107 HMaster
> 11363 Jps
>
>
> [root@ip-10-86-x-x ~]# curl localhost:50030
> (HTML output elided by the archive: a page titled "Hadoop Administration"
> with a link to the JobTracker)
>


Remote Blocked Transfer count

2011-10-21 Thread Mark question
Hello,

  I wonder if there is a way to measure how many of the data blocks have been
transferred over the network? Or, more generally, how many times there was a
connection/contact between different machines?

 I thought of checking the Namenode log file, which usually shows blk_
from src= to dst ... but I'm not sure if it's correct to count those lines.

Any ideas are helpful.
Mark


fixing the mapper percentage viewer

2011-10-19 Thread Mark question
Hi all,

 I've written a custom MapRunner, but it seems to have ruined the percentage
shown for maps on the console. I want to know which part of the code is
responsible for adjusting the map percentage ... Is it the following in MapRunner:

if (incrProcCount) {
  reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
      SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
}


Thank you,
Mark
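
For what it's worth, the percentage shown for maps is normally derived from
RecordReader.getProgress() on the framework-supplied reader, so a custom runner that
drains or buffers the reader up front will look done early. A hedged sketch of a runner
that keeps consuming the supplied reader and pinging the reporter (names illustrative):

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

public class ProgressAwareRunner<K1, V1, K2, V2>
        implements MapRunnable<K1, V1, K2, V2> {
    private Mapper<K1, V1, K2, V2> mapper;

    @SuppressWarnings("unchecked")
    public void configure(JobConf job) {
        mapper = (Mapper<K1, V1, K2, V2>)
                ReflectionUtils.newInstance(job.getMapperClass(), job);
    }

    public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        try {
            while (input.next(key, value)) {
                mapper.map(key, value, output, reporter);
                reporter.progress();   // keeps the task alive and the percentage moving
            }
        } finally {
            mapper.close();
        }
    }
}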


Re: hadoop input buffer size

2011-10-10 Thread Mark question
Thanks for the clarifications guys :)
Mark
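
A tiny sketch of the knob Yang's reply (quoted below) points at; the 4096 fallback mirrors
the usual core-default.xml value, but treat exact numbers as assumptions for your install:

import org.apache.hadoop.conf.Configuration;

public class ShowReadBuffer {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // LineReader/TextInputFormat fill a buffer of this many bytes per read,
        // independently of dfs.block.size (which only governs block/split sizes).
        int bufferBytes = conf.getInt("io.file.buffer.size", 4096);
        System.out.println("io.file.buffer.size = " + bufferBytes);
    }
}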

On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 <
mahesw...@huawei.com> wrote:

> I think below can give you more info about it.
>
> http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/
> Nice explanation by Owen here.
>
> Regards,
> Uma
>
> - Original Message -
> From: Yang Xiaoliang 
> Date: Wednesday, October 5, 2011 4:27 pm
> Subject: Re: hadoop input buffer size
> To: common-user@hadoop.apache.org
>
> > Hi,
> >
> > Hadoop neither read one line each time, nor fetching
> > dfs.block.size of lines
> > into a buffer,
> > Actually, for the TextInputFormat, it read io.file.buffer.size
> > bytes of text
> > into a buffer each time,
> > this can be seen from the hadoop source file LineReader.java
> >
> >
> >
> > 2011/10/5 Mark question 
> >
> > > Hello,
> > >
> > >  Correct me if I'm wrong, but when a program opens n-files at
> > the same time
> > > to read from, and start reading from each file at a time 1 line
> > at a time.
> > > Isn't hadoop actually fetching dfs.block.size of lines into a
> > buffer? and
> > > not actually one line.
> > >
> > >  If this is correct, I set up my dfs.block.size = 3MB and each
> > line takes
> > > about 650 bytes only, then I would assume the performance for
> > reading> 1-4000
> > > lines would be the same, but it isn't !  Do you know a way to
> > find #n of
> > > lines to be read at once?
> > >
> > > Thank you,
> > > Mark
> > >
> >
>


hadoop input buffer size

2011-10-04 Thread Mark question
Hello,

  Correct me if I'm wrong, but when a program opens n files at the same time
to read from, and starts reading from each file one line at a time, isn't
Hadoop actually fetching dfs.block.size worth of data into a buffer, and
not just one line?

  If this is correct: I set my dfs.block.size = 3MB and each line takes only
about 650 bytes, so I would assume the performance for reading 1-4000
lines would be the same, but it isn't!  Do you know a way to find the number
of lines read at once?

Thank you,
Mark


Mapper Progress

2011-07-21 Thread Mark question
Hi,

   I have a custom MapRunner which seems to affect the progress report of the
mapper: it shows 100% while the mapper is still reading files to process.
Where can I change/add a progress object to be shown in the UI?

Thank you,
Mark


Re: One file per mapper

2011-07-05 Thread Mark question
Hi Govind,

You should override the isSplitable function of FileInputFormat in a
class, say myFileInputFormat extends FileInputFormat, as follows:

@Override
public boolean isSplitable(FileSystem fs, Path filename){
return false;
}

Then you use your myFileInputFormat class. To get the file path, write the
following in your mapper class:

@Override
public void configure(JobConf job) {

Path inputPath = new Path(job.get("map.input.file"));

}

~cheers,

Mark
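
Putting the two fragments above together, a hedged end-to-end sketch (old 0.20 API;
the class names and the choice of file name as the output key are only illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Whole files become single splits, so each mapper sees exactly one file.
class MyFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}

class OneFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private Path inputPath;

    @Override
    public void configure(JobConf job) {
        inputPath = new Path(job.get("map.input.file")); // full path of this mapper's file
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        output.collect(new Text(inputPath.getName()), line); // file name as the key
    }
}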

On Tue, Jul 5, 2011 at 1:04 PM, Govind Kothari wrote:

> Hi,
>
> I am new to hadoop. I have a set of files and I want to assign each file to
> a mapper. Also in mapper there should be a way to know the complete path of
> the file. Can you please tell me how to do that ?
>
> Thanks,
> Govind
>
> --
> Govind Kothari
> Graduate Student
> Dept. of Computer Science
> University of Maryland College Park
>
> <---Seek Excellence, Success will Follow --->
>


One node with Rack-local mappers ?!!!

2011-06-16 Thread Mark question
Hi, this is weird ... I'm running a job on a single node with 32 mappers,
running one at a time.

Output says this: ..

11/06/16 00:59:43 INFO mapred.JobClient: Rack-local map tasks=18
==
11/06/16 00:59:43 INFO mapred.JobClient: Launched map tasks=32
11/06/16 00:59:43 INFO mapred.JobClient: Data-local map tasks=14

Number of Hadoop nodes specified by user: 1
Received 1 nodes from PBS
Clean up node: tcc-5-72

When is that usually possible?

Thank you,
Mark


Hadoop Runner

2011-06-11 Thread Mark question
Hi,

  1) Where can I find the "main" class of hadoop? The one that calls the
InputFormat then the MapperRunner and ReducerRunner and others?

This will help me understand what is in memory or still on disk, and the exact
flow of data between splits and mappers.

My problem is, assuming I have a TextInputFormat and would like to modify
the input in memory before being read by RecordReader... where shall I do
that?

InputFormat was my first guess, but unfortunately it only defines the
logical splits ... So the only way I can think of is to use the recordReader
to read all the records in the split into another variable (with the format I
want) and then process that variable with the map functions.

   But is that efficient? So, to understand this, I hope someone can give an
answer to Q(1).

Thank you,
Mark


Re: org.apache.hadoop.mapred.Utils can not be resolved

2011-06-10 Thread Mark question
Thanks again Harsh.  Yah, learning this would make creating new project
faster!

Mark

On Thu, Jun 9, 2011 at 10:23 PM, Harsh J  wrote:

> "mapred" package would indicate the Hadoop Map/Reduce jar is required.
>
> (Note: If this is for a client op, you may require a few jars from
> lib/ too, like avro, commons-cli or so.. there was a discussion on
> this, can't find it in search right now - you may have better luck).
>
> On Fri, Jun 10, 2011 at 4:22 AM, Mark question 
> wrote:
> > Hi,
> >
> >  My question here is general to this problem. How can you know which jar
> > file will solve such error:
> >
> > *org.apache.hadoop.mapred.Utils  can not be resolved.
> >
> > *I don't plan to include all hadoop jars ... Well, hope so .. Can you
> > tell me your techniques?
> >
> > Thanks,
> > Mark
> > *
> > *
> >
>
>
>
> --
> Harsh J
>


DiskUsage class DU Error

2011-06-09 Thread Mark question
Hi,

Has anyone tried using the DU class to report HDFS file sizes?

Both of the following lines are causing errors, running on a Mac:

 DU diskUsage = new DU(new File(outDir.getPath()), 12L);
 DU diskUsage = new DU(new File(outDir.getName()), (Configuration) conf);

where, Path outDir = SequenceFileOutputFormat.getOutputPath(conf);  //
Working fine

Exception in thread "main" java.io.IOException: Expecting a line not the end
of stream
at org.apache.hadoop.fs.DU.parseExecResult(DU.java:185)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:238)
at org.apache.hadoop.util.Shell.run(Shell.java:183)
at org.apache.hadoop.fs.DU.<init>(DU.java:57)
at Analysis.analyzeOutput(Analysis.java:22)
at Main.main(Main.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)

  I run this DU command after the job is done. Any hints?

Thank you,
Mark
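
One hedged observation that may explain the error: org.apache.hadoop.fs.DU shells out to
the local du command, so it only understands local directories, while the path above comes
from an HDFS output directory. For sizes of output living in HDFS, asking the FileSystem
directly avoids the external process entirely. A minimal sketch:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputSize {
    public static void report(JobConf conf) throws Exception {
        Path outDir = SequenceFileOutputFormat.getOutputPath(conf);
        FileSystem fs = outDir.getFileSystem(conf);
        long bytes = fs.getContentSummary(outDir).getLength();  // total bytes under outDir
        System.out.println(outDir + " holds " + bytes + " bytes");
    }
}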


org.apache.hadoop.mapred.Utils can not be resolved

2011-06-09 Thread Mark question
Hi,

  My question here is more general than this particular problem. How can you know
which jar file will resolve an error such as:

org.apache.hadoop.mapred.Utils cannot be resolved.

I don't plan to include all the Hadoop jars ... well, I hope I won't have to ... Can you
tell me your techniques?

Thanks,
Mark


Re: re-reading

2011-06-08 Thread Mark question
I assumed, before reading the InputSplit API, that it is the actual split data; my bad.
Thanks a lot Harsh, it's working great!

Mark


Re: re-reading

2011-06-08 Thread Mark question
I have a question, though, for Harsh's case... I wrote my custom inputFormat
which creates an array of recordReaders and gives them to the MapRunner.

Will that mean multiple copies of the inputSplit are all in memory, or will
there be one copy referenced by all of them .. as if they were pointers?

Thanks,
Mark

On Wed, Jun 8, 2011 at 9:13 AM, Mark question  wrote:

> Thanks for the replies, but input doesn't have 'clone' I don't know why ...
> so I'll have to write my custom inputFormat ... I was hoping for an easier
> way though.
>
> Thank you,
> Mark
>
>
> On Wed, Jun 8, 2011 at 1:58 AM, Harsh J  wrote:
>
>> Or if that does not work for any reason (haven't tried it really), try
>> writing your own InputFormat wrapper where in you can have direct
>> access to the InputSplit object to do what you want to (open two
>> record readers, and manage them separately).
>>
>> On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert  wrote:
>> > Try input.clone()...
>> >
>> > 2011/6/8 Mark question :
>> >> Hi,
>> >>
>> >>   I'm trying to read the inputSplit over and over using following
>> function
>> >> in MapperRunner:
>> >>
>> >> @Override
>> >>public void run(RecordReader input, OutputCollector output, Reporter
>> >> reporter) throws IOException {
>> >>
>> >>   RecordReader copyInput = input;
>> >>
>> >>  //First read
>> >>   while(input.next(key,value));
>> >>
>> >>  //Second read
>> >>  while(copyInput.next(key,value));
>> >>   }
>> >>
>> >> It can clearly be seen that this won't work because both RecordReaders
>> are
>> >> actually the same. I'm trying to find a way for the second reader to
>> start
>> >> reading the split again from beginning ... How can I do that?
>> >>
>> >> Thanks,
>> >> Mark
>> >>
>> >
>> >
>> >
>> > --
>> > Stefan Wienert
>> >
>> > http://www.wienert.cc
>> > ste...@wienert.cc
>> >
>> > Telefon: +495251-2026838
>> > Mobil: +49176-40170270
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>


Re: re-reading

2011-06-08 Thread Mark question
Thanks for the replies, but input doesn't have 'clone' I don't know why ...
so I'll have to write my custom inputFormat ... I was hoping for an easier
way though.

Thank you,
Mark
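
A hedged sketch of the wrapper Harsh suggests below: keep the split and job around so the
same split can be re-opened from the beginning. A runner or mapper would cast the reader
to RereadableReader and call reset() for the second pass; names are illustrative and error
handling is omitted:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class RereadableInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new RereadableReader(job, (FileSplit) split);
    }

    public static class RereadableReader implements RecordReader<LongWritable, Text> {
        private final JobConf job;
        private final FileSplit split;
        private LineRecordReader in;

        RereadableReader(JobConf job, FileSplit split) throws IOException {
            this.job = job;
            this.split = split;
            this.in = new LineRecordReader(job, split);
        }

        /** Re-open the same split from the beginning for another pass. */
        public void reset() throws IOException {
            in.close();
            in = new LineRecordReader(job, split);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            return in.next(key, value);
        }
        public LongWritable createKey() { return in.createKey(); }
        public Text createValue() { return in.createValue(); }
        public long getPos() throws IOException { return in.getPos(); }
        public float getProgress() throws IOException { return in.getProgress(); }
        public void close() throws IOException { in.close(); }
    }
}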

On Wed, Jun 8, 2011 at 1:58 AM, Harsh J  wrote:

> Or if that does not work for any reason (haven't tried it really), try
> writing your own InputFormat wrapper where in you can have direct
> access to the InputSplit object to do what you want to (open two
> record readers, and manage them separately).
>
> On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert  wrote:
> > Try input.clone()...
> >
> > 2011/6/8 Mark question :
> >> Hi,
> >>
> >>   I'm trying to read the inputSplit over and over using following
> function
> >> in MapperRunner:
> >>
> >> @Override
> >>public void run(RecordReader input, OutputCollector output, Reporter
> >> reporter) throws IOException {
> >>
> >>   RecordReader copyInput = input;
> >>
> >>  //First read
> >>   while(input.next(key,value));
> >>
> >>  //Second read
> >>  while(copyInput.next(key,value));
> >>   }
> >>
> >> It can clearly be seen that this won't work because both RecordReaders
> are
> >> actually the same. I'm trying to find a way for the second reader to
> start
> >> reading the split again from beginning ... How can I do that?
> >>
> >> Thanks,
> >> Mark
> >>
> >
> >
> >
> > --
> > Stefan Wienert
> >
> > http://www.wienert.cc
> > ste...@wienert.cc
> >
> > Telefon: +495251-2026838
> > Mobil: +49176-40170270
> >
>
>
>
> --
> Harsh J
>


re-reading

2011-06-07 Thread Mark question
Hi,

   I'm trying to read the inputSplit over and over using the following function
in my MapperRunner:

@Override
public void run(RecordReader input, OutputCollector output, Reporter
reporter) throws IOException {

   RecordReader copyInput = input;

  //First read
   while(input.next(key,value));

  //Second read
  while(copyInput.next(key,value));
   }

It can clearly be seen that this won't work because both RecordReaders are
actually the same. I'm trying to find a way for the second reader to start
reading the split again from beginning ... How can I do that?

Thanks,
Mark


Re: Reducing Mapper InputSplit size

2011-06-06 Thread Mark question
Great! Thanks guys :)
Mark
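
A minimal sketch of the setMaxInputSplitSize route suggested below (new
org.apache.hadoop.mapreduce API; the 16 MB figure is only an example). For the old mapred
API, the split size works out to roughly max(mapred.min.split.size, min(totalSize divided
by the requested number of maps, dfs.block.size)), so asking for more map tasks shrinks
splits in a similar way.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplits {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "small-splits");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap each split at 16 MB so more mappers are spawned per file.
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
        // ... set mapper/reducer/output classes and submit as usual ...
    }
}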

2011/6/6 Panayotis Antonopoulos 

>
> Hi Mark,
>
> Check:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
>
> I think that setMaxInputSplitSize(Job job,
> long size)
>
>
> will do what you need.
>
> Regards,
> P.A.
>
> > Date: Mon, 6 Jun 2011 19:31:17 -0700
> > Subject: Reducing Mapper InputSplit size
> > From: markq2...@gmail.com
> > To: common-user@hadoop.apache.org
> >
> > Hi,
> >
> > Does anyone have a way to reduce InputSplit size in general ?
> >
> > By default, the minimum size chunk that map input should be split into is
> > set to 0 (ie.mapred.min.split.size). Can I change dfs.block.size or some
> > other configuration to reduce the split size and spawn many mappers?
> >
> > Thanks,
> > Mark
>
>


Reducing Mapper InputSplit size

2011-06-06 Thread Mark question
Hi,

Does anyone have a way to reduce InputSplit size in general ?

By default, the minimum size chunk that map input should be split into is
set to 0 (i.e. mapred.min.split.size). Can I change dfs.block.size or some
other configuration to reduce the split size and spawn many mappers?

Thanks,
Mark


Re: SequenceFile.Reader

2011-06-02 Thread Mark question
Actually, I checked the source code of Reader and it turns out it reads the
value into a buffer but only returns the key to the user :(  How is this
different from:

Writable value = new Writable();

reader.next(key,value) !!! both are using the same object for multiple
reads. I was hoping next(key) would skip reading value from disk.

Mark
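
For reference, the two read patterns side by side; a hedged sketch where the Text
key/value classes and the path are placeholders. As noted above, next(key) still pulls the
record bytes off disk, it just skips handing you the value unless getCurrentValue is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadPatterns {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text key = new Text();
        Text value = new Text();
        try {
            // (a) deserialize both key and value:
            // while (reader.next(key, value)) { ... }

            // (b) keys only; the value can still be fetched lazily when needed:
            while (reader.next(key)) {
                // reader.getCurrentValue(value);  // only if this record's value matters
            }
        } finally {
            reader.close();
        }
    }
}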

On Thu, Jun 2, 2011 at 6:20 PM, Mark question  wrote:

> Hi John, thanks for the reply. But I'm not asking about the key memory
> allocation here. I'm just saying what's the difference between:
>
> Next(key,value) and Next(key) .  Is the later one still reading the value
> of the key to reach the next key? or does it read the key then using the
> recordSize skips to the next key?
>
> Thanks,
> Mark
>
>
>
>
> On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong wrote:
>
>> On Thu, 2 Jun 2011 15:43:37 -0700, Mark question 
>> wrote:
>> >  Does anyone knows if :  SequenceFile.next(key) is actually not reading
>> > value into memory
>>
>> I think what you're confused by is something I stumbled upon quite by
>> accident.  The secret is that there is actually only ONE Key object that
>> the RecordReader presents to you.  The next() method doesn't create a new
>> Key object (containing the new data) but actually just loads the new data
>> into the existing Key object.
>>
>> The only place I've seen that you absolutely must remember these unusual
>> semantics is when you're trying to copy keys or values for some reason, or
>> to iterate over the Iterable of values more than once.  In these cases you
>> must make defensive copies because otherwise you'll just git a big list of
>> copies of the same Key, containing the last Key data you saw.
>>
>> hth
>>
>
>


Re: SequenceFile.Reader

2011-06-02 Thread Mark question
Hi John, thanks for the reply. But I'm not asking about the key memory
allocation here. I'm just asking what the difference is between:

Next(key,value) and Next(key).  Is the latter one still reading the value of
the key to reach the next key, or does it read the key and then, using the
recordSize, skip to the next key?

Thanks,
Mark



On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong wrote:

> On Thu, 2 Jun 2011 15:43:37 -0700, Mark question 
> wrote:
> >  Does anyone knows if :  SequenceFile.next(key) is actually not reading
> > value into memory
>
> I think what you're confused by is something I stumbled upon quite by
> accident.  The secret is that there is actually only ONE Key object that
> the RecordReader presents to you.  The next() method doesn't create a new
> Key object (containing the new data) but actually just loads the new data
> into the existing Key object.
>
> The only place I've seen that you absolutely must remember these unusual
> semantics is when you're trying to copy keys or values for some reason, or
> to iterate over the Iterable of values more than once.  In these cases you
> must make defensive copies because otherwise you'll just git a big list of
> copies of the same Key, containing the last Key data you saw.
>
> hth
>


SequenceFile.Reader

2011-06-02 Thread Mark question
Hi,

 Does anyone know if SequenceFile.Reader.next(key) actually avoids reading the
value into memory?

next(Writable key)
  Read the next key in the file into key, skipping its value.

Or is it reading the value into memory but just not showing it to me?

Thanks,
Mark


UI not working

2011-05-28 Thread Mark question
Hi,

  My UI for hadoop 20.2 on a single machine suddenly is giving the following
errors for NN and JT web-sites respectively:

HTTP ERROR: 404

/dfshealth.jsp

RequestURI=/dfshealth.jsp

*Powered by Jetty:// *


HTTP ERROR: 503

SERVICE_UNAVAILABLE

RequestURI=/jobtracker.jsp

*Powered by jetty:// *


The only thing I think of, is that I also installed version 21.0 , but had
problems with it so I shut it off and went back to 20.2.

When I check the system for 20.2 using 'fsck' everything looks fine and jobs
work ok.

Let me know how to fix that please.

Thanks,
Mark


Re: web site doc link broken

2011-05-27 Thread Mark question
I also got the following from "learn about" :
Not Found

The requested URL /common/docs/stable/ was not found on this server.
--
Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c Server at
hadoop.apache.org Port 80


Mark


On Fri, May 27, 2011 at 8:03 AM, Harsh J  wrote:

> Am not sure if someone's already fixed this, but I head to the first
> link and click Learn About, and it gets redirected to the current/
> just fine. There's only one such link on the page as well.
>
> On Fri, May 27, 2011 at 3:42 AM, Lee Fisher  wrote:
> > Th Hadoop Common home page:
> > http://hadoop.apache.org/common/
> > has a broken link ("Learn About") to the docs. It tries to use:
> > http://hadoop.apache.org/common/docs/stable/
> > which doesn't exist (404). It should probably be:
> > http://hadoop.apache.org/common/docs/current/
> > Or, someone has deleted the stable docs, which I can't help you with. :-)
> > Thanks.
> >
>
>
>
> --
> Harsh J
>


Re: How to copy over using dfs

2011-05-27 Thread Mark question
I don't think so, because I read somewhere that this is to ensure the safety
of the produced data. Hence Hadoop forces you to do this so that you know what
exactly is happening.

Mark
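
A hedged addition from the Java API side: FileSystem.copyFromLocalFile has an overload
with an overwrite flag, so a program (as opposed to the shell) does not need the
rm-then-put dance. Paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwritePut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // delSrc = false (keep the local copy), overwrite = true (replace the HDFS copy)
        fs.copyFromLocalFile(false, true,
                new Path("/local/data.txt"), new Path("/user/mark/data.txt"));
    }
}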

On Fri, May 27, 2011 at 12:28 PM, Mohit Anchlia wrote:

> If I have to overwrite a file I generally use
>
> hadoop dfs -rm 
> hadoop dfs -copyFromLocal or -put 
>
> Is there a command to overwrite/replace the file instead of doing rm first?
>


Increase node-mappers capacity in single node

2011-05-26 Thread Mark question
Hi,

  I tried changing "mapreduce.job.maps" to be more than 2 , but since I'm
running in pseudo distributed mode, JobTracker is local and hence this
property is not changed.

  I'm running on a 12 core machine and would like to make use of that ... Is
there a way to trick Hadoop?

I also tried using my virtual machine name instead of "localhost", but no
luck.

Please help,
Thanks,
Mark
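
A hedged note on the knob involved here: the number of concurrent map slots is a
TaskTracker setting, mapred.tasktracker.map.tasks.maximum (default 2), so it has to go
into mapred-site.xml before the TaskTracker starts; setting it from job code has no
effect. The snippet only shows how to read what the local config resolves to:

import org.apache.hadoop.mapred.JobConf;

public class ShowMapSlots {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // On a 12-core box one might raise this toward 10-12 in mapred-site.xml.
        int slots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        System.out.println("map slots per TaskTracker = " + slots);
    }
}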


Re: one question about hadoop

2011-05-26 Thread Mark question
web.xml is in:

 hadoop-releaseNo/webapps/job/WEB-INF/web.xml

Mark


On Thu, May 26, 2011 at 1:29 AM, Luke Lu  wrote:

> Hadoop embeds jetty directly into hadoop servers with the
> org.apache.hadoop.http.HttpServer class for servlets. For jsp, web.xml
> is auto generated with the jasper compiler during the build phase. The
> new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop
> HttpServer and doesn't need web.xml and/or jsp support either.
>
> On Thu, May 26, 2011 at 12:14 AM, 王晓峰  wrote:
> > hi,admin:
> >
> >I'm a  fresh fish from China.
> >I want to know how the Jetty combines with the hadoop.
> >I can't find the file named "web.xml" that should exist in usual
> system
> > that combine with Jetty.
> >I'll be very happy to receive your answer.
> >If you have any question,please feel free to contract with me.
> >
> > Best Regards,
> >
> > Jack
> >
>


Re: Sorting ...

2011-05-26 Thread Mark question
Well, I want something like TeraSort but for sequenceFiles instead of Lines
in Text.
My goal is efficiency and I'm currently working with Hadoop only.

Thanks for your suggestions,
Mark

On Thu, May 26, 2011 at 8:34 AM, Robert Evans  wrote:

> Also if you want something that is fairly fast and a lot less dev work to
> get going you might want to look at pig.  They can do a distributed order by
> that is fairly good.
>
> --Bobby Evans
>
> On 5/26/11 2:45 AM, "Luca Pireddu"  wrote:
>
> On May 25, 2011 22:15:50 Mark question wrote:
> > I'm using SequenceFileInputFormat, but then what to write in my mappers?
> >
> >   each mapper is taking a split from the SequenceInputFile then sort its
> > split ?! I don't want that..
> >
> > Thanks,
> > Mark
> >
> > On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu  wrote:
> > > On May 25, 2011 01:43:22 Mark question wrote:
> > > > Thanks Luca, but what other way to sort a directory of sequence
> files?
> > > >
> > > > I don't plan to write a sorting algorithm in mappers/reducers, but
> > > > hoping to use the sequenceFile.sorter instead.
> > > >
> > > > Any ideas?
> > > >
> > > > Mark
> > >
>
>
> If you want to achieve a global sort, then look at how TeraSort does it:
>
> http://sortbenchmark.org/YahooHadoop.pdf
>
> The idea is to partition the data so that all keys in part[i] are < all
> keys
> in part[i+1].  Each partition in individually sorted, so to read the data
> in
> globally sorted order you simply have to traverse it starting from the
> first
> partition and working your way to the last one.
>
> If your keys are already what you want to sort by, then you don't even need
> a
> mapper (just use the default identity map).
>
>
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>
>


Re: UI not working ..

2011-05-25 Thread Mark question
Hi,

>
>   My UI for hadoop 20.2 on a single machine suddenly is giving the
> following errors for NN and JT web-sites respectively:
>
> HTTP ERROR: 404
>
> /dfshealth.jsp
>
> RequestURI=/dfshealth.jsp
>
> *Powered by Jetty:// *
>
>
> HTTP ERROR: 503
>
> SERVICE_UNAVAILABLE
>
> RequestURI=/jobtracker.jsp
>
> *Powered by jetty:// *
>
>
> The only thing I think of, is that I also installed version 21.0 , but had
> problems so I shut it off and went back to 20.2.
>
> When I check the system using 'fsck' everything looks fine though.
>
> Let me know what you think.
>
> Thank,
>
> Mark
>


UI not working ..

2011-05-25 Thread Mark question
Hi,

  My UI for hadoop 20.2 on a single machine suddenly is giving the following
errors for NN and JT web-sites respectively:

HTTP ERROR: 404

/dfshealth.jsp

RequestURI=/dfshealth.jsp

*Powered by Jetty:// *


HTTP ERROR: 503

SERVICE_UNAVAILABLE

RequestURI=/jobtracker.jsp

*Powered by jetty:// *


The only thing I think of, is that I also installed version 21.0 , but had
problems so I shut it off and went back to 20.2.

When I check the system using 'fsck' everything looks fine though.

Let me know what you think.

Thanks,

Mark


Re: Sorting ...

2011-05-25 Thread Mark question
I'm using SequenceFileInputFormat, but then what do I write in my mappers?

  Each mapper takes a split from the sequence file and then sorts its own
split?! I don't want that..

Thanks,
Mark


On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu  wrote:

> On May 25, 2011 01:43:22 Mark question wrote:
> > Thanks Luca, but what other way to sort a directory of sequence files?
> >
> > I don't plan to write a sorting algorithm in mappers/reducers, but hoping
> > to use the sequenceFile.sorter instead.
> >
> > Any ideas?
> >
> > Mark
>
> Maybe this class can help?
>
>  org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
>
> With it you should be able to read (key,value) records from your sequence
> files
> and then do whatever you need with them.
>
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>


Re: I can't see this email ... So to clarify ..

2011-05-24 Thread Mark question
I do ...

 $ ls -l /cs/student/mark/tmp/hodhod
total 4
drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs

and ..

$ ls -l /tmp/hadoop-mark
total 4
drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs

$ ls -l /tmp/hadoop-mark/dfs/name/   <<<< only "name" is created here, no
data

Thanks agian,
Mark

On Tue, May 24, 2011 at 9:26 PM, Mapred Learn wrote:

> Do u Hv right permissions on the new dirs ?
> Try stopping n starting cluster...
>
> -JJ
>
> On May 24, 2011, at 9:13 PM, Mark question  wrote:
>
> > Well, you're right  ... moving it to hdfs-site.xml had an effect at
> least.
> > But now I'm in the NameSpace incompatable error:
> >
> > WARN org.apache.hadoop.hdfs.server.common.Util: Path
> > /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration
> > files. Please update hdfs configuration.
> > java.io.IOException: Incompatible namespaceIDs in
> /tmp/hadoop-mark/dfs/data
> >
> > My configuration for this part in hdfs-site.xml:
> > 
> > 
> >dfs.data.dir
> >/tmp/hadoop-mark/dfs/data
> > 
> > 
> >dfs.name.dir
> >/tmp/hadoop-mark/dfs/name
> > 
> > 
> >hadoop.tmp.dir
> >/cs/student/mark/tmp/hodhod
> > 
> > 
> >
> > The reason why I want to change hadoop.tmp.dir is because the student
> quota
> > under /tmp is small so I wanted to mount on /cs/student instead for
> > hadoop.tmp.dir.
> >
> > Thanks,
> > Mark
> >
> > On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria 
> wrote:
> >
> >> Try moving the the configuration to hdfs-site.xml.
> >>
> >> One word of warning, if you use /tmp to store your HDFS data, you risk
> >> data loss. On many operating systems, files and directories in /tmp
> >> are automatically deleted.
> >>
> >> -Joey
> >>
> >> On Tue, May 24, 2011 at 10:22 PM, Mark question 
> >> wrote:
> >>> Hi guys,
> >>>
> >>> I'm using an NFS cluster consisting of 30 machines, but only specified
> 3
> >> of
> >>> the nodes to be my hadoop cluster. So my problem is this. Datanode
> won't
> >>> start in one of the nodes because of the following error:
> >>>
> >>> org.apache.hadoop.hdfs.server.
> >>> common.Storage: Cannot lock storage
> /cs/student/mark/tmp/hodhod/dfs/data.
> >>> The directory is already locked
> >>>
> >>> I think it's because of the NFS property which allows one node to lock
> it
> >>> then the second node can't lock it. So I had to change the following
> >>> configuration:
> >>>  dfs.data.dir to be "/tmp/hadoop-user/dfs/data"
> >>>
> >>> But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data
> where
> >> my
> >>> hadoop.tmp.dir = " /cs/student/mark/tmp" as you might guess from above.
> >>>
> >>> Where is this configuration over-written ? I thought my core-site.xml
> has
> >>> the final configuration values.
> >>> Thanks,
> >>> Mark
> >>>
> >>
> >>
> >>
> >> --
> >> Joseph Echeverria
> >> Cloudera, Inc.
> >> 443.305.9434
> >>
>


Re: I can't see this email ... So to clarify ..

2011-05-24 Thread Mark question
Well, you're right ... moving it to hdfs-site.xml had an effect at least.
But now I'm hitting the incompatible namespaceIDs error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path
/tmp/hadoop-mark/dfs/data should be specified as a URI in configuration
files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-mark/dfs/data

My configuration for this part in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
</configuration>

The reason I want to change hadoop.tmp.dir is that the student quota
under /tmp is small, so I wanted to use /cs/student instead for
hadoop.tmp.dir.

Thanks,
Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria  wrote:

> Try moving the the configuration to hdfs-site.xml.
>
> One word of warning, if you use /tmp to store your HDFS data, you risk
> data loss. On many operating systems, files and directories in /tmp
> are automatically deleted.
>
> -Joey
>
> On Tue, May 24, 2011 at 10:22 PM, Mark question 
> wrote:
> > Hi guys,
> >
> > I'm using an NFS cluster consisting of 30 machines, but only specified 3
> of
> > the nodes to be my hadoop cluster. So my problem is this. Datanode won't
> > start in one of the nodes because of the following error:
> >
> > org.apache.hadoop.hdfs.server.
> > common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data.
> > The directory is already locked
> >
> > I think it's because of the NFS property which allows one node to lock it
> > then the second node can't lock it. So I had to change the following
> > configuration:
> >   dfs.data.dir to be "/tmp/hadoop-user/dfs/data"
> >
> > But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where
> my
> > hadoop.tmp.dir = " /cs/student/mark/tmp" as you might guess from above.
> >
> > Where is this configuration over-written ? I thought my core-site.xml has
> > the final configuration values.
> > Thanks,
> > Mark
> >
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>


I can't see this email ... So to clarify ..

2011-05-24 Thread Mark question
Hi guys,

I'm using an NFS cluster consisting of 30 machines, but only specified 3 of
the nodes to be my hadoop cluster. So my problem is this. Datanode won't
start in one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.
common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data.
The directory is already locked

I think it's because of the NFS property which allows one node to lock it
then the second node can't lock it. So I had to change the following
configuration:
   dfs.data.dir to be "/tmp/hadoop-user/dfs/data"

But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where my
hadoop.tmp.dir = " /cs/student/mark/tmp" as you might guess from above.

Where is this configuration over-written ? I thought my core-site.xml has
the final configuration values.
Thanks,
Mark


Cannot lock storage, directory is already locked

2011-05-24 Thread Mark question
Hi guys,

I'm using an NFS cluster consisting of 30 machines, but only specified 3 of
the nodes to be my hadoop cluster. So my problem is this. Datanode won't
start in one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage
/cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because of NFS: once one node locks the directory, the second
node can't lock it. Any ideas on how to solve this error?

Thanks,
Mark


Re: Sorting ...

2011-05-24 Thread Mark question
Thanks Luca, but what other way to sort a directory of sequence files?

I don't plan to write a sorting algorithm in mappers/reducers, but am hoping to
use SequenceFile.Sorter instead.

Any ideas?

Mark
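
A hedged sketch of the SequenceFile.Sorter route mentioned above; key/value classes and
paths are placeholders, and the key class must be the WritableComparable actually stored
in the files. Note this sorts and merges in a single JVM, so for very large data the
TeraSort-style partitioned sort discussed elsewhere in this thread scales better:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SortSeqFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Sorter sorter =
                new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
        Path[] inputs = { new Path("in/part-00000"), new Path("in/part-00001") };
        // Merges and sorts the inputs into a single sorted sequence file.
        sorter.sort(inputs, new Path("sorted/part-00000"), false); // false = keep inputs
    }
}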

On Mon, May 23, 2011 at 12:33 AM, Luca Pireddu  wrote:

>
> On May 22, 2011 03:21:53 Mark question wrote:
> > I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But
> > after taking a couple of minutes .. output is empty.
>
> 
>
> > I'm trying to find what the input format for the TeraSort is, but it is
> not
> > specified.
> >
> > Thanks for any thought,
> > Mark
>
> Terasort sorts lines of text.  The InputFormat (for version 0.20.2) is in
>
>
> hadoop-0.20.2/src/examples/org/apache/hadoop/examples/terasort/TeraInputFormat.java
>
> The documentation at the top of the class says "An input format that reads
> the
> first 10 characters of each line as the key and the rest of the line as the
> value."
>
> HTH
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>


Re: get name of file in mapper output directory

2011-05-24 Thread Mark question
Thanks both for the comments. Even though I finally managed to get the name of
the output file of the current mapper, I couldn't use it because, apparently,
mappers write into a "_temporary" directory while they are in progress. So in
Mapper.close(), the file it wrote to (e.g. "part-0") does not exist yet.

There has to be another way to get at the produced file. I need to sort it
immediately within the mappers.

Again, your thoughts are really helpful !

Mark

On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu  wrote:

>
>
> The path is defined by the FileOutputFormat in use.  In particular, I think
> this function is responsible:
>
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext
> ,
> java.lang.String)
>
> It should give you the file path before all tasks have completed and the
> output
> is committed to the final output path.
>
> Luca
>
> On May 23, 2011 14:42:04 Joey Echeverria wrote:
> > Hi Mark,
> >
> > FYI, I'm moving the discussion over to
> > mapreduce-u...@hadoop.apache.org since your question is specific to
> > MapReduce.
> >
> > You can derive the output name from the TaskAttemptID which you can
> > get by calling getTaskAttemptID() on the context passed to your
> > cleanup() function. The task attempt id will look like this:
> >
> > attempt_200707121733_0003_m_05_0
> >
> > You're interested in the m_05 part. This gets translated into the
> > output file name part-m-5.
> >
> > -Joey
> >
> > On Sat, May 21, 2011 at 8:03 PM, Mark question 
> wrote:
> > > Hi,
> > >
> > >  I'm running a map-only job, and at the end of each map (i.e. in its
> > > close() function) I want to open the file that the current map has
> > > written using its output collector.
> > >
> > >  I know "job.getWorkingDirectory()" would give me the parent path of the
> > > file written, but how do I get the full path or the name (i.e. part-0 or
> > > part-1)?
> > >
> > > Thanks,
> > > Mark
>
> --
> Luca Pireddu
> CRS4 - Distributed Computing Group
> Loc. Pixina Manna Edificio 1
> Pula 09010 (CA), Italy
> Tel:  +39 0709250452
>
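
A sketch of combining the two hints above in the old 0.20 mapred API: derive
the file name from the task attempt id and resolve it against the task's
temporary work directory, which is what exists while the map is still running.
It assumes the old API's default "part-NNNNN" naming (the new API uses
part-m-NNNNN, as in Joey's example); the class and field names are only
illustrative:

import java.text.NumberFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.TaskAttemptID;

// A real mapper would also implement Mapper<K1, V1, K2, V2>.
public abstract class OutputAwareMapper extends MapReduceBase {
  protected Path myOutputFile;   // the file this map task is writing

  @Override
  public void configure(JobConf job) {
    TaskAttemptID attempt = TaskAttemptID.forName(job.get("mapred.task.id"));
    NumberFormat fmt = NumberFormat.getInstance();
    fmt.setMinimumIntegerDigits(5);
    fmt.setGroupingUsed(false);
    String name = "part-" + fmt.format(attempt.getTaskID().getId());
    // getWorkOutputPath() is the task's _temporary side-effect directory,
    // so the path is valid even before the output is committed.
    myOutputFile = new Path(FileOutputFormat.getWorkOutputPath(job), name);
  }
}

In close() one could then reopen myOutputFile and sort or post-process it
before the task commits its output.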


Re: How hadoop parse input files into (Key,Value) pairs ??

2011-05-22 Thread Mark question
The case you're talking about is when you use FileInputFormat ... In general,
the InputFormat interface is the one responsible for that.

The default TextInputFormat uses a LineRecordReader, which takes your text
file and assigns the key to be the byte offset within the file and the value
to be the line (up to the '\n').

If you want to use other InputFormats, check their APIs and pick what is
suitable for you. In my case, I'm hooked on SequenceFileInputFormat, where my
input files are <key, value> records written by a regular Java program (or
parser). Then my Hadoop job will see the keys and values that I wrote.

I hope this helps a little,
Mark

On Thu, May 5, 2011 at 4:31 AM, praveenesh kumar wrote:

> Hi,
>
> As we know hadoop mapper takes input as (Key,Value) pairs and generate
> intermediate (Key,Value) pairs and usually we give input to our Mapper as a
> text file.
> How hadoop understand this and parse our input text file into (Key,Value)
> Pairs
>
> Usually our mapper looks like  --
> public void map(LongWritable key, Text value, OutputCollector<Text, Text>
> outputCollector, Reporter reporter) throws IOException {
>
> String word = value.toString();
>
> //Some lines of code
>
> }
>
> So if I pass any text file as input, it is taking every line as VALUE to
> Mapper..on which I will do some processing and put it to OutputCollector.
> But how hadoop parsed my text file into ( Key,Value ) pair and how can we
> tell hadoop what (key,value) it should give to mapper ??
>
> Thanks.
>
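
As a sketch of the "regular java program" mentioned above: writing <key, value>
records into a SequenceFile so that SequenceFileInputFormat later hands them
to the mappers unchanged. The path and key/value types below are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("input/records.seq");   // example path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 100; i++) {
        // Whatever key/value pair the parser produces becomes one record.
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }
  }
}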


I didn't see my email sent yesterday ... So here is the question again ..

2011-05-22 Thread Mark question
Hi,

  I'm running a map-only job, and at the end of each map (i.e. in its
close() function) I want to open the file that the current map has written
using its output collector.

  I know "job.getWorkingDirectory()" would give me the parent path of the
file written, but how do I get the full path or the name of the file that
this mapper has been assigned (i.e. part-0 or part-1)?

Thanks,
Mark


Re: current line number as key?

2011-05-21 Thread Mark question
What if you run a MapReduce program to generate a SequenceFile from your
text file, where the key is the line number and the value is the whole line?
Then, for the second job, the splits are done record-wise, so each mapper
will get a split/block of whole records.
~Cheers,
Mark

On Wed, May 18, 2011 at 12:18 PM, Robert Evans  wrote:

> You are correct, that there is no easy and efficient way to do this.
>
> You could create a new InputFormat that derives from FileInputFormat that
> makes it so the files do not split, and then have a RecordReader that keeps
> track of line numbers.  But then each file is read by only one mapper.
>
> Alternatively you could assume that the split is going to be done
> deterministically and do two passes one, where you count the number of lines
> in each partition, and a second that then assigns the lines based off of the
> output from the first.  But that requires two map passes.
>
> --Bobby Evans
>
>
> On 5/18/11 1:53 PM, "Alexandra Anghelescu"  wrote:
>
> Hi,
>
> It is hard to pick up certain lines of a text file - globally, I mean.
> Remember that the file is split according to its size (byte boundaries), not
> lines, so it is possible to keep track of the lines inside a split, but
> globally for the whole file, assuming it is split among map tasks... I don't
> think it is possible. I am new to Hadoop, but that is my take on it.
>
> Alexandra
>
> On Wed, May 18, 2011 at 2:41 PM, bnonymous  wrote:
>
> >
> > Hello,
> >
> > I'm trying to pick up certain lines of a text file. (say 1st, 110th line
> of
> > a file with 10^10 lines). I need a InputFormat which gives the Mapper
> line
> > number as the key.
> >
> > I tried to implement RecordReader, but I can't get line information from
> > InputSplit.
> >
> > Any solution to this???
> >
> > Thanks in advance!!!
> > --
> > View this message in context:
> >
> http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
>


Sorting ...

2011-05-21 Thread Mark question
I'm trying to sort sequence files using the Hadoop example TeraSort, but
after a couple of minutes the output is empty.

HDFS has the following Sequence files:
-rw-r--r--   1 Hadoop supergroup  196113760 2011-05-21 12:16
/user/Hadoop/out/part-0
-rw-r--r--   1 Hadoop supergroup  250935096 2011-05-21 12:16
/user/Hadoop/out/part-1
-rw-r--r--   1 Hadoop supergroup  262943648 2011-05-21 12:17
/user/Hadoop/out/part-2
-rw-r--r--   1 Hadoop supergroup  114888492 2011-05-21 12:17
/user/Hadoop/out/part-3

After running:  hadoop jar hadoop-mapred-examples-0.21.0.jar terasort out
sorted
Error is:
   
11/05/21 18:13:12 INFO mapreduce.Job:  map 74% reduce 20%
11/05/21 18:13:14 INFO mapreduce.Job: Task Id :
attempt_201105202144_0039_m_09_0, Status : FAILED
java.io.EOFException: read past eof

I'm trying to find what the input format for the TeraSort is, but it is not
specified.

Thanks for any thought,
Mark


get name of file in mapper output directory

2011-05-21 Thread Mark question
Hi,

  I'm running a map-only job, and at the end of each map (i.e. in its
close() function) I want to open the file that the current map has written
using its output collector.

  I know "job.getWorkingDirectory()" would give me the parent path of the
file written, but how do I get the full path or the name (i.e. part-0 or
part-1)?

Thanks,
Mark


Re: outputCollector vs. Localfile

2011-05-20 Thread Mark question
I thought it was, because of FileBytesWritten counter. Thanks for the
clarification.
Mark

On Fri, May 20, 2011 at 4:23 AM, Harsh J  wrote:

> Mark,
>
> On Fri, May 20, 2011 at 10:17 AM, Mark question 
> wrote:
> > This is puzzling me ...
> >
> >  With a mapper producing output of size ~ 400 MB ... which one is
> supposed
> > to be faster?
> >
> >  1) output collector: which will write to local file then copy to HDFS
> since
> > I don't have reducers.
>
> A regular map-only job does not write to the local FS, it writes to
> the HDFS directly (i.e., a local DN if one is found).
>
> --
> Harsh J
>


outputCollector vs. Localfile

2011-05-19 Thread Mark question
This is puzzling me ...

  With a mapper producing output of size ~ 400 MB ... which one is supposed
to be faster?

 1) The output collector, which will write to a local file and then copy to
HDFS, since I don't have reducers.

  2) Opening a unique local file inside "mapred.local.dir" for each mapper.

   I expected (2) to win, but (1) was actually faster ... can someone explain?

 Thanks,
Mark


Re: How do you run HPROF locally?

2011-05-17 Thread Mark question
or conf.setBoolean("mapred.task.profile", true);

Mark

On Tue, May 17, 2011 at 4:49 PM, Mark question  wrote:

> I usually do this setting inside my java program (in run function) as
> follows:
>
> JobConf conf = new JobConf(this.getConf(),My.class);
> conf.set("mapred.task.profile", "true");
>
> then I'll see some output files in that same working directory.
>
> Hope that helps,
> Mark
>
>
> On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill  wrote:
>
>> I am running a Hadoop Java program in local single-JVM mode via an IDE
>> (IntelliJ).  I want to do performance profiling of it.  Following the
>> instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added the
>> following properties to my job configuration file.
>>
>>
>>  <property>
>>    <name>mapred.task.profile</name>
>>    <value>true</value>
>>  </property>
>>
>>  <property>
>>    <name>mapred.task.profile.params</name>
>>    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
>>  </property>
>>
>>  <property>
>>    <name>mapred.task.profile.maps</name>
>>    <value>0-</value>
>>  </property>
>>
>>  <property>
>>    <name>mapred.task.profile.reduces</name>
>>    <value>0-</value>
>>  </property>
>>
>>
>> With these properties, the job runs as before, but I don't see any
>> profiler
>> output.
>>
>> I also tried simply setting
>>
>>
>>  <property>
>>    <name>mapred.child.java.opts</name>
>>    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
>>  </property>
>>
>>
>> Again, no profiler output.
>>
>> I know I have HPROF installed because running "java -agentlib:hprof=help"
>> at
>> the command prompt produces a result.
>>
>> Is it possible to run HPROF on a local Hadoop job?  Am I doing something
>> wrong?
>>
>
>
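
A sketch of setting the same profiling properties in the job driver instead
of the XML file (property names as in 0.20). Note these only take effect when
tasks run in separate child JVMs; with the local runner inside an IDE
everything is one JVM, so there I believe you would pass the -agentlib:hprof
options directly to the IDE's run configuration instead:

import org.apache.hadoop.mapred.JobConf;

// Inside a Tool's run() method, for example; MyJob is just a placeholder class.
JobConf conf = new JobConf(getConf(), MyJob.class);
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params",
    "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,"
        + "thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-");     // profile all map tasks, as above
conf.set("mapred.task.profile.reduces", "0-");  // and all reduce tasks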


Re: How do you run HPROF locally?

2011-05-17 Thread Mark question
I usually do this setting inside my Java program (in the run() function) as
follows:

JobConf conf = new JobConf(this.getConf(),My.class);
conf.set("mapred.task.profile", "true");

then I'll see some output files in that same working directory.

Hope that helps,
Mark

On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill  wrote:

> I am running a Hadoop Java program in local single-JVM mode via an IDE
> (IntelliJ).  I want to do performance profiling of it.  Following the
> instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added the
> following properties to my job configuration file.
>
>
>  <property>
>    <name>mapred.task.profile</name>
>    <value>true</value>
>  </property>
>
>  <property>
>    <name>mapred.task.profile.params</name>
>    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
>  </property>
>
>  <property>
>    <name>mapred.task.profile.maps</name>
>    <value>0-</value>
>  </property>
>
>  <property>
>    <name>mapred.task.profile.reduces</name>
>    <value>0-</value>
>  </property>
>
>
> With these properties, the job runs as before, but I don't see any profiler
> output.
>
> I also tried simply setting
>
>
>  <property>
>    <name>mapred.child.java.opts</name>
>    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
>  </property>
>
>
> Again, no profiler output.
>
> I know I have HPROF installed because running "java -agentlib:hprof=help"
> at
> the command prompt produces a result.
>
> Is it possible to run HPROF on a local Hadoop job?  Am I doing something
> wrong?
>


Re: Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Thanks for the input, but I'm running on a university cluster, not my own;
are assumptions such as each task (mapper/reducer) taking 1 GB still valid
there?

So I guess to tune performance I should try running the job multiple times
and rely on execution time as an indicator of success.

Thanks again,
Mark

On Tue, May 17, 2011 at 3:16 PM, Konstantin Boudnik  wrote:

> Also, it seems like Ganglia would be very well complemented by Nagios
> to allow you to monitor an overall health of your cluster.
> --
>   Take care,
> Konstantin (Cos) Boudnik
> 2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622
>
> Disclaimer: Opinions expressed in this email are those of the author,
> and do not necessarily represent the views of any company the author
> might be affiliated with at the moment of writing.
>
> On Tue, May 17, 2011 at 15:15, Allen Wittenauer  wrote:
> >
> > On May 17, 2011, at 3:11 PM, Mark question wrote:
> >
> >> So what other memory consumption tools do you suggest? I don't want to
> do it
> >> manually and dump statistics into file because IO will affect
> performance
> >> too.
> >
> >We watch memory with Ganglia.  We also tune our systems such that
> a task will only take X amount.  In other words, given an 8gb RAM:
> >
> >1gb for the OS
> >1gb for the TT and DN
> >6gb for all tasks
> >
> >if we assume each task will take max 1gb, then we end up with 3
> maps and 3 reducers.
> >
> >Keep in mind that the mem consumed is more than just JVM heap
> size.
>
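
For reference, a sketch of how Allen's 8 GB example above would translate
into mapred-site.xml on each TaskTracker (property names from 0.20; the
numbers are only the example's, not a recommendation):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>   <!-- roughly 1 GB of heap per task JVM -->
</property>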


Re: Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
So what other memory-consumption tools do you suggest? I don't want to do it
manually and dump statistics into a file, because the I/O will affect
performance too.

Thanks,
Mark

On Tue, May 17, 2011 at 2:58 PM, Allen Wittenauer  wrote:

>
> On May 17, 2011, at 1:01 PM, Mark question wrote:
>
> > Hi
> >
> >  I need to use hadoop-tool-kit for monitoring. So I followed
> > http://code.google.com/p/hadoop-toolkit/source/checkout
> >
> > and applied the patch in my hadoop.20.2 directory as: patch -p0 <
> patch.20.2
>
> Looking at the code, be aware this is going to give incorrect
> results/suggestions for certain stats it generates when multiple jobs are
> running.
>
>It also seems to lack "the algorithm should be rewritten" and "the
> data was loaded incorrectly" suggestions, which is usually the proper answer
> for perf problems 80% of the time.


Again ... Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Sorry for the spam, but I didn't see my previous email yet.

  I need to use hadoop-tool-kit for monitoring. So I followed
http://code.google.com/p/hadoop-toolkit/source/checkout

and applied the patch in my hadoop.20.2 directory as: patch -p0 < patch.20.2


and set the property "mapred.performance.diagnose" to true in
mapred-site.xml.

But I don't see the memory information that is supposed to be shown, as
described at
http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring

I then installed hadoop-0.21.0 and set only the same property as above, but I
still don't see the expected monitoring info.

   ... What am I doing wrong?

I appreciate any thoughts,
Mark


Hadoop tool-kit for monitoring

2011-05-17 Thread Mark question
Hi

  I need to use hadoop-tool-kit for monitoring. So I followed
http://code.google.com/p/hadoop-toolkit/source/checkout

and applied the patch in my hadoop.20.2 directory as: patch -p0 < patch.20.2


and set the property "mapred.performance.diagnose" to true in
mapred-site.xml.

But I don't see the memory information that is supposed to be shown, as
described at
http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring

I then installed hadoop-0.21.0 and set only the same property as above, but I
still don't see the expected monitoring info.

   ... What am I doing wrong?

I appreciate any thoughts,
Mark


Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Thanks again Owen; hopefully my last question:

   Which class fills in map.input.file and map.input.offset, so I can extend
it to have a function that returns these strings?

Thanks,
Mark

On Thu, May 12, 2011 at 10:07 PM, Owen O'Malley  wrote:

> On Thu, May 12, 2011 at 9:23 PM, Mark question 
> wrote:
>
> >  So there is no way I can see the other possible splits (start+length)?
> > like
> > some function that returns strings of map.input.file and map.input.offset
> > of
> > the other mappers ?
> >
>
> No, there isn't any way to do it using the public API.
>
> The only way would be to look under the covers and read the split file
> (job.split).
>
> -- Owen
>
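
To the "which class" question: as far as I can tell it is MapTask that copies
the FileSplit into the per-task JobConf before the mapper is configured, so
each task only ever sees its own values (in 0.20 the offset and length show up
as map.input.start and map.input.length). A sketch of reading them in the old
API; the class name is just an example:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// A real mapper would also implement Mapper<K1, V1, K2, V2>.
public abstract class SplitAwareMapper extends MapReduceBase {
  protected String splitFile;    // path of the file this split comes from
  protected long splitStart;     // byte offset where the split begins
  protected long splitLength;    // length of the split in bytes

  @Override
  public void configure(JobConf job) {
    splitFile   = job.get("map.input.file");
    splitStart  = job.getLong("map.input.start", 0);
    splitLength = job.getLong("map.input.length", 0);
  }
}

Since these are filled per task, seeing the other mappers' splits would mean
reading the job.split file directly, as Owen says.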


Re: how to get user-specified Job name from hadoop for running jobs?

2011-05-12 Thread Mark question
Do you mean by "user-specified" the job name you set via
JobConf.setJobName("myTask")?
Then, using the same object, you can get the name back as follows:

JobConf conf;          // the same JobConf the name was set on
conf.getJobName();

~Cheers
Mark

On Tue, May 10, 2011 at 10:16 AM, Mark Zand  wrote:

> While I can get JobStatus with this:
>
> JobClient client = new JobClient(new JobConf(conf));
> JobStatus[] jobStatuses = client.getAllJobs();
>
>
> I don't see any way to get user-specified Job name.
>
> Please help. Thanks.
>
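
For the original question (the names of jobs already submitted to the
cluster, not your own JobConf), a sketch that looks each JobStatus up through
the JobClient and asks the RunningJob handle for its name; I believe
RunningJob exposes getJobName() in the 0.20 API:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

// ... inside a method that is allowed to throw IOException:
JobClient client = new JobClient(new JobConf(conf));
for (JobStatus status : client.getAllJobs()) {
  RunningJob job = client.getJob(status.getJobID());
  if (job != null) {
    System.out.println(status.getJobID() + "\t" + job.getJobName());
  }
}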


Re: Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Thanks for the reply Owen, I only knew about map.input.file.

 So there is no way for me to see the other possible splits (start + length)?
For example, some function that returns the map.input.file and
map.input.offset strings of the other mappers?

Thanks,
Mark

On Thu, May 12, 2011 at 9:08 PM, Owen O'Malley  wrote:

> On Thu, May 12, 2011 at 8:59 PM, Mark question 
> wrote:
>
> > Hi
> >
> >   I'm using FileInputFormat which will split files logically according to
> > their sizes into splits. Can the mapper get a pointer to these splits?
> and
> > know which split it is assigned ?
> >
>
> Look at
>
> http://hadoop.apache.org/common/docs/r0.20.203.0/mapred_tutorial.html#Task+JVM+Reuse
>
>  In particular, map.input.file and map.input.offset are the configuration
> parameters that you want.
>
> -- Owen
>


I can't see my messages immediately, and sometimes they don't even arrive. Why?

2011-05-12 Thread Mark question



Can Mapper get paths of inputSplits ?

2011-05-12 Thread Mark question
Hi

   I'm using FileInputFormat, which splits files logically into splits
according to their sizes. Can the mapper get a pointer to these splits and
know which split it has been assigned?

   I tried looking at the Reporter class to see how it prints the logical
splits on the UI for each mapper, but it's an interface.

   Eg.
Mapper1:  is assigned the logical split
"hdfs://localhost:9000/user/Hadoop/input:23+24"
Mapper2:  is assigned the logical split
"hdfs://localhost:9000/user/Hadoop/input:0+23"

 Then, inside map, I want to ask what the logical splits are, get the two
strings above, and know which one my current mapper is assigned.

 Thanks,
Mark


Space needed to use SequenceFile.Sorter

2011-04-28 Thread Mark question
I don't know why I can't see my emails immediately sent to the group ...
anyways,

I'm sorting a sequence file using its sorter on my local filesystem. The
input file size is 1937690478 bytes.

But after 14 minutes of sorting, I get:

TEST SORTING ..
java.io.FileNotFoundException: File does not exist:
/usr/mark/tmp/mapred/local/SortedOutput.0
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676)
at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1353)
at
org.apache.hadoop.io.SequenceFile$Sorter.cloneFileAttributes(SequenceFile.java:2663)
at
org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:2712)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2285)
at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2324)
at
CrossPartitionSimilarity.TestSorter(CrossPartitionSimilarity.java:164)
at CrossPartitionSimilarity.main(CrossPartitionSimilarity.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Yet the file is still there:  wc -c SortedOutput.0   --->  1918661230
../tmp/mapred/local/SortedOutput.0
And if it were a matter of space, I checked and the disk can hold up to 209
GB. So my question is: are there restrictions in some JVM configuration that
I should take care of?

Thank you,
Maha


Required Space for SequenceFile.Sorter

2011-04-28 Thread Mark question
Hi,

    Does anyone know how much space (memory, heap, disk, ...) is needed for
SequenceFile.Sorter to sort an input of, say, Y bytes?
    A formula in terms of Y, for example?

Thanks,
Mark


Re: Reading from File

2011-04-27 Thread Mark question
On Tue, Apr 26, 2011 at 11:49 PM, Harsh J  wrote:

> Hello Mark,
>
> On Wed, Apr 27, 2011 at 12:19 AM, Mark question 
> wrote:
> > Hi,
> >
> >   My mapper opens a file and read records using next() . However, I want
> to
> > stop reading if there is no memory available. What confuses me here is
> that
> > even though I'm reading record by record with next(), hadoop actually
> reads
> > them in dfs.block.size. So, I have two questions:
>
> The dfs.block.size is a HDFS property, and does not have a rigid
> relationship with InputSplits in Hadoop MapReduce. It is used as hints
> for constructing offsets and lengths of splits for the RecordReaders
> to seek-and-read from and until.
>
> > 1. Is it true that even if I set dfs.block.size to 512 MB, then at least
> one
> > block is loaded in memory for mapper to process (part of inputSplit)?
>
> Blocks are not pre-loaded into memory, they are merely read off the FS
> record by record (or buffer by buffer, if you please).
>


I assume the record reader actually has a couple of records read from disk
into a memory buffer, to be handed record by record to the maps. It cannot be
the case that each recordReader.next() reads one record at a time from disk.
So my question is: how much is read into the buffer from disk at once by the
record reader? Is there a parameter for the amount of memory used for this
buffering?




> You shouldn't really have memory issues with any of the
> Hadoop-provided RecordReaders as long as individual records fit well
> into available Task JVM memory grants.
>
> > 2. How can I read multiple records from a sequenceFile at once and will
> it
> > make a difference ?
>
> Could you clarify on what it is you seek here? Do you want to supply
> your mappers with N records every call via a sequence file or do you
> merely look to do this to avoid some memory issues as stated above?
>
> In case of the former, it would be better if your Sequence Files were
> prepared with batched records instead of writing a custom N-line
> splitting InputFormat for the SequenceFiles (which will need to
> inspect the file pre-submit).
>
> Have I understood your questions right?
>
My mapper has other SequenceFiles opened to be read from inside the map
function. So inside map, I use a SequenceFile.Reader and its next() to grab
one record at a time. Now I'm looking for a function that does something like
nextNrecords() on the opened sequence file. I'm thinking of this because, in
general, buffered reading of multiple blocks from disk is better than reading
block by block, due to syscall overhead. Does that make sense? Unless you are
saying that next() actually buffers multiple records even when the user
requests only one.


> --
> Harsh J
>
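
As far as I know there is no nextNrecords() on SequenceFile.Reader, but each
next() already reads through a stream buffer (sized by io.file.buffer.size),
so it does not hit the disk once per record. If batching inside the mapper
still helps, a sketch -- the key/value types are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public final class SeqBatchReader {
  // Reads up to n values from the reader into a list; returns fewer at EOF.
  public static List<Text> nextNRecords(SequenceFile.Reader reader, int n)
      throws IOException {
    List<Text> batch = new ArrayList<Text>(n);
    FloatWritable key = new FloatWritable();
    Text value = new Text();
    while (batch.size() < n && reader.next(key, value)) {
      batch.add(new Text(value));   // copy, since next() reuses the objects
    }
    return batch;
  }
}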


Reading from File

2011-04-26 Thread Mark question
Hi,

   My mapper opens a file and reads records using next(). However, I want to
stop reading if there is no memory available. What confuses me here is that
even though I'm reading record by record with next(), Hadoop actually reads
them in dfs.block.size chunks. So I have two questions:

1. Is it true that even if I set dfs.block.size to 512 MB, then at least one
block is loaded in memory for mapper to process (part of inputSplit)?

2. How can I read multiple records from a sequenceFile at once and will it
make a difference ?

Thanks,
Mark


Re: Configured Memory Capacity

2011-04-25 Thread Mark question
I think I changed it using:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>
    <final>true</final>
  </property>

Hope that helps,
Mark

On Mon, Apr 25, 2011 at 9:41 AM, maha  wrote:

> Hi,
>
>  I'm running out of memory as shown by the fsck -report
>
>  Decommission Status : Normal
>  Configured Capacity: 4227530752 (3.94 GB)
>  DFS Used: 30007296 (28.62 MB)
>  Non DFS Used: 4195323904 (3.91 GB)
>  DFS Remaining: 2199552(2.1 MB)
>  DFS Used%: 0.71%
>  DFS Remaining%: 0.05%
>  Last contact: Sun Apr 24 15:50:36 PDT 2011
>
>Can I change the configured capacity ? or is it set up automatically
> by Hadoop based on available resources?
> Thanks,
> Maha


Re: Sequence.Sorter Performance

2011-04-25 Thread Mark question
Thanks Owen !
Mark

On Mon, Apr 25, 2011 at 11:43 AM, Owen O'Malley  wrote:

> The SequenceFile sorter is ok. It used to be the sort used in the shuffle.
> *grin*
>
> Make sure to set io.sort.factor and io.sort.mb to appropriate values for
> your hardware. I'd usually use io.sort.factor as 25 * drives and io.sort.mb
> is the amount of memory you can allocate to the sorting.
>
> -- Owen
>
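
A sketch of Owen's advice, using his rule of thumb (io.sort.factor = 25 *
drives, io.sort.mb = the memory you can spare); the numbers below assume a
hypothetical 4-drive box with about 1 GB to give the sorter:

import org.apache.hadoop.conf.Configuration;

// Set these on the Configuration that is passed to SequenceFile.Sorter.
Configuration conf = new Configuration();
conf.setInt("io.sort.factor", 100);   // 25 * 4 drives: streams merged at once
conf.setInt("io.sort.mb", 1024);      // MB of buffer memory for the sort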


SequenceFile.Sorter

2011-04-24 Thread Mark question
Hi guys,

I'm trying to sort a 2.5 GB sequence file in one mapper using its built-in
sort function, but it's taking so long that the map is killed for not
reporting.

I would increase the default timeout for reports from the mapper, but I'll do
this only if sorting with SequenceFile.Sorter is known to be optimal ...
Anyone know?
Or any other suggested options?

Thanks,

Mark


SequenceFile.Sorter performance

2011-04-24 Thread Mark question
Hi guys,

I'm trying to sort a 2.5 GB sequence file in one mapper using its built-in
sort function, but it's taking so long that the map is killed for not
reporting.

I would increase the default timeout for reports from the mapper, but I'll do
this only if sorting with SequenceFile.Sorter is known to be optimal ...
Anyone know?
Or any other suggested options?

Thanks,

Mark