reducer outofmemoryerror

2008-04-23 Thread Apurva Jadhav

Hi,
 I have a 4 node hadoop 0.15.3 cluster. I am using the default config 
files. I am running a map reduce job to process 40 GB log data.

Some reduce tasks are failing with the following errors:
1)
stderr
Exception in thread org.apache.hadoop.io.ObjectWritable Connection 
Culler Exception in thread 
[EMAIL PROTECTED] 
java.lang.OutOfMemoryError: Java heap space
Exception in thread IPC Client connection to /127.0.0.1:34691 
java.lang.OutOfMemoryError: Java heap space

Exception in thread main java.lang.OutOfMemoryError: Java heap space

2)
stderr
Exception in thread org.apache.hadoop.io.ObjectWritable Connection 
Culler java.lang.OutOfMemoryError: Java heap space


syslog:
2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200804212359_0007_r_04_0 Merge of the 19 files in 
InMemoryFileSystem complete. Local file is 
/data/hadoop-im2/mapred/local/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client: 
java.net.SocketException: Socket closed

   at java.net.SocketInputStream.read(SocketInputStream.java:162)
   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)

2008-04-22 20:34:16,032 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child

java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner: 
Communication exception: java.lang.OutOfMemoryError: Java heap space


Has anyone experienced a similar problem?  Is there any configuration 
change that can help resolve this issue?


Regards,
aj





RE: Can Hadoop process pictures and videos?

2008-04-23 Thread Roland Rabben
Thanks
Do you have any pointers to how you would normally approach this? Is
Hadoop Streaming the solution I am looking for?

Many thanks again.

Roland

 -Original Message-
 From: Ted Dunning [mailto:[EMAIL PROTECTED]
 Sent: 22. april 2008 22:54
 To: core-user@hadoop.apache.org
 Subject: Re: Can Hadoop process pictures and videos?
 
 
 Yes you can.
 
 One issue is typically that Linux-based video codecs are not as numerous
 as Windows-based codecs, so you may be a bit limited as to what kinds of
 video you can process.
 
 Also, most video processing and transcoding is embarrassingly parallel at
 the file level, with little need for map-reduce.  That may make Hadoop less
 useful than it might otherwise be.  On the other hand, Hadoop does expose
 URLs from which you can read file data, so you might be just as happy using
 that.
 
 
 On 4/22/08 1:48 PM, Roland Rabben [EMAIL PROTECTED]
wrote:
 
  Hi
 
  Sorry for my ignorance, but I am trying to understand if I can use
  Hadoop and Map/Reduce to process video files and images. Encoding
and
  transcoding videos is an example of what I would like to do.
 
 
 
  Thank you for your patience.
 
 
 
  Regards
 
  Roland
 



Re: Can Hadoop process pictures and videos?

2008-04-23 Thread Owen O'Malley


On Apr 22, 2008, at 11:44 PM, Roland Rabben wrote:


Thanks
Do you have any pointers to how you would normally approach this? Is
Hadoop Streaming the solution I am looking for?


Probably not, given that Hadoop streaming doesn't work well with  
binary data. It is probably easiest to use the C++ interface.  
Look at the examples in src/examples/pipes.


-- Owen


User accounts in Master and Slaves

2008-04-23 Thread Sridhar Raman
After trying out Hadoop on a single machine, I decided to run a MapReduce
job across multiple machines.  This is the approach I followed:
1 Master
1 Slave

(A doubt here:  Can my Master also be used to execute the Map/Reduce
functions?)

To do this, I set up the masters and slaves files in the conf directory.
Following the instructions in this page -
http://hadoop.apache.org/core/docs/current/cluster_setup.html, I had set up
sshd in both the machines, and was able to ssh from one to the other.

I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a password
for [EMAIL PROTECTED], but on the slave there is only user2, while on the master
user1 is the logged-in user.  How do I resolve this?  Should the same user
account be present on all the machines?  Or can I specify this somewhere?


Re: reducer outofmemoryerror

2008-04-23 Thread Harish Mallipeddi
Memory settings are in conf/hadoop-default.xml. You can override them in
conf/hadoop-site.xml.

Specifically, I think you would want to change mapred.child.java.opts.
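
For example, a minimal sketch of doing this from the job driver (the class
name and the 512 MB value are just placeholders; the same setting can equally
go into conf/hadoop-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();  // normally new JobConf(YourJob.class)
    // Give each map/reduce child JVM a larger heap than the hadoop-default.xml value.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // ... set input/output paths, mapper and reducer classes, then JobClient.runJob(conf)
  }
}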

On Wed, Apr 23, 2008 at 2:40 PM, Apurva Jadhav [EMAIL PROTECTED] wrote:

 Hi,
  I have a 4 node hadoop 0.15.3 cluster. I am using the default config
 files. I am running a map reduce job to process 40 GB log data.
 Some reduce tasks are failing with the following errors:
 1)
 stderr
 Exception in thread org.apache.hadoop.io.ObjectWritable Connection
 Culler Exception in thread
 [EMAIL PROTECTED]
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread IPC Client connection to /127.0.0.1:34691
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread main java.lang.OutOfMemoryError: Java heap space

 2)
 stderr
 Exception in thread org.apache.hadoop.io.ObjectWritable Connection
 Culler java.lang.OutOfMemoryError: Java heap space

 syslog:
 2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200804212359_0007_r_04_0 Merge of the 19 files in
 InMemoryFileSystem complete. Local file is /data/hadoop-im2/mapred/loca
 l/task_200804212359_0007_r_04_0/map_22600.out
 2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client:
 java.net.SocketException: Socket closed
   at java.net.SocketInputStream.read(SocketInputStream.java:162)
   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)

 2008-04-22 20:34:16,032 WARN org.apache.hadoop.mapred.TaskTracker: Error
 running child
 java.lang.OutOfMemoryError: Java heap space
 2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner:
 Communication exception: java.lang.OutOfMemoryError: Java heap space

 Has anyone experienced similar problem ?   Is there any configuration
 change that can help resolve this issue.

 Regards,
 aj






-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


Re: User accounts in Master and Slaves

2008-04-23 Thread Harish Mallipeddi
On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman [EMAIL PROTECTED]
wrote:

 After trying out Hadoop in a single machine, I decided to run a MapReduce
 across multiple machines.  This is the approach I followed:
 1 Master
 1 Slave

 (A doubt here:  Can my Master also be used to execute the Map/Reduce
 functions?)


If you add the master node to the list of slaves (conf/slaves), then the
master node will also run a TaskTracker.



 To do this, I set up the masters and slaves files in the conf directory.
 Following the instructions in this page -
 http://hadoop.apache.org/core/docs/current/cluster_setup.html, I had set
 up
 sshd in both the machines, and was able to ssh from one to the other.

 I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a password
 for [EMAIL PROTECTED], while in slave, there was only user2.  While in master,
 user1 was the logged on user.  How do I resolve this?  Should the user
 accounts be present in all the machines?  Or can I specify this somewhere?




-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


Re: reducer outofmemoryerror

2008-04-23 Thread Amar Kamat

Apurva Jadhav wrote:

Hi,
 I have a 4 node hadoop 0.15.3 cluster. I am using the default config 
files. I am running a map reduce job to process 40 GB log data.
How many maps and reducers are there? Make sure that there is a 
sufficient number of reducers. Look at conf/hadoop-default.xml (see the 
mapred.child.java.opts parameter) to change the heap settings.
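
As a hedged illustration (the class name and the numbers are placeholders,
not from the original job), both knobs can be set on the JobConf before
submitting:

import org.apache.hadoop.mapred.JobConf;

public class ReduceTuningSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();  // normally new JobConf(YourJob.class)
    // Spread the reduce work over more reducers so each one buffers less data.
    conf.setNumReduceTasks(16);
    // Optionally also raise the per-task child heap.
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}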

Amar

Some reduce tasks are failing with the following errors:
1)
stderr
Exception in thread org.apache.hadoop.io.ObjectWritable Connection 
Culler Exception in thread 
[EMAIL PROTECTED] 
java.lang.OutOfMemoryError: Java heap space
Exception in thread IPC Client connection to /127.0.0.1:34691 
java.lang.OutOfMemoryError: Java heap space

Exception in thread main java.lang.OutOfMemoryError: Java heap space

2)
stderr
Exception in thread org.apache.hadoop.io.ObjectWritable Connection 
Culler java.lang.OutOfMemoryError: Java heap space


syslog:
2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask: 
task_200804212359_0007_r_04_0 Merge of the 19 files in 
InMemoryFileSystem complete. Local file is /data/hadoop-im2/mapred/loca

l/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client: 
java.net.SocketException: Socket closed

   at java.net.SocketInputStream.read(SocketInputStream.java:162)
   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)

2008-04-22 20:34:16,032 WARN org.apache.hadoop.mapred.TaskTracker: 
Error running child

java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner: 
Communication exception: java.lang.OutOfMemoryError: Java heap space


Has anyone experienced similar problem ?   Is there any configuration 
change that can help resolve this issue.


Regards,
aj







Re: User accounts in Master and Slaves

2008-04-23 Thread Sridhar Raman
Ok, what about the issue regarding the users?  Do all the machines need to
be under the same user?

On Wed, Apr 23, 2008 at 12:43 PM, Harish Mallipeddi 
[EMAIL PROTECTED] wrote:

 On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman [EMAIL PROTECTED]
 wrote:

  After trying out Hadoop in a single machine, I decided to run a
 MapReduce
  across multiple machines.  This is the approach I followed:
  1 Master
  1 Slave
 
  (A doubt here:  Can my Master also be used to execute the Map/Reduce
  functions?)
 

 If you add the master node to the list of slaves (conf/slaves), then the
  master node will also run a TaskTracker.


 
  To do this, I set up the masters and slaves files in the conf directory.
  Following the instructions in this page -
  http://hadoop.apache.org/core/docs/current/cluster_setup.html, I had set
  up
  sshd in both the machines, and was able to ssh from one to the other.
 
  I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a
 password
  for [EMAIL PROTECTED], while in slave, there was only user2.  While in 
  master,
  user1 was the logged on user.  How do I resolve this?  Should the user
  accounts be present in all the machines?  Or can I specify this
 somewhere?
 



 --
 Harish Mallipeddi
 circos.com : poundbang.in/blog/



Re: submitting map-reduce jobs without creating jar file ?

2008-04-23 Thread Torsten Curdt


On Apr 23, 2008, at 00:31, Ted Dunning wrote:


Grool might help you.


Got a link? Google is not very helpful on the Grool + Groovy search.

cheers
--
Torsten



Simple SetWritable class

2008-04-23 Thread CloudyEye

I tried to extend TreeSet to be Writable. Here is what I did:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import java.util.TreeSet;
import org.apache.hadoop.io.WritableComparable;

public class SetWritable extends TreeSet<Integer> implements
        WritableComparable {

    public void readFields(DataInput in) throws IOException {
        clear();  // reset any state left over from a reused instance
        int sz = in.readInt();
        for (int i = 0; i < sz; i++) {
            add(in.readInt());
        }
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(size());
        Iterator<Integer> iter = this.iterator();
        while (iter.hasNext()) {
            out.writeInt(iter.next());
        }
    }

    // Placeholder compareTo() so the class satisfies WritableComparable
    // (not part of the question; compares the sorted string forms).
    public int compareTo(Object other) {
        return toString().compareTo(other.toString());
    }
}


 If I remove clear() from readFields() I am getting wrong output (some
old data is mixed in with the new data!). With clear() it is OK, as my
simple tests show.

Is this implementation safe to use?
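
For reference, a small round-trip check (hypothetical test code, not from the
original post) that shows why clear() matters: Hadoop reuses Writable
instances between records, so readFields() has to wipe out whatever the
previous record left behind.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SetWritableRoundTrip {
  public static void main(String[] args) throws IOException {
    SetWritable first = new SetWritable();
    first.add(1);
    first.add(2);

    // Serialize "first" the way Hadoop would.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    first.write(new DataOutputStream(bytes));

    // Simulate a reused instance that still holds data from an earlier record.
    SetWritable reused = new SetWritable();
    reused.add(99);
    reused.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

    // With clear() in readFields() this prints [1, 2];
    // without it, the stale 99 would still be in the set.
    System.out.println(reused);
  }
}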

Regards,



-- 
View this message in context: 
http://www.nabble.com/Simple-SetWritable-class-tp16833976p16833976.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Not able to back up to S3

2008-04-23 Thread Tom White
Part of the problem here is that the error message is confusing. It
looks like there's a problem with the AWS credentials, when in fact
the host name is malformed (but URI isn't telling us). I've created a
patch to make the error message more helpful:
https://issues.apache.org/jira/browse/HADOOP-3301.
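
For illustration (hypothetical snippet, not from the thread), the underscore
makes java.net.URI parse the bucket as an authority with no valid host, so
getHost() comes back null, which lines up with the malformed-hostname
behaviour described in the replies below:

import java.net.URI;

public class BucketNameCheck {
  public static void main(String[] args) {
    // Underscores are allowed in a URI authority but not in a hostname,
    // so the first call prints null while the second prints the bucket name.
    System.out.println(URI.create("s3://my_bucket/data").getHost());  // null
    System.out.println(URI.create("s3://my-bucket/data").getHost());  // my-bucket
  }
}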

Tom

On Fri, Apr 18, 2008 at 11:20 AM, Steve Loughran [EMAIL PROTECTED] wrote:
 Chris K Wensel wrote:

  You cannot have underscores in a bucket name; it freaks out java.net.URI.
 

  It freaks out DNS, too, which is why the java.net classes whine. Minus
  signs should work.

  --
  Steve Loughran  http://www.1060.org/blogxter/publish/5
  Author: Ant in Action   http://antbook.org/



Re: Appending to Input Path after mapping has begun

2008-04-23 Thread Mathos Marcer
Though I have only spent a couple of days reviewing the code, it seems
the crux of the problem is in the InputFormat interface: getSplits is
only called at the initiation of the map/reduce job. If this method were
more iterable in implementation, something like a getNextSplits, you
would have a way to add more files to the pipeline while a job is in
progress.

==
mathos

On Wed, Apr 23, 2008 at 1:32 AM, Owen O'Malley [EMAIL PROTECTED] wrote:


  On Apr 22, 2008, at 11:01 AM, Thomas Cramer wrote:


  Is it possible or how may one add to the input path after mapping has
  begun?  More specifically say my Map process creates more files to
  needing to Map and you don't want to have to keep re-initiating
  Map/Reduce processes.  I tried simply creating files in the InputPath
  directory.  I have also pulled the JobConf object into my map process
  and issued an addInputPath but apparently it doesn't effect the
  process after it is running.  Any thoughts or options?
 

  No, it isn't currently possible. I can imagine an extension to the
 framework that lets you add new input splits to a job after it has started,
 but it would be a lot of work to get it right. The primary advantage of such
 a system would be that you could increase the efficiency of a pipeline of
 map/reduce jobs.

  -- Owen



Re: Hadoop summit video capture?

2008-04-23 Thread Jeremy Zawodny
Certainly...

Stay tuned.

Jeremy

On 4/22/08, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi Jeremy,

 Any chance that these videos could be made in a downloadable format rather
 than thru Y!'s player?

 For example I'm traveling right now and would love to watch the rest of
 the
 presentations but the next few hours I won't have an internet connection.

 So, my request won't help me, but may help folks in similar situations.

 Just a thought, thanks!

 Cheers,
   Chris



 On 4/22/08 1:27 PM, Jeremy Zawodny [EMAIL PROTECTED] wrote:

  Okay, things appear to be fixed now.
 
  Jeremy
 
  On 4/20/08, Jeremy Zawodny [EMAIL PROTECTED] wrote:
 
  Not yet... there seem to be a lot of cooks in the kitchen on this one,
 but
  we'll get it fixed.
 
  Jeremy
 
  On 4/19/08, Cole Flournoy [EMAIL PROTECTED] wrote:
 
   Any news on when the videos are going to work?  I am dying to watch
  them!
 
  Cole
 
  On Fri, Apr 18, 2008 at 8:10 PM, Jeremy Zawodny [EMAIL PROTECTED]
  wrote:
 
  Almost... The videos and slides are up (as of yesterday) but there
  appears
  to be an ACL problem with the videos.
 
 
 
  http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html
 
  Jeremy
 
  On 4/17/08, wuqi [EMAIL PROTECTED] wrote:
 
  Are the videos and slides available now?
 
 
  - Original Message -
  From: Jeremy Zawodny [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
 
  Cc: [EMAIL PROTECTED]
  Sent: Thursday, March 27, 2008 11:01 AM
  Subject: Re: Hadoop summit video capture?
 
 
  Slides and video go up next week.  It just takes a few days to
  assemble.
 
  We're glad everyone enjoyed it and was okay with a last minute
  venue
  change.
 
  Thanks also to Amazon.com and the NSF (not NFS as I typo'd on the
  printed
  agenda!)
 
  Jeremy
 
  On 3/26/08, Cam Bazz [EMAIL PROTECTED] wrote:
 
  Yes, are there any materials for those who could not come to
  summit? I
  am
  really curious about this summit.
 
  Is the material posted on the hadoop page?
 
  Best Regards,
  -C.A.
 
  On Wed, Mar 26, 2008 at 8:48 AM, Isabel Drost 
  [EMAIL PROTECTED]
  wrote:
 
 
  On Wednesday 26 March 2008, Jeff Eastman wrote:
  I personally got a lot of positive feedback and interest in
  Mahout,
  so
  expect your inbox to explode in the next couple of days.
 
  Sounds great. I was already happy we received quite some
  traffic
  after
  we
  published that we would take part in the GSoC.
 
  Isabel
 
  --
  kernel, n.: A part of an operating system that preserves
  the
  medieval
  traditions of sorcery and black art.
   |\  _,,,---,,_   Web:   http://www.isabel-drost.de
   /,`.-'`'-.  ;-;;,_
   |,4-  ) )-,_..;\ (  `'-'
  '---''(_/--'  `-'\_) (fL)  IM:  xmpp://[EMAIL PROTECTED]
 
 
 
 
 
 
 


 __
 Chris Mattmann, Ph.D.
 [EMAIL PROTECTED]
 Cognizant Development Engineer
 Early Detection Research Network Project
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266B Mailstop:  171-246
 ___

 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.





Re: Hadoop summit video capture?

2008-04-23 Thread Chris Mattmann
Thanks, Jeremy. Appreciate it.

Cheers,
 Chris



On 4/23/08 8:25 AM, Jeremy Zawodny [EMAIL PROTECTED] wrote:

 Certainly...
 
 Stay tuned.
 
 Jeremy
 

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: User accounts in Master and Slaves

2008-04-23 Thread Norbert Burger
Yes, this is the suggested configuration.  Hadoop's startup scripts rely on
password-less SSH to start the daemons on the slave machines.  You can find
instructions on creating/transferring the SSH keys here:

http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

On Wed, Apr 23, 2008 at 4:39 AM, Sridhar Raman [EMAIL PROTECTED]
wrote:

 Ok, what about the issue regarding the users?  Do all the machines need to
 be under the same user?

 On Wed, Apr 23, 2008 at 12:43 PM, Harish Mallipeddi 
 [EMAIL PROTECTED] wrote:

  On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman [EMAIL PROTECTED]
  wrote:
 
   After trying out Hadoop in a single machine, I decided to run a
  MapReduce
   across multiple machines.  This is the approach I followed:
   1 Master
   1 Slave
  
   (A doubt here:  Can my Master also be used to execute the Map/Reduce
   functions?)
  
 
  If you add the master node to the list of slaves (conf/slaves), then the
   master node will also run a TaskTracker.
 
 
  
   To do this, I set up the masters and slaves files in the conf
 directory.
   Following the instructions in this page -
   http://hadoop.apache.org/core/docs/current/cluster_setup.html, I had
 set
   up
   sshd in both the machines, and was able to ssh from one to the other.
  
   I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a
  password
   for [EMAIL PROTECTED], while in slave, there was only user2.  While in
 master,
   user1 was the logged on user.  How do I resolve this?  Should the user
   accounts be present in all the machines?  Or can I specify this
  somewhere?
  
 
 
 
  --
  Harish Mallipeddi
  circos.com : poundbang.in/blog/
 



Best practices for handling many small files

2008-04-23 Thread Stuart Sierra
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single record?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?
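
For what it's worth, a rough sketch of that packing step (hedged: the local
"docs" directory and the "packed.seq" output path are placeholders, and error
handling is omitted):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFilesSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One block-compressed SequenceFile: key = original file name, value = raw bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("packed.seq"),
        Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      // "docs" is a placeholder for the local directory the tar.bz2 was unpacked
      // into; assumes it contains only regular files.
      for (File f : new File("docs").listFiles()) {
        byte[] data = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(data);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}

A job can then read the result with SequenceFileInputFormat, getting one
(file name, contents) record per original file.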

Thanks,
-Stuart, altlaw.org


Re: Can Hadoop process pictures and videos?

2008-04-23 Thread Ted Dunning

It's also easy to launch processes from Java.  If you have unsplittable
input files, Java can read the entire file and pass it to the child process.


On 4/22/08 11:49 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 
 On Apr 22, 2008, at 11:44 PM, Roland Rabben wrote:
 
 Thanks
 Do you have any pointers to how you would normally approach this? Is
 Hadoop Streaming the solution I am looking for?
 
 Probably not, given that Hadoop streaming doesn't work well with
  binary data. It is probably easiest to use the C++ interface.
 Look at the examples in src/examples/pipes.
 
 -- Owen



Re: submitting map-reduce jobs without creating jar file ?

2008-04-23 Thread Ted Dunning

I haven't distributed it formally yet.

If you would like a tarball, I would be happy to send it.


On 4/23/08 1:43 AM, Torsten Curdt [EMAIL PROTECTED] wrote:

 
 On Apr 23, 2008, at 00:31, Ted Dunning wrote:
 
 Grool might help you.
 
 Got a link? Google is not very helpful on the Grool + Groovy search.
 
 cheers
 --
 Torsten
 



Re: Best practices for handling many small files

2008-04-23 Thread Ted Dunning

Yes.  That (2) should work well.


On 4/23/08 8:55 AM, Stuart Sierra [EMAIL PROTECTED] wrote:

 Hello all, Hadoop newbie here, asking: what's the preferred way to
 handle large (~1 million) collections of small files (10 to 100KB) in
 which each file is a single record?
 
 1. Ignore it, let Hadoop create a million Map processes;
 2. Pack all the files into a single SequenceFile; or
 3. Something else?
 
 I started writing code to do #2, transforming a big tar.bz2 into a
 BLOCK-compressed SequenceFile, with the file names as keys.  Will that
 work?
 
 Thanks,
 -Stuart, altlaw.org



RE: Best practices for handling many small files

2008-04-23 Thread Joydeep Sen Sarma
A million map processes are horrible. Aside from the overhead, don't do it if you 
share the cluster with other jobs (all other jobs will get killed whenever the 
million-map job is finished; see 
https://issues.apache.org/jira/browse/HADOOP-2393).

Even for #2, it begs the question of how the packing itself will be 
parallelized.

There's a MultiFileInputFormat that can be extended that allows processing of 
multiple files in a single map task. It needs improvement. For one, it's an 
abstract class, and a concrete implementation for (at least) text files would 
help. Also, the splitting logic is not very smart (from what I last saw). 
Ideally, it should take the million files and form them into N groups (say N is 
the size of your cluster), where each group has files local to the Nth machine, and 
then process them on that machine. Currently it doesn't do this (the groups are 
arbitrary). But it's still the way to go.


-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
 
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single record?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org



Re: submitting map-reduce jobs without creating jar file ?

2008-04-23 Thread Ted Dunning

I need some advice/help on how it should be structured as a contrib module.


On 4/23/08 9:31 AM, Doug Cutting [EMAIL PROTECTED] wrote:

 Ted Dunning wrote:
 I haven't distributed it formally yet.
 
 If you would like a tarball, I would be happy to send it.
 
 Can you attach it to a Jira issue?  Then we can target it for a contrib
 module or somesuch.
 
 Doug



Re: submitting map-reduce jobs without creating jar file ?

2008-04-23 Thread Ted Dunning

Just added it to HADOOP-2781.

See here:

[ https://issues.apache.org/jira/browse/HADOOP-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]




On 4/23/08 9:35 AM, Torsten Curdt [EMAIL PROTECTED] wrote:

 Ah, OK
 
 Well, bring it on :-)
 
 cheers
 --
 Torsten
 
 On Apr 23, 2008, at 18:06, Ted Dunning wrote:
 
 
 I haven't distributed it formally yet.
 
 If you would like a tarball, I would be happy to send it.
 
 
 On 4/23/08 1:43 AM, Torsten Curdt [EMAIL PROTECTED] wrote:
 
 
 On Apr 23, 2008, at 00:31, Ted Dunning wrote:
 
 Grool might help you.
 
 Got a link? Google is not very helpful on the Grool + Groovy
 search.
 
 cheers
 --
 Torsten
 
 
 



RE: How to instruct Job Tracker to use certain hosts only

2008-04-23 Thread Htin Hlaing
Thanks, Owen, for the suggestion.  I wonder if there would be side effects
from failing the task on that node consistently. Would the job tracker
blacklist the node for other jobs as well?

Htin

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED] 
Sent: Monday, April 21, 2008 10:53 PM
To: core-user@hadoop.apache.org
Subject: Re: How to instruct Job Tracker to use certain hosts only


On Apr 18, 2008, at 1:52 PM, Htin Hlaing wrote:

 I would like to run the first job to run on all the compute hosts  
 in the
 cluster (which is by default) and then, I would like to run the  
 second job
 with only on  a subset of the hosts (due to some licensing issue).

One option would be to set mapred.map.max.attempts and  
mapred.reduce.max.attempts to larger numbers and have the map or  
reduce fail if it is run on a bad node. When the task re-runs, it  
will run on a different node. Eventually it will find a valid node.
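
A hedged sketch of that approach (host names and the attempt count are
illustrative, not from the thread): bump the retry limits in the driver, e.g.
conf.setInt("mapred.map.max.attempts", 20), and have the task fail fast in
configure() when it finds itself on an unlicensed host.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LicensedHostBase extends MapReduceBase {
  // Hosts that have the licensed software installed (illustrative names).
  private static final Set<String> LICENSED =
      new HashSet<String>(Arrays.asList("node07", "node08"));

  public void configure(JobConf job) {
    try {
      String host = InetAddress.getLocalHost().getHostName();
      if (!LICENSED.contains(host)) {
        // Failing the attempt here makes the framework reschedule it on another node.
        throw new RuntimeException("no license on " + host);
      }
    } catch (UnknownHostException e) {
      throw new RuntimeException(e);
    }
  }
}

A mapper or reducer would extend this and add its map()/reduce() method; with
the attempt limits raised, the deliberate failures just push the task onto
another node instead of killing the job.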

-- Owen



Re: Best practices for handling many small files

2008-04-23 Thread Chris K Wensel
Are the files to be stored on HDFS long term, or do they need to be
fetched from an external authoritative source?

Depending on how things are set up in your datacenter, etc.:

You could aggregate them into a fat sequence file (or a few). Keep in
mind how long it would take to fetch the files and aggregate them
(this is a serial process) and whether the corpus changes often (i.e., how
often you will need to rebuild these sequence files).

Another option is to make a manifest (a list of docs to fetch), feed
that to your mapper, and have it fetch each file individually. This
would be useful if the corpus is reasonably arbitrary between runs and
could eliminate much of the load time, but painful if the data is
external to your datacenter and the cost to refetch is high.
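
As a hedged sketch of the manifest idea (assuming the old mapred API of that
era and a manifest text file with one path per line; everything else is
illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is one line of the manifest: the path of a document to fetch.
public class ManifestFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, BytesWritable> {

  private Configuration conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    Path doc = new Path(line.toString().trim());
    FileSystem fs = doc.getFileSystem(conf);

    // The files are small, so reading each one fully into memory is acceptable.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    FSDataInputStream in = fs.open(doc);
    try {
      byte[] chunk = new byte[64 * 1024];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
    } finally {
      in.close();
    }
    out.collect(new Text(doc.getName()), new BytesWritable(buf.toByteArray()));
  }
}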


There really is no simple answer.

ckw


On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:
million map processes are horrible. aside from overhead - don't do  
it if u share the cluster with other jobs (all other jobs will get  
killed whenever the million map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393)


well - even for #2 - it begs the question of how the packing itself  
will be parallelized ..


There's a MultiFileInputFormat that can be extended - that allows  
processing of multiple files in a single map job. it needs  
improvement. For one - it's an abstract class - and a concrete  
implementation for (at least)  text files would help. also - the  
splitting logic is not very smart (from what i last saw). ideally -  
it should take the million files and form it into N groups (say N is  
size of your cluster) where each group has files local to the Nth  
machine and then process them on that machine. currently it doesn't  
do this (the groups are arbitrary). But it's still the way to go ..



-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single record?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Error in start up

2008-04-23 Thread Aayush Garg
I changed my hostname to R61neptun as you suggested, but I am still getting
the error:

localhost: starting datanode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-datanode-R61neptun.out
localhost: starting secondarynamenode, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-secondarynamenode-R61neptun.out
localhost: Exception in thread main java.lang.IllegalArgumentException:
port out of range:-1
localhost:  at
java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
localhost:  at
org.apache.hadoop.dfs.DataNode.createSocketAddr(DataNode.java:104)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:94)
localhost:  at
org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:481)
starting jobtracker, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-jobtracker-R61neptun.out
localhost: starting tasktracker, logging to
/home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-garga-tasktracker-R61neptun.out

Could anyone tell me what this error means? I am just trying to run Hadoop in
pseudo-distributed mode.

Thanks,


On Tue, Apr 22, 2008 at 11:57 PM, Sujee Maniyam [EMAIL PROTECTED] wrote:


  logs/hadoop-root-datanode-R61-neptun.out

 May be this will help you:

 I am guessing - from the log file name above - that your hostname has
 underscores/dashes (e.g. R61-neptune).  Could you try to use the hostname
 without underscores or dashes?  (e.g. R61neptune, or even simply 'hadoop').

 I had the same problem with Hadoop v0.16.3.  My hostnames were
 'hadoop_master / hadoop_slave', and I was getting the 'Port out of range:
 -1' exception.  Once I eliminated the underscores (e.g. master / slave) it
 started working.

 thanks

 --
 View this message in context:
 http://www.nabble.com/Error-in-start-up-tp16783362p16826259.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Benchmarking and Statistics for Hadoop Distributed File System

2008-04-23 Thread Cagdas Gerede
Is anybody aware of any benchmarking of the Hadoop Distributed File System?

Some numbers I am interested in are:
- How long does it take for the master to recover if there are, say, 1 million
blocks in the system?
- How does the recovery time change as the number of blocks in the system
changes? Is it linear? Is it exponential?
- What is the file read/write throughput of the Hadoop File System with
different configurations and loads?



-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


Please Help: Namenode Safemode

2008-04-23 Thread Cagdas Gerede
I have a Hadoop distributed file system with 3 datanodes. I only have 150
blocks in each datanode. It takes a little more than a minute for the namenode
to start and pass the safe mode phase.

The steps for namenode startup, as much as I understand, are:
1) A datanode sends a heartbeat to the namenode. The namenode tells the
datanode to send a block report as a piggyback on the heartbeat.
2) The datanode computes the block report.
3) The datanode sends it to the namenode.
4) The namenode processes the block report.
5) The namenode's safe mode monitor thread checks whether it can exit, and the
namenode leaves safe mode if the threshold is reached and the extension time
has passed.

Here are my numbers:
Step 1) Datanodes send heartbeats every 3 seconds.
Step 2) The datanode computes the block report (this takes about 20
milliseconds, as shown in the datanodes' logs).
Step 3) No idea? (Depends on the size of the block report. I suspect this
should not be more than a couple of seconds.)
Step 4) No idea? Shouldn't be more than a couple of seconds.
Step 5) The thread checks every second. The extension value in my configuration
is 0, so there is no wait once the threshold is reached.

Given these numbers, can anybody explain where the one minute comes from?
Shouldn't these steps take 10-20 seconds?
Please help. I am very confused.



-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


Mounting HDFS in Linux using FUSE

2008-04-23 Thread Don Lexdel Gasmen
Hi!

I have recently downloaded the latest version of fuse-dfs.  I would like to
ask if anyone can help me with using it to mount HDFS on Linux. I need help
ASAP, thanks.

Currently I have on my machine:

OS: Ubuntu 7.10
Hadoop: 0.16.3
Java: 1.6.3
Fuse: 2.7.3