Re: OutOfMemory error processing large amounts of gz files

2009-02-26 Thread Arun C Murthy


On Feb 24, 2009, at 4:03 PM, bzheng wrote:





2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
java.lang.OutOfMemoryError: Java heap space



That tells us that your TaskTracker is running out of memory, not
your reduce tasks.


I think you are hitting http://issues.apache.org/jira/browse/HADOOP-4906.


What version of hadoop are you running?

Arun



Re: Can anyone verify Hadoop FS shell command return codes?

2009-02-26 Thread Mikhail Yakshin
On Mon, Feb 23, 2009 at 4:02 PM, S D wrote:
 I'm attempting to use the Hadoop FS shell
 (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a Ruby
 script. My challenge is that I'm unable to get the return value of the
 commands I'm invoking. As an example, I try to run get as follows:

 hadoop fs -get /user/hadoop/testFile.txt .

 From the command line this generally works, but I need to be able to verify
 that it is working during execution in my Ruby script. The command should
 return 0 on success and -1 on error. Based on

 http://pasadenarb.com/2007/03/ruby-shell-commands.html

 I am using backticks to make the hadoop call and get the return value. Here
 is a dialogue within irb (Ruby's interactive shell) in which the command was
 not successful:

 irb(main):001:0> `hadoop dfs -get testFile.txt .`
 get: null
 => ""

 and a dialogue within irb in which the command was successful

 irb(main):010:0> `hadoop dfs -get testFile.txt .`
 => ""

 In both cases, neither a 0 nor a 1 appeared as a return value; indeed
 nothing was returned. Can anyone who is using the FS command shell return
 values within any scripting language (Ruby, PHP, Perl, ...) please confirm
 that it is working as expected or send an example snippet?

You seem to be confusing the captured stdout output with the exit
status: backticks return the command's stdout, while the exit status is
reported separately. Try checking $?.exitstatus in Ruby:

irb(main):001:0> `true`
=> ""
irb(main):002:0> $?.exitstatus
=> 0
irb(main):003:0> `false`
=> ""
irb(main):004:0> $?.exitstatus
=> 1

-- 
WBR, Mikhail Yakshin


Re: OutOfMemory error processing large amounts of gz files

2009-02-26 Thread bzheng



Arun C Murthy-2 wrote:
 
 
 On Feb 24, 2009, at 4:03 PM, bzheng wrote:

 
 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
 java.lang.OutOfMemoryError: Java heap space

 
 That tells us that your TaskTracker is running out of memory, not
 your reduce tasks.
 
 I think you are hitting http://issues.apache.org/jira/browse/HADOOP-4906.
 
 What version of hadoop are you running?
 
 Arun
 
 
 

I'm using 0.18.2.  We figured that gz may not be the root problem: when we
ran a big job not involving any gz files, we got the same out of memory
problem after about 1.5 hours.  One interesting thing, though: if we do use
gz files, the out of memory issue occurs within a few minutes.



Re: Eclipse plugin

2009-02-26 Thread John Livingstone

Iman-4,
I have encountered the same problem that you have: not being able to
access HDFS on my Hadoop VMware Linux server (using the Hadoop Yahoo
tutorial) and not seeing hadoop.job.ugi in my Eclipse Europa 3.3.2 list of
parameters.  What did you have to do or change to get it to work?
Thanks,
John L.




Iman-4 wrote:
 
 Thank you so much, Norbert. It worked.
 Iman
 Norbert Burger wrote:
 Are you running Eclipse on Windows?  If so, be aware that you need to spawn
 Eclipse from within Cygwin in order to access HDFS.  It seems that the
 plugin uses whoami to get info about the active user.  This thread has
 some more info:

 http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e

 Norbert

 On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote:
   
 Hi,
 I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in
 of
 hadoop. I have followed all the steps in this tutorial:
 http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
 problem is that I am not able to browse the HDFS. It only shows an entry
 Error:null. Upload files to DFS, and Create new directory fail. Any
 suggestions? I have tried to change all the directories in the hadoop
 location advanced parameters to /tmp/hadoop-user, but it did not work.
 Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to
 be
 changed, but I could not find it in the list of parameters.
 Thanks
 Iman





Shuffle phase

2009-02-26 Thread Nathan Marz
Do the reducers batch copy map outputs from a machine? That is, if a  
machine M has 15 intermediate map outputs destined for machine R, will  
machine R copy the intermediate outputs one at a time or all at once? 


Re: Eclipse plugin

2009-02-26 Thread Iman

Hi John,
When I created the hadoop location, hadoop.job.ugi did not appear in
the advanced parameters. But when I later edited the location, it was
there; I don't know how that was fixed :)
Also, to get it to work, I had to edit fs.default.name and
mapred.job.tracker in hadoop/conf/hadoop-site.xml.

I added these lines:
<property>
  <name>fs.default.name</name>
  <value>hdfs://ip_address:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>ip_address:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Finally, I decided to install hadoop locally on my machine instead of 
using the hadoop virtual machine.

Iman.

John Livingstone wrote:

Iman-4,
I have encountered the same problem that you have: not being able to
access HDFS on my Hadoop VMware Linux server (using the Hadoop Yahoo
tutorial) and not seeing hadoop.job.ugi in my Eclipse Europa 3.3.2 list of
parameters.  What did you have to do or change to get it to work?
Thanks,
John L.




Iman-4 wrote:
  

Thank you so much, Norbert. It worked.
Iman
Norbert Burger wrote:


Are you running Eclipse on Windows?  If so, be aware that you need to spawn
Eclipse from within Cygwin in order to access HDFS.  It seems that the
plugin uses whoami to get info about the active user.  This thread has
some more info:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e

Norbert

On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote:
  
  

Hi,
I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in
of
hadoop. I have followed all the steps in this tutorial:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
problem is that I am not able to browse the HDFS. It only shows an entry
Error:null. Upload files to DFS, and Create new directory fail. Any
suggestions? I have tried to change all the directories in the hadoop
location advanced parameters to /tmp/hadoop-user, but it did not work.
Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to
be
changed, but I could not find it in the list of parameters.
Thanks
Iman



Re: Shuffle phase

2009-02-26 Thread Owen O'Malley


On Feb 26, 2009, at 2:03 PM, Nathan Marz wrote:

Do the reducers batch copy map outputs from a machine? That is, if a  
machine M has 15 intermediate map outputs destined for machine R,  
will machine R copy the intermediate outputs one at a time or all at  
once?


Currently, one at a time. In 0.21 it will be batched up.

-- Owen


Atomicity of file operations?

2009-02-26 Thread Brian Long
What kind of atomicity/visibility claims are made regarding the various
operations on a FileSystem?
I have multiple processes that write into local sequence files, then upload
them into a remote directory in HDFS. A map/reduce job runs which operates
on whatever is in the directory. The processes are not synchronized with the
job, so it is entirely possible that the job might start as a file is being
uploaded. Thus, my concern is that the job may include a partially uploaded
file if FileSystem.copyFromLocalFile is not atomic (in the sense that the
file will not appear until all bytes are written).

Are any of the FileSystem APIs atomic in this sense? What about, at the
very least, rename (e.g. first write to a temp HDFS location, then use
rename to atomically flip the file into the live directory)?
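
For concreteness, here is a minimal sketch of the pattern I have in mind,
using the standard org.apache.hadoop.fs.FileSystem API (the paths are made
up for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AtomicUpload {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Hypothetical paths, for illustration only.
      Path local = new Path("/local/data/part-00000.seq");
      Path tmp   = new Path("/user/hadoop/tmp/part-00000.seq");
      Path live  = new Path("/user/hadoop/input/part-00000.seq");

      // Upload into a temp location that the job never reads, so a
      // partially written file is never scanned by the job.
      fs.copyFromLocalFile(local, tmp);

      // Then flip it into the live directory; the hope is that rename
      // makes the file appear there all at once.
      if (!fs.rename(tmp, live)) {
        throw new RuntimeException("rename failed: " + tmp + " -> " + live);
      }
    }
  }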

Thanks,
Brian


Announcing CloudBase-1.2 release

2009-02-26 Thread Tarandeep Singh
Hi,

We have released version 1.2 of CloudBase on SourceForge-
http://cloudbase.sourceforge.net/

[CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is released to the open source
community under the GNU GPL license.]

Please give it a try and send us your feedback on the CloudBase users group-
http://groups.google.com/group/cloudbase-users

Thanks,
Tarandeep

Release notes-
---
New Features:
* User Defined Functions (UDFs)- Users can create functions in the Java
programming language and call them from SQL
* Table indexing- One can create an index on the columns of a table to
reduce query execution time
* ORDER BY improvements- All machines in the cluster are used to perform
sorting. This is done by sampling the data and sending keys to the correct
partitioners.
* TRUNCATE statement- a TRUNCATE statement to delete all data in a table.

Bug fixes:
* CloudBase was not working with Hadoop 0.19 or later versions
* Full outer joins were not working
* New jars copied into the $CLOUDBASE_HOME/lib directory were not picked up
for the next query execution


Online documentation has been updated with new features-
http://cloudbase.sourceforge.net/index.html#userDoc


Re: Atomicity of file operations?

2009-02-26 Thread Brian Bockelman


On Feb 26, 2009, at 4:14 PM, Brian Long wrote:

What kind of atomicity/visibility claims are made regarding the various
operations on a FileSystem?
I have multiple processes that write into local sequence files, then upload
them into a remote directory in HDFS. A map/reduce job runs which operates
on whatever is in the directory. The processes are not synchronized with the
job, so it is entirely possible that the job might start as a file is being
uploaded. Thus, my concern is that the job may include a partially uploaded
file if FileSystem.copyFromLocalFile is not atomic (in the sense that the
file will not appear until all bytes are written).


Hey Brian,

I can't claim to know about the whole file system, but I do know
that, like you'd expect in Unix, open files which are being written to
are visible.





Are any of the FileSystem APIs atomic in this sense? What about, at the
very least, rename (e.g. first write to a temp HDFS location, then use
rename to atomically flip the file into the live directory)?



I'm not sure on this one; I suspect you're safe here.

Brian


How to deal with HDFS failures properly

2009-02-26 Thread Brian Long
I'm wondering what the proper actions to take are, in light of a NameNode or
DataNode failure, in an application which is holding a reference to a
FileSystem object.
* Does the FileSystem handle all of this itself (e.g. reconnect logic)?
* Do I need to get a new FileSystem using .get(Configuration)?
* Does the FileSystem need to be closed before re-getting?
* Do the answers to these questions depend on whether it's a NameNode or
DataNode that's failed?

In short, how does an application (not a Hadoop job -- just an app using
HDFS) properly recover from a NameNode or DataNode failure? I haven't
figured out the magic juju yet and my applications are not handling DFS
outages gracefully.
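
For concreteness, here is a rough sketch of the kind of recovery wrapper I
have in mind. It assumes that closing the old FileSystem and calling
FileSystem.get() again is the right way to re-establish a connection, which
is exactly the part I'm unsure about:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DfsRetry {
    private final Configuration conf = new Configuration();
    private FileSystem fs;

    public synchronized boolean existsWithRetry(Path p, int attempts)
        throws IOException, InterruptedException {
      for (int i = 0; i < attempts; i++) {
        try {
          if (fs == null) {
            fs = FileSystem.get(conf);   // (re)open a handle to the DFS
          }
          return fs.exists(p);
        } catch (IOException e) {
          // Drop the possibly broken handle and retry after a pause.
          if (fs != null) {
            try { fs.close(); } catch (IOException ignored) { }
            fs = null;
          }
          Thread.sleep(5000);
        }
      }
      throw new IOException("giving up after " + attempts + " attempts");
    }
  }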

Thanks,
Brian


Re: Atomicity of file operations?

2009-02-26 Thread Brian Long
Thanks Brian. I will go with the copy to tmp and flip with rename model.
-B

On Thu, Feb 26, 2009 at 3:49 PM, Brian Bockelman bbock...@cse.unl.edu wrote:


 On Feb 26, 2009, at 4:14 PM, Brian Long wrote:

 What kind of atomicity/visibility claims are made regarding the various
 operations on a FileSystem?
 I have multiple processes that write into local sequence files, then upload
 them into a remote directory in HDFS. A map/reduce job runs which operates
 on whatever is in the directory. The processes are not synchronized with the
 job, so it is entirely possible that the job might start as a file is being
 uploaded. Thus, my concern is that the job may include a partially uploaded
 file if FileSystem.copyFromLocalFile is not atomic (in the sense that the
 file will not appear until all bytes are written).


 Hey Brian,

 I can't claim to know about the whole file system, but I do know that,
 like you'd expect in Unix, open files which are being written to are
 visible.



  Are any of the FileSystem APIs atomic in this sense? What about, at the
  very least, rename (e.g. first write to a temp HDFS location, then use
  rename to atomically flip the file into the live directory)?


 I'm not sure on this one; I suspect you're safe here.

 Brian