Re: OutOfMemory error processing large amounts of gz files
On Feb 24, 2009, at 4:03 PM, bzheng wrote:

> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
> java.lang.OutOfMemoryError: Java heap space

That tells you that your TaskTracker is running out of memory, not your reduce tasks. I think you are hitting http://issues.apache.org/jira/browse/HADOOP-4906. What version of Hadoop are you running?

Arun
Re: Can anyone verify Hadoop FS shell command return codes?
On Mon, Feb 23, 2009 at 4:02 PM, S D wrote:

> I'm attempting to use the Hadoop FS shell
> (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a Ruby
> script. My challenge is that I'm unable to get the return value of the
> commands I'm invoking. As an example, I try to run get as follows:
>
>     hadoop fs -get /user/hadoop/testFile.txt .
>
> From the command line this generally works, but I need to be able to verify
> that it is working during execution in my Ruby script. The command should
> return 0 on success and -1 on error. Based on
> http://pasadenarb.com/2007/03/ruby-shell-commands.html I am using backticks
> to make the hadoop call and get the return value. Here is a dialogue within
> irb (Ruby's interactive shell) in which the command was not successful:
>
>     irb(main):001:0> `hadoop dfs -get testFile.txt .`
>     get: null
>     => ""
>
> and a dialogue within irb in which the command was successful:
>
>     irb(main):010:0> `hadoop dfs -get testFile.txt .`
>     => ""
>
> In both cases, neither a 0 nor a 1 appeared as a return value; indeed
> nothing was returned. Can anyone who is using FS shell return values from a
> scripting language (Ruby, PHP, Perl, ...) please confirm that they work as
> expected or send an example snippet?

You seem to be confusing the captured stdout output with the exit status. Try analyzing $?.exitstatus in Ruby:

irb(main):001:0> `true`
=> ""
irb(main):002:0> $?.exitstatus
=> 0
irb(main):003:0> `false`
=> ""
irb(main):004:0> $?.exitstatus
=> 1

--
WBR, Mikhail Yakshin
Re: OutOfMemory error processing large amounts of gz files
Arun C Murthy-2 wrote:

> On Feb 24, 2009, at 4:03 PM, bzheng wrote:
>
>> 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
>> java.lang.OutOfMemoryError: Java heap space
>
> That tells you that your TaskTracker is running out of memory, not your
> reduce tasks. I think you are hitting
> http://issues.apache.org/jira/browse/HADOOP-4906. What version of Hadoop
> are you running?
>
> Arun

I'm using 0.18.2. We figured that gz may not be the root problem: when we ran a big job not involving any gz files, we got the same out-of-memory problem after about 1.5 hours. One interesting thing, though: if we do use gz files, the out-of-memory issue occurs within a few minutes.
Re: Eclipse plugin
Iman-4,

I have encountered the same problem that you have encountered: not being able to access HDFS on my Hadoop VMware Linux server (using the Yahoo Hadoop tutorial), and not seeing hadoop.job.ugi in my Eclipse Europa 3.3.2 list of parameters. What did you have to do or change to get it to work?

Thanks,
John L.

Iman-4 wrote:

> Thank you so much, Norbert. It worked.
> Iman
>
> Norbert Burger wrote:
>
>> Are you running Eclipse on Windows? If so, be aware that you need to spawn
>> Eclipse from within Cygwin in order to access HDFS. It seems that the
>> plugin uses whoami to get info about the active user. This thread has some
>> more info:
>> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e
>>
>> Norbert
>>
>> On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote:
>>
>>> Hi, I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse
>>> plug-in for Hadoop. I have followed all the steps in this tutorial:
>>> http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
>>> problem is that I am not able to browse the HDFS. It only shows an entry
>>> "Error: null". "Upload files to DFS" and "Create new directory" fail.
>>> Any suggestions? I have tried to change all the directories in the Hadoop
>>> location advanced parameters to /tmp/hadoop-user, but it did not work.
>>> Also, the tutorial mentioned a parameter hadoop.job.ugi that needs to be
>>> changed, but I could not find it in the list of parameters.
>>> Thanks,
>>> Iman
Shuffle phase
Do the reducers batch copy map outputs from a machine? That is, if a machine M has 15 intermediate map outputs destined for machine R, will machine R copy the intermediate outputs one at a time or all at once?
Re: Eclipse plugin
Hi John,

When I created the Hadoop location, hadoop.job.ugi did not appear in the advanced parameters. But when I later edited the location, it was there; I don't know how that got fixed. :)

Also, to get it to work, I had to edit fs.default.name and mapred.job.tracker in hadoop/conf/hadoop-site.xml. I added these lines:

<property>
  <name>fs.default.name</name>
  <value>hdfs://ip_address:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>ip_address:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Finally, I decided to install Hadoop locally on my machine instead of using the Hadoop virtual machine.

Iman

John Livingstone wrote:

> Iman-4,
>
> I have encountered the same problem that you have encountered: not being
> able to access HDFS on my Hadoop VMware Linux server (using the Yahoo
> Hadoop tutorial), and not seeing hadoop.job.ugi in my Eclipse Europa 3.3.2
> list of parameters. What did you have to do or change to get it to work?
>
> Thanks,
> John L.
>
> [earlier quoted thread snipped]
Re: Shuffle phase
On Feb 26, 2009, at 2:03 PM, Nathan Marz wrote:

> Do the reducers batch copy map outputs from a machine? That is, if a
> machine M has 15 intermediate map outputs destined for machine R, will
> machine R copy the intermediate outputs one at a time or all at once?

Currently, one at a time. In 0.21 it will be batched up.

-- Owen
Atomicity of file operations?
What kind of atomicity/visibility claims are made regarding the various operations on a FileSystem? I have multiple processes that write into local sequence files and then upload them into a remote directory in HDFS. A map/reduce job runs which operates on whatever is in the directory. The processes are not synchronized with the job, so it is entirely possible that the job might start while a file is being uploaded. Thus, my concern is that the job may include a partially uploaded file if FileSystem.copyFromLocalFile is not atomic (in the sense that the file should not appear until all bytes are written).

Are any of the FileSystem APIs atomic in this sense? What about, at the very least, rename (e.g. first write to a temp HDFS location, then use rename to atomically flip the file into the live directory)?

Thanks,
Brian
Announcing CloudBase-1.2 release
Hi,

We have released version 1.2 of CloudBase on SourceForge:
http://cloudbase.sourceforge.net/

[CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is released to the open source community under the GNU GPL license.]

Please give it a try and send us your feedback on the CloudBase users group:
http://groups.google.com/group/cloudbase-users

Thanks,
Tarandeep

Release notes
-------------

New features:

* User Defined Functions (UDFs) - users can create functions in the Java programming language and call them from SQL.
* Table indexing - one can create an index on the columns of a table to reduce query execution time.
* ORDER BY improvements - use all machines in the cluster to perform sorting. This is done by sampling the data and sending keys to the correct partitioners.
* TRUNCATE statement - delete all data of a table.

Bug fixes:

* CloudBase was not working with Hadoop 0.19 or later versions.
* Full outer join was not working.
* New jars copied into the $CLOUDBASE_HOME/lib directory were not picked up for the next query execution.

The online documentation has been updated with the new features:
http://cloudbase.sourceforge.net/index.html#userDoc
Re: Atomicity of file operations?
On Feb 26, 2009, at 4:14 PM, Brian Long wrote:

> What kind of atomicity/visibility claims are made regarding the various
> operations on a FileSystem? I have multiple processes that write into local
> sequence files and then upload them into a remote directory in HDFS. A
> map/reduce job runs which operates on whatever is in the directory. The
> processes are not synchronized with the job, so it is entirely possible
> that the job might start while a file is being uploaded. Thus, my concern
> is that the job may include a partially uploaded file if
> FileSystem.copyFromLocalFile is not atomic (in the sense that the file
> should not appear until all bytes are written).

Hey Brian,

I can't speak for knowing about the whole file system, but I do know that, like you'd expect in Unix, open files which are being written to are visible.

> Are any of the FileSystem APIs atomic in this sense? What about, at the
> very least, rename (e.g. first write to a temp HDFS location, then use
> rename to atomically flip the file into the live directory)?

I'm not sure on this one; I suspect you're safe here.

Brian
How to deal with HDFS failures properly
I'm wondering what the proper actions to take are, in light of a NameNode or DataNode failure, in an application which is holding a reference to a FileSystem object.

* Does the FileSystem handle all of this itself (e.g. reconnect logic)?
* Do I need to get a new FileSystem using .get(Configuration)?
* Does the FileSystem need to be closed before re-getting?
* Do the answers to these questions depend on whether it's a NameNode or DataNode that has failed?

In short, how does an application (not a Hadoop job -- just an app using HDFS) properly recover from a NameNode or DataNode failure? I haven't figured out the magic juju yet, and my applications are not handling DFS outages gracefully.

Thanks,
Brian
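For what it's worth, here is a minimal sketch of one defensive pattern, assuming (and it is only an assumption; nothing in this thread confirms it) that dropping the handle and re-getting the FileSystem is enough to recover once the NameNode comes back. The class name, method name, and retry policy are invented for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: retry an HDFS call a few times, dropping and
// re-acquiring the FileSystem handle between attempts.
public class HdfsRetryExample {

    public static boolean existsWithRetry(Configuration conf, Path path,
                                          int maxAttempts) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            FileSystem fs = null;
            try {
                fs = FileSystem.get(conf);   // may return a cached, shared instance
                return fs.exists(path);      // the HDFS call we actually care about
            } catch (IOException e) {
                lastFailure = e;
                try {
                    if (fs != null) {
                        fs.close();          // drop the possibly stale handle; note this
                                             // also affects other users of a shared instance
                    }
                } catch (IOException ignored) {
                    // best effort; we are about to retry anyway
                }
                try {
                    Thread.sleep(5000L * attempt);  // crude back-off between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw lastFailure != null
                ? lastFailure
                : new IOException("existsWithRetry: no attempts were made");
    }
}

The caveat in the comments is real: if FileSystem.get(Configuration) hands back a shared instance, closing it in a multi-threaded application can break other users of that handle, so whether closing before re-getting is even advisable is part of the open question above.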
Re: Atomicity of file operations?
Thanks, Brian. I will go with the copy-to-tmp-and-flip-with-rename model.

-B

On Thu, Feb 26, 2009 at 3:49 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

> On Feb 26, 2009, at 4:14 PM, Brian Long wrote:
>
>> What kind of atomicity/visibility claims are made regarding the various
>> operations on a FileSystem? [...]
>
> Hey Brian,
>
> I can't speak for knowing about the whole file system, but I do know that,
> like you'd expect in Unix, open files which are being written to are
> visible.
>
>> Are any of the FileSystem APIs atomic in this sense? What about, at the
>> very least, rename (e.g. first write to a temp HDFS location, then use
>> rename to atomically flip the file into the live directory)?
>
> I'm not sure on this one; I suspect you're safe here.
>
> Brian
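For anyone landing on this thread later, a minimal sketch of that copy-to-tmp-and-flip-with-rename model in Java. The class, method, and directory layout are invented for illustration, and the key assumption -- that the HDFS rename exposes the fully written file in the live directory in one step -- is the property being relied on above, not something this thread guarantees:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical uploader: copy the local file into a temp directory the
// map/reduce job never reads, then rename it into the live input directory.
public class AtomicUploadExample {

    public static void upload(Configuration conf, Path localFile,
                              Path tmpDir, Path liveDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);

        // 1. Upload into the temp location; a partially written file can
        //    only ever exist here, outside the job's input directory.
        Path tmpTarget = new Path(tmpDir, localFile.getName());
        fs.copyFromLocalFile(localFile, tmpTarget);

        // 2. Flip the fully written file into the live directory. The
        //    assumption is that rename makes it appear there all at once.
        Path liveTarget = new Path(liveDir, localFile.getName());
        if (!fs.rename(tmpTarget, liveTarget)) {
            throw new IOException("rename failed: " + tmpTarget + " -> " + liveTarget);
        }
    }
}

Usage would be something like upload(conf, new Path("/local/part-0001.seq"), new Path("/user/hadoop/tmp"), new Path("/user/hadoop/incoming")), with the job's input path pointed only at the live directory (all of these paths are made up for the example).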