Re: Multiple file output
No. It is part of branch 0.21 onwards. For 0.20*, people can use the old API only, though JobConf is deprecated. -Amareshwari.

On 1/6/10 11:52 AM, Vijay tec...@gmail.com wrote:
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is not part of the released version of 0.20.1, right? Is this expected to be part of 0.20.2 or later?

2010/1/5 Amareshwari Sri Ramadasu amar...@yahoo-inc.com:
In branch 0.21, you can get the functionality of both org.apache.hadoop.mapred.lib.MultipleOutputs and org.apache.hadoop.mapred.lib.MultipleOutputFormat in org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Please see MAPREDUCE-370 for more details. Thanks, Amareshwari

On 1/5/10 5:56 PM, 松柳 lamfeeli...@gmail.com wrote:
I'm afraid you have to write it yourself, since there are no equivalent classes in the new API.

2009/12/28 Huazhong Ning n...@akiira.com:
Hi all, I need your help on multiple file output. I have many big files and I want the processing result of each file to be output to a separate file. I know that in the old Hadoop APIs, the class MultipleOutputFormat works for this purpose, but I cannot find the same class in the new APIs. Does anybody know how to solve this problem with the new APIs? Thanks a lot. Ning, Huazhong
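For anyone stuck on 0.20.x, here is a minimal sketch of the old-API route Amareshwari describes, using org.apache.hadoop.mapred.lib.MultipleOutputs. The named output "perfile", the reducer class name, and the Text key/value types are illustrative assumptions, not code from this thread:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class PerFileReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private MultipleOutputs mos;

      // In the driver, register the named output once before submitting:
      //   MultipleOutputs.addNamedOutput(conf, "perfile",
      //       TextOutputFormat.class, Text.class, Text.class);

      @Override
      public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
      }

      @SuppressWarnings("unchecked")
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          // Records sent here land in files named after the named output,
          // separate from the job's default part files.
          mos.getCollector("perfile", reporter).collect(key, values.next());
        }
      }

      @Override
      public void close() throws IOException {
        mos.close(); // required, or the named-output files are never flushed
      }
    }

Note that named outputs must be alphanumeric, so you cannot use a raw file path as the name; routing by input file means mapping each file to a legal name in your own code.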
Re: Matthew McCullough to Speak on Dividing and Conquering Hadoop at GIDS 2010
Hi, do you know if the presentation will be available over the Internet when finished, or if there will be any broadcast? Thx
Dynamically Adding Map Slots
Hello, Is it possible to add more map slots per node during the runtime of a MR job? Thanks. -- Navraj S. Chohan nlak...@gmail.com
Re: Dynamically Adding Map Slots
Not in any nice way, as far as I know. You could shut down the TaskTrackers one at a time, update their config files to add slots, and start them up again, but you'd cause some tasks to fail this way, and you might also have the JobTracker deciding that map outputs on a given TT can't be fetched and re-running those maps elsewhere.

On Jan 6, 2010, at 9:29 AM, Navraj S. Chohan wrote:
Hello, Is it possible to add more map slots per node during the runtime of a MR job? Thanks. -- Navraj S. Chohan nlak...@gmail.com
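For reference, the per-node slot count that a TaskTracker reads at startup is mapred.tasktracker.map.tasks.maximum, so the rolling-restart approach above amounts to editing something like the following in each TT's mapred-site.xml (the value 4 is just an example) and bouncing the daemon:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>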
Configuration values only needed by master daemons, only by slaves, or both
I'd like to minimize clutter and unneeded values in the core-, hdfs-, and mapred-site.xml files on the master and on the slaves: the files on the NN/SNN/JT should contain only the values those daemons actually use, and likewise for the files on the DNs/TTs. Some values are clearly needed only on the master, only on the slaves, or on both, but for many it's not clear. Is there a summary containing this information? I know that the {core|hdfs|mapred}-default.html files in the distribution's docs directory list all values with defaults and descriptions, but they don't always say which daemon(s) need them. Thanks.
Re: debian package of hadoop
On Monday 04 January 2010 13:37:48 Steve Loughran wrote:
Jordà Polo wrote: I have been thinking about an official Hadoop Debian package for a while too.
If you want "official" as in being able to say "Apache Hadoop" on it, then it will need to be managed and released as an Apache project. That means somewhere in ASF SVN. If you want to cut your own, please give it a different name to avoid problems later.

Huh? I am lost and confused here: as far as I understood, Thomas is trying to create a Debian package which then goes into the Debian distribution (possibly sid at the moment). The same was done e.g. with Lucene, httpd, Tomcat etc. All of these packages are maintained by Debian people and not pushed by Apache folks. Still, the packages are named tomcat5.5, apache2.2-common, liblucene-java. So it seems possible to name official Debian packages similarly to the upstream Apache project without much problem.

Isabel
Re: debian package of hadoop
On Monday 04 January 2010 15:46:55 Steve Loughran wrote: What use cases are you thinking of here?
1) developer coding against the hadoop Java and C APIs (+1 from me)
2) someone setting up a small 1-5 machine cluster (+0)
3) large production datacentre of hundreds of worker nodes
4) transient virtualised worker nodes

Installing Hadoop on Debian would, for me, mean something like providing the minimal installation that gives me a running Hadoop node. I would guess that clusters of hundreds of worker nodes differ enough from one another to require additional configuration work on the administrator's side anyway.

If this were a wish list, I would love to be able to install one package for HDFS, one for MapReduce, and another one for HBase (which itself depends on HDFS and MapReduce). There should be one that is binary only, one for the development libs (as I would love to code against the Hadoop APIs), and there will probably be one for the documentation. I would want to find configuration files where I expect them (somewhere under /etc/hadoop/ maybe) and data where it belongs (/var/hadoop?). The setup would help a newbie get Hadoop up and running easily (something like apt-get install hadoop, maybe adjusting some configuration afterwards to add more nodes to the cluster). It would make upgrading to new Hadoop versions less painful. ;)

Isabel
NYC Event: Hadoop a Whirlwind Tour
Sorry for the short notice. Tonight, January 06, 2010 at 6:45, NYC BUG (the NYC BSD Users Group) has asked me to do a presentation on Hadoop.
Presentation information: http://www.nycbug.org/index.php?NAV=Home;SUBM=10260
Slides: http://www.nycbug.org/files/meeting_2010-01.pdf
Description: This presentation gives a brief high-level overview of Hadoop. Next, we hit the ground running with a quick practical example of how Hadoop solves a big-data problem. We also discuss how the demonstrated Hadoop processing model scales out to terabytes of data and hundreds or even thousands of computers.
I am excited because it is a chance to bring some BSD users into the Hadoop fold. I also built a preliminary FreeBSD port of Hadoop, http://www.jointhegrid.com/jtg_ports/ , in case someone wants to dive into Hadoop after the presentation. Again, sorry for the short notice. Edward
SF HBase User Group Meetup Jan. 27th @ StumbleUpon
Hi all, This year's first San Francisco HBase User Group meetup takes place on January 27th at StumbleUpon. The first talk will be about the upcoming versions, others to be announced. RSVP at: http://su.pr/6Cldz7 See you there! J-D
mapper runs into deadlock when using custom InputReader
Hi, we have an application that runs into a never-ending mapper routine when we start it with more than one mapper. If we start the application on a cluster or pseudo-distributed cluster with only one mapper and one reducer, it runs fine. We use a custom FileInputFormat with a custom RecordReader; their code is attached.

This is the mapper function. For clarity I removed most of the code, because there is no error within the map function. As you can see in the log messages below for a run with two mappers, both mappers run completely through the map code with no error. The problem is somewhere after the map and before the reduce part of the run, and as said before, it only shows up if we use more than one mapper. When the job reaches 50% mapping and 16% reducing it no longer responds and runs indefinitely.

public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
  LOG.info("Masks: " + BinaryStringConverter.parseLongToBinaryString(MASKS[0]) + ", "
      + BinaryStringConverter.parseLongToBinaryString(MASKS[1]) + ", "
      + BinaryStringConverter.parseLongToBinaryString(MASKS[2]) + ", "
      + BinaryStringConverter.parseLongToBinaryString(MASKS[3]));
  ...
  LOG.info("Finished with mapper commands.");
}

When starting the application with more than one mapper, every mapper reaches the last LOG.info output of the mapping function, but the output logs of the mappers look like this:

Mapper that failed:
2010-01-05 15:34:16,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-01-05 15:34:16,796 INFO de.hpi.hadoop.duplicates.LongRecordReader: Splitting from 0 to 800 length: 800
2010-01-05 15:34:16,828 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2
2010-01-05 15:34:16,859 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
...
2010-01-05 15:34:26,828 INFO de.hpi.hadoop.duplicates.DuplicateFinder: Finished with mapper commands.

Mapper that does not fail:
2010-01-05 15:34:16,656 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-01-05 15:34:16,828 INFO de.hpi.hadoop.duplicates.LongRecordReader: Splitting from 800 to 1600 length: 800
2010-01-05 15:34:16,843 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2
2010-01-05 15:34:16,859 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
...
2010-01-05 15:34:26,765 INFO de.hpi.hadoop.duplicates.DuplicateFinder: Finished with mapper commands.
2010-01-05 15:34:26,765 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2010-01-05 15:34:27,531 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-01-05 15:34:27,578 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201001051529_0002_m_01_0 is done. And is in the process of commiting
2010-01-05 15:34:27,656 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201001051529_0002_m_01_0' done.

Please share if you have faced a similar problem, if you know the solution, or if you need more information. Thanks, Ziawasch Abedjan
package de.hpi.hadoop.duplicates;

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class LongRecordReader implements RecordReader<LongWritable, Text> {

  private long start;
  private long pos;
  private long end;
  private LongReader in;
  private LongWritable key = null;
  private Text value = null;

  private static final Log LOG = LogFactory.getLog(LongRecordReader.class);

  public LongRecordReader(FileSplit split, Configuration job) throws IOException {
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    LOG.info("Splitting from " + start + " to " + end + " length: " + split.getLength());
    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());
    if (start != 0) {
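The attachment is cut off above, so the next() implementation is not visible. Not part of the original message, but for readers hitting the same symptom: a map task that logs its last map() call yet never reaches "Starting flush of map output" usually means the RecordReader's next() never returns false for its split. A hypothetical sketch of the boundary check, assuming a LongReader.readLong(Text) helper that returns the number of bytes consumed (the poster's actual LongReader API is unknown):

    public boolean next(LongWritable key, Text value) throws IOException {
      // Stop as soon as we pass the end of our split; without this check
      // a mapper on a later split can read forever and the task never
      // flushes its map output.
      if (pos >= end) {
        return false;
      }
      key.set(pos);
      int consumed = in.readLong(value); // hypothetical helper: returns 0 at EOF
      if (consumed == 0) {
        return false;
      }
      pos += consumed;
      return true;
    }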
Hadoop 0.20.1 Amazon Image, Permission error?
Hi all, I created an Amazon image for Hadoop 0.20.1. Bundling seemed to finish OK, but when I launched the cluster using the hadoop-ec2 command line, Hadoop didn't start up with the machines. I checked the files in the /usr/local directory and found that all of them are missing execute permission. I guess this is the problem: the bundle script doesn't set execute permission on the JDK and Hadoop binaries. Can anyone tell me, am I right? If so, do I need to change the script accordingly? Thanks in advance. Song Liu