Re: Multiple file output

2010-01-06 Thread Amareshwari Sri Ramadasu
No. It is part of branch 0.21 onwards. For 0.20.*, people can only use the old API,
even though JobConf is deprecated.
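For 0.20.* with the old API, the usual pattern with org.apache.hadoop.mapred.lib.MultipleOutputs
looks roughly like the sketch below (class, job, and output names are only illustrative, not code
taken from this thread):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class PerFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    // The named output must be declared at job setup time, e.g.:
    // MultipleOutputs.addNamedOutput(conf, "perfile",
    //     TextOutputFormat.class, Text.class, Text.class);
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      // Write records to the named output instead of the default collector.
      mos.getCollector("perfile", reporter).collect(key, values.next());
    }
  }

  @Override
  public void close() throws IOException {
    mos.close(); // flushes and closes all named outputs
  }
}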

-Amareshwari.

On 1/6/10 11:52 AM, Vijay tec...@gmail.com wrote:

org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is not part of the
released version of 0.20.1 right? Is this expected to be part of 0.20.2 or
later?


2010/1/5 Amareshwari Sri Ramadasu amar...@yahoo-inc.com

 In branch 0.21, you can get the functionality of both
 org.apache.hadoop.mapred.lib.MultipleOutputs and
 org.apache.hadoop.mapred.lib.MultipleOutputFormat in
 org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Please see
 MAPREDUCE-370 for more details.

 Thanks
 Amareshwari

 On 1/5/10 5:56 PM, 松柳 lamfeeli...@gmail.com wrote:

 I'm afraid you have to write it yourself, since there is no equivalent
 class in the new API.

 2009/12/28 Huazhong Ning n...@akiira.com

  Hi all,
 
  I need your help with multiple file output. I have many big files, and I
  hope the processing result of each file is written to a separate file. I
  know that in the old Hadoop APIs the class MultipleOutputFormat works for
  this purpose, but I cannot find the same class in the new APIs. Does
  anybody know how to solve this problem in the new APIs?
  Thanks a lot.
 
  Ning, Huazhong
 
 
 





Re: Matthew McCullough to Speak on Dividing and Conquering Hadoop at GIDS 2010

2010-01-06 Thread Alexandre Jaquet
Hi,

Do you know if the presentation will be available over the Internet
afterwards, or whether there will be any broadcast?

Thx


Dynamically Adding Map Slots

2010-01-06 Thread Navraj S. Chohan
Hello,
Is it possible to add more map slots per node during the runtime of a MR
job?
Thanks.

-- 
Navraj S. Chohan
nlak...@gmail.com


Re: Dynamically Adding Map Slots

2010-01-06 Thread Matei Zaharia
Not in any nice way, as far as I know. You could shut down the TaskTrackers one 
at a time, update their config files to add slots, and start them up again, but 
you'd cause some tasks to fail this way, and you might also have the JobTracker 
deciding that map outputs on a given TT can't be fetched and re-running those 
maps elsewhere.
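For reference, the per-node slot counts are static TaskTracker settings, read from
mapred-site.xml when the TaskTracker starts (the values below are just an example):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

They are not job-level settings, which is why changing them means restarting the
TaskTrackers as described above.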

On Jan 6, 2010, at 9:29 AM, Navraj S. Chohan wrote:

 Hello,
 Is it possible to add more map slots per node during the runtime of a MR
 job?
 Thanks.
 
 -- 
 Navraj S. Chohan
 nlak...@gmail.com



Configuration values only needed by master daemons, only by slaves, or both

2010-01-06 Thread Derek Brown
I'd like to minimize clutter and unneeded values in the core-, hdfs-, and
mapred-site.xml files that appear on the master and on the slaves, keeping
only the values that are actually used in the files on the NN/SNN/JT and in
the files on the DNs/TTs. Some values are clearly needed only on the master,
only on the slaves, or on both, but for many it's not clear.
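For example, a rough (and possibly incomplete) sketch of how some common
properties split across daemons:

dfs.name.dir                           -- read by the NameNode
dfs.data.dir                           -- read by the DataNodes
mapred.tasktracker.map.tasks.maximum   -- read by each TaskTracker
fs.default.name, mapred.job.tracker    -- needed by daemons and clients everywhere,
                                          to locate the NameNode and JobTracker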

Is there a summary containing this information? I know that the
{core|hdfs|mapred}-default.html files in the distribution's docs directory
list all values, with defaults and descriptions, but they don't always say
which daemon(s) need them.

Thanks.


Re: debian package of hadoop

2010-01-06 Thread Isabel Drost
On Monday 04 January 2010 13:37:48 Steve Loughran wrote:
 Jordà Polo wrote:
  I have been thinking about an official Hadoop Debian package for a while
  too.

 If you want "official" as in "can say Apache Hadoop on it", then it will
 need to be managed and released as an Apache project. That means
 somewhere in ASF SVN. If you want to cut your own, please give it a
 different name to avoid problems later.

Huh? I am lost and confused here: as far as I understood, Thomas is trying to
create a Debian package which then goes into the Debian distribution
(possibly sid at the moment).

The same was done e.g. with Lucene, httpd, Tomcat etc. All of these packages are
maintained by Debian people and not pushed by Apache guys. Still, the packages
are named tomcat5.5, apache2.2-common, liblucene-java. So it seems possible
to name official Debian packages after the upstream Apache project without
much trouble.

Isabel

-- 
  |\  _,,,---,,_   Web:   http://www.isabel-drost.de
  /,`.-'`'-.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net





Re: debian package of hadoop

2010-01-06 Thread Isabel Drost
On Monday 04 January 2010 15:46:55 Steve Loughran wrote:

 What use cases are you thinking of here?

 1) developer coding against the hadoop Java and C APIs

+1


 2) Someone setting up a small 1-5 machine cluster

+0


 3) large production datacentre of hundreds of worker nodes
 4) transient virtualised worker nodes

Installing Hadoop on Debian, for me, would mean something like providing the
minimal installation that gives me a running Hadoop node. I would guess that
clusters of hundreds of worker nodes differ enough from one another to
require additional configuration work on the administrator's side anyway.

If this were a wish list, I would love to be able to install a package for hdfs,
one for map reduce, and another one for hbase (which itself depends on hdfs
and map reduce). There should be one that is binary only, one for the
development libs (as I would love to code against the Hadoop APIs), and there
will probably be one for the documentation. I would find configuration files
where I expect them to be (somewhere under /etc/hadoop/ maybe) and data where
it belongs (/var/hadoop?). The setup would help me get Hadoop up and running
easily as a newbie (something like apt-get install hadoop, maybe adjusting
some configuration afterwards to add more nodes to the cluster). It would
make upgrading to new Hadoop versions less painful. ;)

Isabel

-- 
  |\  _,,,---,,_   Web:   http://www.isabel-drost.de
  /,`.-'`'-.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  xmpp://main...@spaceboyz.net





NYC Event: Hadoop a Whirlwind Tour

2010-01-06 Thread Edward Capriolo
Sorry for the short notice. Tonight, January 06, 2010 at 6:45, I will be
giving a presentation on Hadoop for NYC BUG (the NYC BSD Users Group), who
asked me to do it.

Presentation Information:
http://www.nycbug.org/index.php?NAV=Home;SUBM=10260
Slides:
http://www.nycbug.org/files/meeting_2010-01.pdf
Description:
This presentation gives a brief high level overview of Hadoop. Next,
we hit the ground running with a quick practical example of how Hadoop
solves a big data problem. We also discuss how the demonstrated
Hadoop processing model scales out to terabytes of data and hundreds
or even thousands of computers.

I am excited about this because it is a chance to bring some BSD users into
the Hadoop fold.
I also built a preliminary FreeBSD port of Hadoop
http://www.jointhegrid.com/jtg_ports/
in case someone wants to dive into Hadoop after the presentation.

Again, sorry for the short notice.

Edward


SF HBase User Group Meetup Jan. 27th @ StumbleUpon

2010-01-06 Thread Jean-Daniel Cryans
Hi all,

This year's first San Francisco HBase User Group meetup takes place on
January 27th at StumbleUpon. The first talk will be about the upcoming
versions, others to be announced.

RSVP at: http://su.pr/6Cldz7

See you there!

J-D


mapper runs into deadlock when using custom InputReader

2010-01-06 Thread Ziawasch Abedjan
Hi,

we have an application that runs into a never-ending mapper
routine when we start it with more than one mapper. If we
start the application on a cluster or pseudo-distributed cluster with only one
mapper and one reducer, it works fine. We use a custom FileInputFormat
with a custom RecordReader; their code is attached.

This
is the map function. For clarity I removed most of the code, because
there is no error within the map function itself. As you will see in the log
messages below for a run with two mappers, both mappers run completely
through the map code with no error. The problem is somewhere after
the map and before the reduce part of the run, and as said before it
only occurs if we use more than one mapper. When it has done 50%
mapping and 16% reducing, it no longer responds and runs indefinitely.

public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
        Reporter reporter) throws IOException {

        LOG.info("Masks: " +
            BinaryStringConverter.parseLongToBinaryString(MASKS[0]) + ", " +
            BinaryStringConverter.parseLongToBinaryString(MASKS[1]) + ", " +
            BinaryStringConverter.parseLongToBinaryString(MASKS[2]) + ", " +
            BinaryStringConverter.parseLongToBinaryString(MASKS[3]));
.
...
            LOG.info("Finished with mapper commands.");
        }



When starting the application with more than one mapper, every mapper
reaches the last LOG.info output of the map function.

But the output logs of the mapper look like this:
Mapper that failed:

2010-01-05 15:34:16,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=

2010-01-05 15:34:16,796 INFO de.hpi.hadoop.duplicates.LongRecordReader: 
Splitting from 0 to 800 length: 800

2010-01-05 15:34:16,828 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2

2010-01-05 15:34:16,859 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100

2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
79691776/99614720

2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: record buffer = 
262144/327680
.
...
2010-01-05 15:34:26,828 INFO de.hpi.hadoop.duplicates.DuplicateFinder: Finished 
with mapper commands.

Mapper that does not fail:

2010-01-05 15:34:16,656 INFO
 org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with 
processName=MAP, sessionId=
2010-01-05 15:34:16,828 INFO de.hpi.hadoop.duplicates.LongRecordReader: 
Splitting from 800 to 1600 length: 800
2010-01-05 15:34:16,843 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 2
2010-01-05 15:34:16,859 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: data buffer = 
79691776/99614720
2010-01-05 15:34:25,609 INFO org.apache.hadoop.mapred.MapTask: record buffer = 
262144/327680
.
...
2010-01-05 15:34:26,765 INFO de.hpi.hadoop.duplicates.DuplicateFinder: Finished 
with mapper commands.
2010-01-05 15:34:26,765 INFO org.apache.hadoop.mapred.MapTask: Starting flush 
of map output
2010-01-05 15:34:27,531 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-01-05 15:34:27,578 INFO org.apache.hadoop.mapred.TaskRunner:
 Task:attempt_201001051529_0002_m_01_0 is done. And is in the process of 
commiting
2010-01-05 15:34:27,656 INFO org.apache.hadoop.mapred.TaskRunner: Task 
'attempt_201001051529_0002_m_01_0' done.


Please share if you have faced a similar problem, know the solution, or
need more information.


Thanks,
Ziawasch Abedjan


package de.hpi.hadoop.duplicates;

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class LongRecordReader implements RecordReader<LongWritable, Text> {
	private long start;
	private long pos;
	private long end;
	private LongReader in;
	private LongWritable key = null;
	private Text value = null;
	
	private static final Log LOG = LogFactory.getLog(LongRecordReader.class);
	
	public LongRecordReader(FileSplit split, Configuration job) throws IOException {
		
		start = split.getStart();
		end = start + split.getLength();
		final Path file = split.getPath();
		
		LOG.info("Splitting from " + start + " to " + end + " length: " + split.getLength());

		// open the file and seek to the start of the split
		FileSystem fs = file.getFileSystem(job);
		FSDataInputStream fileIn = fs.open(split.getPath());
		if (start != 0) {
			

Hadoop 0.20.1 Amazon Image, Permission error?

2010-01-06 Thread 松柳
Hi all, I created an Amazon image for Hadoop 0.20.1. It seemed OK when I
finished bundling, but when I launched the cluster using the hadoop-ec2 command
line, Hadoop didn't start up on the machines.

I checked the files in the usr/local directory and found that none of them
have execute permission. I guess this is the problem: the bundle script
doesn't make the JDK and Hadoop binaries executable.

Can anyone tell me whether I'm right? If so, do I need to change the script
accordingly?

Thanks in advance.

Song Liu