Re: Splitting in various files

2008-04-21 Thread Aayush Garg
Could anyone please tell?

On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg [EMAIL PROTECTED] wrote:

 Hi,

 I have written the following code for writing my key,value pairs in the
 file, and this file is then read by another MR.

   Path pth = new Path("./dir1/dir2/filename");
   FileSystem fs = pth.getFileSystem(jobconf);
   SequenceFile.Writer sqwrite = new
 SequenceFile.Writer(fs, conf, pth, Text.class, Custom.class);
   sqwrite.append(key, value);
   sqwrite.close();

 My problem is that my data gets written to one file (filename). How can it be
 split across a number of files? If I give only the path of a directory in
 this program then it does not compile.

 I give only the directory path /dir1/dir2 to another MapReduce job and it
 reads the file.

 Thanks,




-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Interleaving maps/reduces from multiple jobs on the same tasktracker

2008-04-21 Thread Amar Kamat

Amar Kamat wrote:

Jiaqi Tan wrote:

Hi,

Will Hadoop ever interleave multiple maps/reduces from different jobs
on the same tasktracker?
  

No.

Suppose I have 2 jobs submitted to a jobtracker, one after the other.
Must all maps/reduces from the first submitted job be completed before
the tasktrackers will run any of the maps/reduces from the second
submitted job?
  
Tasks from a new job are never scheduled until the tasks from other 
higher-priority jobs are completely scheduled. Among jobs with the 
same priority, the start time is what finally matters. A job's 
priority at the JobTracker can be changed on the fly, but as of now that 
is not supported by the JobClient.

But you can change it through the job's web UI.
Amar

Amar

Thanks.

Jiaqi Tan
  






datanode files list

2008-04-21 Thread Shimi K
Is there a way to get the list of files on each datanode?
I need to be able to get all the names of the files on a specific datanode.
Is there a way to do it?


Re: jar files on NFS instead of DistributedCache

2008-04-21 Thread Steve Loughran

Joydeep Sen Sarma wrote:

I would love this feature. It does not exist currently. If we set the classpath 
for the tasktracker, then, as mentioned, it applies to all the tasks. If the 
classpath could be set on a per-task basis, that would work as an excellent solution 
in an NFS-based environment for specifying multiple jar files for the job.

Related: see https://issues.apache.org/jira/browse/HADOOP-1622. It asks for 
multiple jars to be packaged into the jobcache, but with NFS all we need is a 
per-job classpath option.



True, but you are creating a SPOF. There is nothing like 200 boxes all 
blocked with their consoles going 'NFS server not responding' for 30s to 
make you wish you weren't using NFS. Because NFS I/O is done in the 
kernel, there's no way to put policy into the apps about retry, 
timeouts and behaviour on failure.


just say no to NFS.

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Splitting in various files

2008-04-21 Thread Aayush Garg
I just tried the same thing (mapred.task.id) as you suggested, but I am getting
one file named null in my directory.

On Mon, Apr 21, 2008 at 8:33 AM, Amar Kamat [EMAIL PROTECTED] wrote:

 Aayush Garg wrote:

  Could anyone please tell?
 
  On Sat, Apr 19, 2008 at 1:33 PM, Aayush Garg [EMAIL PROTECTED]
  wrote:
 
 
 
   Hi,
  
   I have written the following code for writing my key,value pairs in
   the
   file, and this file is then read by another MR.
  
  Path pth = new Path("./dir1/dir2/filename");
  FileSystem fs = pth.getFileSystem(jobconf);
  SequenceFile.Writer sqwrite = new
    SequenceFile.Writer(fs, conf, pth, Text.class, Custom.class);
  sqwrite.append(key, value);
  sqwrite.close();
  
   I problem is I get my data written in one file(filename).. How can it
   be
   split across in the number of files. If I give only the path of
   directory in
  
  
  What do you mean by splitting a file across multiple files? If you want
 a separate file for each map/reduce task then you can use conf.get(
 "mapred.task.id") to get the task id that is unique for that task. Now you
 can name the file like

 Path pth = new Path("./dir1/dir2/" + filename + "-" + conf.get(
 "mapred.task.id"));

 Amar

  this progam then it does not get compiled.
  
   I give only the path of directory /dir1/dir2 to another Map Reduce and
   it
   reads the file.
  
   Thanks,
  
  
  
  
 
 
 
 



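A minimal sketch of the per-task-file idea from this thread, assuming the mapred API of that era; the class, key/value types and output directory here are illustrative, not the original poster's code. Note that mapred.task.id is only set in the configuration seen by a running task, so reading it from a client-side JobConf returns null, which would explain a file literally named "null".

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PerTaskFileReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private SequenceFile.Writer writer;

  public void configure(JobConf conf) {
    try {
      // Unique per task attempt; only available inside the running task.
      String taskId = conf.get("mapred.task.id");
      Path out = new Path("./dir1/dir2/filename-" + taskId);
      FileSystem fs = out.getFileSystem(conf);
      writer = SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      writer.append(key, values.next());   // one SequenceFile per reduce task
    }
  }

  public void close() throws IOException {
    writer.close();
  }
}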

Re: Error in start up

2008-04-21 Thread Aayush Garg
Could anyone please help me with the error below? I am not able to start
HDFS because of it.

Thanks,

On Sat, Apr 19, 2008 at 7:25 PM, Aayush Garg [EMAIL PROTECTED] wrote:

 I have my hadoop-site.xml correct, but it still fails in this way.


 On Sat, Apr 19, 2008 at 6:35 PM, Stuart Sierra [EMAIL PROTECTED]
 wrote:

  On Sat, Apr 19, 2008 at 9:53 AM, Aayush Garg [EMAIL PROTECTED]
  wrote:
I am getting following error on start up the hadoop as pseudo
  distributed::
  
bin/start-all.sh
  
localhost: starting datanode, logging to
  
   
  /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-datanode-R61-neptun.out
localhost: starting secondarynamenode, logging to
  
   
  /home/garga/Documents/hadoop-0.15.3/bin/../logs/hadoop-root-secondarynamenode-R61-neptun.out
 localhost: Exception in thread "main"
  java.lang.IllegalArgumentException:
port out of range:-1
 
  Hello, I'm a Hadoop newbie, but when I got this error it seemed to be
  caused by an empty/incorrect hadoop-site.xml.  See
 
  http://wiki.apache.org/hadoop/QuickStart#head-530fd2e5b7fc3f35a210f3090f125416a79c2e1b
 
  -Stuart
 



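For reference, a minimal hadoop-site.xml along the lines of the pseudo-distributed QuickStart of that era (the host/port values are the usual examples, not required values); leaving fs.default.name empty or malformed is one way to end up with the "port out of range:-1" exception above:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>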



Using ArrayWritable of type IntWritable

2008-04-21 Thread CloudyEye

Hi,

From the API, I should create a new class as follows:

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class);
  }
}

In the reducer, when executing
OutputCollector.collect(WritableComparable key, IntArrayWritable arr)

I am getting something like:
key1[EMAIL PROTECTED]
key2[EMAIL PROTECTED]
key3[EMAIL PROTECTED]
key4[EMAIL PROTECTED]
...
...

What else do I have to override in ArrayWritable to get the IntWritable
values written to the output files by the reducers?
If you have some ready code for such a case I would be very thankful.

Cheers,




Re: New bee quick questions :-)

2008-04-21 Thread Luca

vikas wrote:

Hi,

I'm new to HADOOP. And aiming to develop good amount of code with it. I've
some quick questions it would be highly appreciable if some one can answer
them.

I was able to run HADOOP in cygwin environment. run the examples both in
standalone mode as well as in a 2 node cluster.

1) How can I overcome the difficulty of giving a password for SSH logins
whenever DataNodes are started?



Create SSH keys (either user or host based) and pair the hosts with 
these keys. This is the first URL I got for SSH public key 
authentication: http://sial.org/howto/openssh/publickey-auth/



2) I've put some 1.5 GB of files on my master node, where a DataNode is also
running. I want to see how load balancing can be done so that disk space
will be utilized on the other datanodes as well.



Not sure how to answer this one. HDFS has knowledge of three 
entities: the node, the rack, and the rest. In the default 
configuration, each block is replicated 3 times, one for each entity. If 
you don't have racks, you might want to fine-tune the replication of 
files through the HDFS shell.



3) How can I add a new DataNode without stopping HADOOP.


Just add it to the slaves file and run start-dfs.sh. Already running nodes 
won't be touched.




4) Let us suppose I want to shut down one datanode for maintenance purposes.
Is there any way to inform Hadoop that this particular datanode is
going down -- please make sure the data in it is replicated elsewhere?



Replication of blocks with a factor >= 2 should do the job. In the 
general case, default replication is 3. You can check the replication 
factor through the HDFS shell.



5) I was going through some videos on MapReduce and a few Yahoo tech talks.
In those they were specifying that a Hadoop cluster has multiple cores -- what
does this mean?



Are you talking about multi-core processors?


  5.1) Can I have multiple instances of namenodes running in a cluster apart
from secondary nodes?



Not sure on this, but as far as I know there's only one namenode that 
should be running.



6) If I go on to create huge files, will they be balanced among all the
datanodes? Or do I need to change the file creation location in the
application?



Files are divided into blocks, and blocks are replicated. Huge files are 
simply composed of a larger set of blocks. In principle, you don't know 
where your blocks will end up, apart from the entities I mentioned 
before. And in principle, you shouldn't care where they end up, 
because Hadoop applications will take care of sending tasks where the 
data reside.


Ciao,
Luca
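As a hedged illustration of the replication points above, here is a minimal sketch that checks and raises a file's replication factor through the FileSystem API; the path and the target factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/vikas/bigfile.dat");   // illustrative path

    // Report the current replication factor of the file.
    FileStatus status = fs.getFileStatus(p);
    System.out.println(p + " replication = " + status.getReplication());

    // Raise it so blocks get copied onto more datanodes.
    fs.setReplication(p, (short) 3);
  }
}

The HDFS shell exposes the same setting (a -setrep command in releases that include it).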



hadoop 0.16.3 problems to submit job

2008-04-21 Thread Andreas Kostyrka
Hi!

When I'm trying to submit a streaming job to my EC2/S3 cluster, I'm
getting the following errors:

additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar53992/] [] 
/tmp/streamjob53993.jar tmpDir=null
08/04/21 14:26:11 WARN httpclient.RestS3Service: Retrying request - 
attempt 1 of 5
08/04/21 14:26:11 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 14:26:11 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:27:44 WARN httpclient.RestS3Service: Retrying request - attempt 2 
of 5
08/04/21 14:27:44 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketTimeoutException) caught when processing request: Read timed out
08/04/21 14:27:44 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:32:12 WARN service.S3Service: Encountered 1 S3 Internal Server 
error(s), will retry in 50ms
08/04/21 14:32:21 WARN service.S3Service: Encountered 1 S3 Internal Server 
error(s), will retry in 50ms
08/04/21 14:32:26 WARN service.S3Service: Encountered 1 S3 Internal Server 
error(s), will retry in 50ms
08/04/21 14:32:31 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 14:32:31 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 14:32:31 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:32:35 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 14:32:35 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 14:32:35 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:36:06 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 14:36:06 INFO httpclient.HttpMethodDirector: I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server s3.amazonaws.com failed to respond
08/04/21 14:36:06 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:36:34 WARN httpclient.RestS3Service: Retrying request - attempt 2 
of 5
08/04/21 14:36:34 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 14:36:34 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:41:47 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 14:41:47 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketTimeoutException) caught when processing request: Read timed out
08/04/21 14:41:47 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 14:54:00 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 14:54:00 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 14:54:00 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 15:04:48 INFO mapred.FileInputFormat: Total input paths to process : 
18611
08/04/21 15:06:43 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 15:06:43 INFO httpclient.HttpMethodDirector: I/O exception 
(org.apache.commons.httpclient.NoHttpResponseException) caught when processing 
request: The server s3.amazonaws.com failed to respond
08/04/21 15:06:43 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 15:14:33 WARN service.S3Service: Encountered 1 S3 Internal Server 
error(s), will retry in 50ms
08/04/21 15:14:34 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 15:14:34 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 15:14:34 INFO httpclient.HttpMethodDirector: Retrying request
08/04/21 15:15:02 WARN httpclient.RestS3Service: Retrying request - attempt 1 
of 5
08/04/21 15:15:02 INFO httpclient.HttpMethodDirector: I/O exception 
(java.net.SocketException) caught when processing request: Connection reset
08/04/21 15:15:02 INFO httpclient.HttpMethodDirector: Retrying request

Now I wonder: is this normal? Notice that it took Hadoop 38 minutes to
figure out the number of input files.

Now this has been running for around an hour and I still have no idea whether
the job will be submitted or will hang forever (I had such cases with
0.16.2 where I stopped the submission after half a day).

Any ideas?

TIA,

Andreas Kostyrka




Re: New bee quick questions :-)

2008-04-21 Thread Allen Wittenauer



On 4/21/08 3:36 AM, vikas [EMAIL PROTECTED] wrote:

Most of your questions have been answered by Luca, from what I can see,
so let me tackle the rest a bit...

 4) Let us suppose I want to shut down one datanode for maintenance purposes.
 Is there any way to inform Hadoop that this particular datanode is
 going down -- please make sure the data in it is replicated elsewhere?

You want to do datanode decommissioning.  See
http://wiki.apache.org/hadoop/FAQ#17 for details.

 5) I was going through some videos on MapReduce and a few Yahoo tech talks.
 In those they were specifying that a Hadoop cluster has multiple cores -- what
 does this mean?

I haven't watched the tech talks in ages, but we generally refer to
cores in a variety of ways.  There is the single physical box version--an
individual processor has more than one execution unit, thereby giving it a
degree of parallelism.  Then there is the complete grid count--an individual
grid can have lots and lots of processors with lots and lots of individual
cores on those processors, which works out to be a pretty good rough
estimate of how many individual Hadoop tasks can be run simultaneously.

   5.1) Can I have multiple instances of namenodes running in a cluster apart
 from secondary nodes?

No.  The name node is a single point of failure in the system.
 
 6) If I go on to create huge files, will they be balanced among all the
 datanodes? Or do I need to change the file creation location in the
 application?

In addition to what Luca said, be aware that if you load a file on a
machine with a data node process, the data for that file will *always* get
loaded to that machine.  This can cause your data nodes to get extremely
unbalanced.   You are much better off doing data loads *off grid*/from
another machine.  Since you only need the hadoop configuration and binaries
available (in other words, no hadoop processes need be running), this
usually isn't too painful to do.

In 0.16.x, there is a rebalancer to help fix this situation, but I have
no practical experience with it yet to say whether or not it works.



Re: datanode files list

2008-04-21 Thread Ted Dunning

Datanodes don't necessarily contain complete files.  It is possible to
enumerate all files and to find out which datanodes host different blocks
from these files.

What did you need to do?


On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote:

 Is there a way to get the list of files on each datanode?
 I need to be able to get all the names of the files on a specific datanode?
 is there a way to do it?



Re: datanode files list

2008-04-21 Thread Shimi K
I am using Hadoop HDFS as a distributed file system. On each DFS node I have
another process which needs to read the local HDFS files.
Right now I'm calling the NameNode in order to get the list of all the files
in the cluster. For each file I check whether it is a local file (one of its
block locations is the host of this node); if it is, I read it.
Disadvantages:
* This solution works only if the entire file is not split across nodes.
* It involves the NameNode.
* Each node needs to iterate over all the files in the cluster.

There must be a better way to do it. The perfect way would be to call the
DataNode and get a list of the local files and their blocks.
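A minimal sketch of that per-file locality check, with the caveat Ted raises below that files are made of blocks; the getFileBlockLocations call shown here is from somewhat later Hadoop releases than 0.16 (older releases exposed the same information differently), and the root path is illustrative:

import java.net.InetAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalBlockLister {
  public static void main(String[] args) throws Exception {
    String localHost = InetAddress.getLocalHost().getHostName();
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Walk the files under the given root and report which ones have at
    // least one block replica on this host.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      if (status.isDir()) {
        continue;   // a real version would recurse into directories
      }
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        for (String host : block.getHosts()) {
          if (host.equals(localHost)) {
            System.out.println("local block of " + status.getPath()
                + " at offset " + block.getOffset());
          }
        }
      }
    }
  }
}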

On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote:


 Datanodes don't necessarily contain complete files.  It is possible to
 enumerate all files and to find out which datanodes host different blocks
 from these files.

 What did you need to do?


 On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote:

  Is there a way to get the list of files on each datanode?
  I need to be able to get all the names of the files on a specific
 datanode?
  is there a way to do it?




Re: datanode files list

2008-04-21 Thread Ted Dunning

It is kind of odd that you are doing this.  It really sounds like a
replication of what Hadoop is doing.

Why not just run a map process and have Hadoop figure out which blocks are
where?

Can you say more about *why* you are doing this, not just what you are
trying to do?

On 4/21/08 10:28 AM, Shimi K [EMAIL PROTECTED] wrote:

 I am using Hadoop HDFS as a distributed file system. On each DFS node I have
 another process which needs to read the local HDFS files.
 Right now I'm calling the NameNode in order to get the list of all the files
 in the cluster. For each file I check if it is a local file (one of the
 locations is the host of the node), if it is I read it.
 Disadvantages:
 * This solution works only if the entire file is not split.
 * It involves the NameNode.
 * Each node needs to iterate on all the files in the cluster.
 
 There must be a better way to do it. The perfect way will be to call the
 DataNode and to get a list of the local files and their blocks.
 
 On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote:
 
 
 Datanodes don't necessarily contain complete files.  It is possible to
 enumerate all files and to find out which datanodes host different blocks
 from these files.
 
 What did you need to do?
 
 
 On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote:
 
 Is there a way to get the list of files on each datanode?
 I need to be able to get all the names of the files on a specific
 datanode?
 is there a way to do it?
 
 



Re: datanode files list

2008-04-21 Thread Shimi K
Do you remember the "Caching frequently map input files" thread?
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200802.mbox/[EMAIL 
PROTECTED]


On Mon, Apr 21, 2008 at 8:31 PM, Ted Dunning [EMAIL PROTECTED] wrote:


 This is kind of odd that you are doing this.  It really sounds like a
 replication of what hadoop is doing.

 Why not just run a map process and have hadoop figure out which blocks are
 where?

 Can you say more about *why* you are doing this, not just what you are
 trying to do?

 On 4/21/08 10:28 AM, Shimi K [EMAIL PROTECTED] wrote:

  I am using Hadoop HDFS as a distributed file system. On each DFS node I
 have
  another process which needs to read the local HDFS files.
  Right now I'm calling the NameNode in order to get the list of all the
 files
  in the cluster. For each file I check if it is a local file (one of the
  locations is the host of the node), if it is I read it.
  Disadvantages:
  * This solution works only if the entire file is not split.
  * It involves the NameNode.
  * Each node needs to iterate on all the files in the cluster.
 
  There must be a better way to do it. The perfect way will be to call the
  DataNode and to get a list of the local files and their blocks.
 
  On Mon, Apr 21, 2008 at 7:18 PM, Ted Dunning [EMAIL PROTECTED] wrote:
 
 
  Datanodes don't necessarily contain complete files.  It is possible to
  enumerate all files and to find out which datanodes host different
 blocks
  from these files.
 
  What did you need to do?
 
 
  On 4/21/08 2:11 AM, Shimi K [EMAIL PROTECTED] wrote:
 
  Is there a way to get the list of files on each datanode?
  I need to be able to get all the names of the files on a specific
  datanode?
  is there a way to do it?
 
 




Re: Any API used to get the last modified time of the File in HDFS?

2008-04-21 Thread Shimi K
String fileName = "file.txt";
Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path filePath = new Path(fileName);
FileStatus fileStatus = fs.getFileStatus(filePath);
long modificationTime = fileStatus.getModificationTime();

On Mon, Apr 21, 2008 at 5:35 AM, Samuel Guo [EMAIL PROTECTED] wrote:

 Hi all,

 Can anyone tell me : is there any api I can use to get the metadata info
 such as the last modified time and etc. of a File in hdfs?
 Thanks a lot :)

 Best Wishes:)

 Samuel Guo



Re: Hadoop remembering old mapred.map.tasks

2008-04-21 Thread Otis Gospodnetic
It turns out Hadoop was not remembering anything and the answer is in the FAQ:

http://wiki.apache.org/hadoop/FAQ#13

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
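If I remember that FAQ entry correctly, the capacity shown on the web UI comes from the tasktrackers' slot settings, not from mapred.map.tasks (which is only a per-job hint). A hedged example of the relevant properties, placed inside the <configuration> element of hadoop-site.xml (the exact names may differ slightly between releases, so check hadoop-default.xml; values are illustrative, and tasktrackers must be restarted to pick them up):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

With 4 machines at the default of 2 map slots each, a Map Task Capacity of 8 is exactly what you'd expect.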


- Original Message 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Sunday, April 20, 2008 8:14:43 PM
 Subject: Hadoop remembering old mapred.map.tasks
 
 Hi,
 
 Does Hadoop cache settings set in hadoop-*xml between runs?
 I'm using Hadoop 0.16.2 and have initially set the number of map and reduce 
 tasks to 8 of each.  After running a number of jobs I wanted to increase that 
 number (to 23 maps and 11 reduces), so I changed the mapred.map.tasks and 
 mapred.reduce.tasks properties in hadoop-site.xml.  I then stopped everything 
 (stop-all.sh) and copied my modified hadoop-site.xml to all nodes in the 
 cluster.  I also rebuilt the .job file and pushed that out to all nodes, too.
 
 However, when I start everything up again I *still* see Map Task Capacity is 
 equal to 8, and the same for Reduce Task Capacity.
 Am I supposed to do something in addition to the above to make Hadoop "forget" 
 my old settings?  I can't find *any* references to mapred.map.tasks in any of 
 the Hadoop files except for my hadoop-site.xml, so I can't figure out why 
 Hadoop 
 is still stuck on 8.
 
 Although the max capacity is set to 8, when I run my jobs now I *do* see that 
 they get broken up into 23 maps and 11 reduces (it was 8 before), but only 8 
 of 
 them run in parallel.  There are 4 dual-core machines in the cluster for a 
 total 
 of 8 cores.  Is Hadoop able to figure this out and that is why it runs only 8 
 tasks in parallel, despite my higher settings?
 
 Thanks,
 Otis



Re: incremental re-execution

2008-04-21 Thread Shirley Cohen

Hi Ted,

Thanks for your example. It's very interesting to learn about  
specific map reduce applications.


It's non-obvious to me that it's a good idea to combine two map-reduce
pairs by using the cross product of the intermediate states -- you might
wind up building an O(n^2) intermediate data structure instead of two
O(n) ones. Even with parallelism this is not good. I'm wondering if in
your example you're relying on the fact that the viewer-video matrix is
sparse, so many of the pairs will have value 0? Does the map phase emit
intermediate results with 0-values?


Thanks,

Shirley



Take something like what we see in our logs of viewing.  We have several log
entries per view, each of which contains an identifier for the viewer and for
the video.  These events occur on start, on progress and on completion.  We
want to have total views per viewer and total views per video.  You can pass
over the logs twice to get this data or you can pass over the data once to
get total views per (viewer x video).  This last is a semi-aggregated form
that has no utility except that it is much smaller than the original data.
Reducing the semi-aggregated form to viewer counts and video counts results
in shorter total processing than processing the raw data twice.

If you start with a program that has two map-reduce passes over the same
data, it is likely very difficult to intuit that they could use the same
intermediate data.  Even with something like Pig, where you have a good
representation for internal optimizations, it is probably going to be
difficult to convert the two MR steps into one pre-aggregation and two final
aggregations.


On 4/20/08 7:39 AM, Shirley Cohen [EMAIL PROTECTED] wrote:


Hi Ted,

I'm confused about your second comment below: in the case where
semi-aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard to do?


Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:



I re-use outputs of MR programs pretty often, but when I need to reuse the
map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.

It is common to have a semi-aggregated form that is much smaller than the
original data, which in turn can be used to produce multiple low-definition
aggregates.  I would find it very surprising if you could detect these
sorts of situations.

On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote:


Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley










Re: incremental re-execution

2008-04-21 Thread Ted Dunning

It isn't that bad.  Remember the input is sparse.

Actually, the size of the original data provides a sharp bound on the size
of the semi-aggregated data.  In practice, the semi-aggregated data will
have 1/k as many records as the original data where k is the average count.
The records in the semi-aggregated result are also typically considerably
smaller than the original records.  Even in long tail environments, k is
typically 5-20 and each record is often 2-4x smaller than the originals.

This means that there is typically an order of magnitude saving in
traversing the semi-aggregated result.

The other major approach would be to emit typed records for each full level
aggregation that you intend to do.  This gives you all of your aggregates
together in a single output, but you can jigger the partition function so
that you have a single reduce output file per aggregate type.  This approach
isn't quite as simple as the semi-aggregate approach and doesn't allow
retrospective ad hoc scans, but it is slightly to considerably faster than
the semi-agg approach, especially if you are aggregating on more than two
keys.
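
As a hedged illustration of that second approach, here is a minimal sketch using the mapred API of the time; the log layout, class names and the two-reducer assumption are mine, not Ted's actual code (the classes are shown in one listing for brevity):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One pass over the view log emits a typed count record for each aggregate.
public class TypedCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    // Assumed log format: viewerId <tab> videoId <tab> event
    String[] fields = line.toString().split("\t");
    out.collect(new Text("viewer:" + fields[0]), ONE);
    out.collect(new Text("video:" + fields[1]), ONE);
  }
}

// Sums counts; the same reducer works for both record types.
class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new LongWritable(sum));
  }
}

// Routes each record type to its own reducer, so each aggregate lands in
// its own part-* file when the job is run with two reduce tasks.
class TypePartitioner implements Partitioner<Text, LongWritable> {
  public void configure(JobConf conf) {
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    return (key.toString().startsWith("viewer:") ? 0 : 1) % numPartitions;
  }
}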


On 4/21/08 1:06 PM, Shirley Cohen [EMAIL PROTECTED] wrote:

 Hi Ted,
 
 Thanks for your example. It's very interesting to learn about
 specific map reduce applications.
 
 It's non-obvious to me that it's a good idea to combine two map-
 reduce pairs by using the cross product of the intermediate states-
 you might wind up building an O(n^2) intermediate data structure
 instead of two O(n) ones. Even with parallelism this is not good. I'm
 wondering if in your example you're relying on the fact that the
 viewer-video matrix is sparse, so many of the pairs will have value
 0? Does the map phase emit intermediate results with 0-values?
 
 Thanks,
 
 Shirley
 
 
 Take something like what we see in our logs of viewing.  We have
 several log
 entries per view each of which contains an identifier for the
 viewer and for
 the video.  These events occur on start, on progress and on
 completion.  We
 want to have total views per viewer and total views per video.  You
 can pass
 over the logs twice to get this data or you can pass over the data
 once to
 get total views per (viewer x video).  This last is a semi-
 aggregated form
 that has no utility except that it is much smaller than the
 original data.
 Reducing the semi-aggregated form to viewer counts and video counts
 results
 in shorter total processing than processing the raw data twice.
 
 If you start with a program that has two map-reduce passes over the
 same
 data, it is likely very difficult to intuit that they could use the
 same
 intermediate data.  Even with something like Pig, where you have a
 good
 representation for internal optimizations, it is probably going to be
 difficult to convert the two MR steps into one pre-aggregation and
 two final
 aggregations.
 
 
 On 4/20/08 7:39 AM, Shirley Cohen [EMAIL PROTECTED] wrote:
 
 Hi Ted,
 
 I'm confused about your second comment below: in the case where semi-
 aggregated data is used to produce multiple low-level aggregates,
 what sorts of detection did you have in mind which would be hard
 to do?
 
 Thanks,
 
 Shirley
 
 On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:
 
 
 I re-use outputs of MR programs pretty often, but when I need to
 reuse the
 map output, I just manually break the process apart into a
 map+identity-reducer and the multiple reducers.  This is rare.
 
 It is common to have a semi-aggregated form that is much smaller than
 the
 original data which in turn can be used to produce multiple low
 definition
 aggregates.  I would find it very surprising if  you could detect
 these
 sorts of situations.
 
 
 On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote:
 
 Dear Hadoop Users,
 
 I'm writing to find out what you think about being able to
 incrementally re-execute a map reduce job. My understanding is that
 the current framework doesn't support it and I'd like to know
 whether, in your opinion, having this capability could help to
 speed
 up development and debugging.
 
 My specific questions are:
 
 1) Do you have to re-run a job often enough that it would be
 valuable
 to incrementally re-run it?
 
 2) Would it be helpful to save the output from a whole bunch of
 mappers and then try to detect whether this output can be re-used
 when a new job is launched?
 
 3) Would it be helpful to be able to use the output from a map
 job on
 many reducers?
 
 Please let me know what your thoughts are and what specific
 applications you are working on.
 
 Much appreciation,
 
 Shirley
 
 
 
 



Re: Using ArrayWritable of type IntWritable

2008-04-21 Thread Doug Cutting

CloudyEye wrote:

What else do I have to override in ArrayWritable  to get the IntWritable
values written to the output files by the reducers?


public String toString();

Doug
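
A minimal sketch of that override, assuming space-separated values are an acceptable output format (the formatting choice here is illustrative):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class);
  }

  // Render the wrapped IntWritables as space-separated text so that
  // TextOutputFormat writes readable values instead of the default
  // Object.toString() class@hashcode form seen above.
  public String toString() {
    StringBuilder sb = new StringBuilder();
    for (Writable w : get()) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(((IntWritable) w).get());
    }
    return sb.toString();
  }
}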


Re: jar files on NFS instead of DistributedCache

2008-04-21 Thread Ted Dunning


I agree with the fair and balanced part.  I always try to keep my clusters
fair and balanced!

Joydeep should mention his background.  In any case, I agree that high-end
filers may provide good enough NFS service, but I would also contend that
HDFS has been better for me than NFS from generic servers.


On 4/21/08 2:12 PM, Joydeep Sen Sarma [EMAIL PROTECTED] wrote:

 This is not to say that this is the right solution for large clusters or those
 trying to run NFS servers on Linux (which, last I heard, has a notoriously bad
 NFS server). (Perhaps OpenSolaris is a better option.)
 
 'fair-and-balanced' :-)



Re: jar files on NFS instead of DistributedCache

2008-04-21 Thread Allen Wittenauer
On 4/21/08 2:18 PM, Ted Dunning [EMAIL PROTECTED] wrote:
 I agree with the fair and balanced part.  I always try to keep my clusters
 fair and balanced!
 
 Joydeep should mention his background.  In any case, I agree that high-end
 filers may provide good enough NFS service, but I would also contend that
 HDFS has been better for me than NFS from generic servers.

We take a mixed approach to the NFS problem.

For grids that have some sort of service level agreement associated with
it, we do not allow NFS connections.  The jobs must be reasonably self
contained.

For other grids (research, development, etc), we do allow NFS
connections and hope that people don't do stupid things.

    It is probably worth pointing out that it is much easier for a user to
do stupid things with, say, 500 nodes than with 5. So we take a much more
conservative view for grids we care about.

    As Joydeep said, the implementation of the stack does make a huge
difference.  NetApp and Sun are leaps and bounds better than most.  In the
case of Linux, it has made great strides forward, but I'd be leery of using it
for the sorts of workloads we have.



Run DfsShell command after your job is complete?

2008-04-21 Thread Kayla Jay
Hello -

Is there any way to run a DfsShell command after your job is complete, within 
that same job run/main class?

I.e., after you're done with the maps and the reduces, I want to move data directly 
out of HDFS into the local file system to load it into a database.

Can you run a DfsShell within your job class, or within the Java class that 
runs the job?

thanks



Re: Run DfsShell command after your job is complete?

2008-04-21 Thread lohit
Yes, FsShell.java implements most of the shell commands. You could also use the 
FileSystem API:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html
Simple example: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

Thanks,
Lohit
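
A minimal sketch of the pattern, assuming the classic JobClient.runJob driver style; the paths and class name are illustrative:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RunAndFetch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RunAndFetch.class);
    // ... set input/output paths, mapper, reducer, formats here ...
    JobClient.runJob(conf);               // blocks until the job completes

    // Pull the job output down to the local file system, then load it
    // into the database from there.
    FileSystem fs = FileSystem.get(conf);
    Path hdfsOut = new Path("/user/kayla/job-output");    // illustrative
    Path localDir = new Path("/tmp/job-output");          // illustrative
    fs.copyToLocalFile(hdfsOut, localDir);
  }
}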

- Original Message 
From: Kayla Jay [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Monday, April 21, 2008 7:40:52 PM
Subject: Run DfsShell command after your job is complete?

Hello -

Is there any way to run a DfsShell command after your job is complete within 
that same job run/main class?

I.e after you're done with the maps and the reduces i want to directly move out 
of hdfs into local file system to load data into a database.

can you run a DfsShell within your job class or within your java class that 
runs the job?

thanks





Re: How to instruct Job Tracker to use certain hosts only

2008-04-21 Thread Owen O'Malley


On Apr 18, 2008, at 1:52 PM, Htin Hlaing wrote:

I would like the first job to run on all the compute hosts in the
cluster (which is the default), and then I would like to run the
second job on only a subset of the hosts (due to a licensing issue).


One option would be to set mapred.map.max.attempts and  
mapred.reduce.max.attempts to larger numbers and have the map or  
reduce fail if it is run on a bad node. When the task re-runs, it  
will run on a different node. Eventually it will find a valid node.


-- Owen
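
A minimal sketch of that trick, with the licensed host list and class name purely illustrative (this is not Owen's actual code); a mapper or reducer for the restricted job would extend this base class so the check runs before any records are processed:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LicensedHostTaskBase extends MapReduceBase {
  private static final Set<String> LICENSED_HOSTS =
      new HashSet<String>(Arrays.asList("node07", "node08"));   // illustrative

  public void configure(JobConf conf) {
    try {
      String host = InetAddress.getLocalHost().getHostName();
      if (!LICENSED_HOSTS.contains(host)) {
        // Failing here kills this task attempt; with a large
        // mapred.map.max.attempts the retry eventually lands on a
        // host that holds the license.
        throw new RuntimeException("no license available on host " + host);
      }
    } catch (UnknownHostException e) {
      throw new RuntimeException(e);
    }
  }
}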