lots of small jobs

2008-10-26 Thread Shirley Cohen

Hi,

I have lots of small jobs and would like to compute the aggregate  
running time of all the mappers and reducers in my job history rather  
than tally the numbers by hand through the web interface. I know that  
the Reporter object can be used to output performance numbers for a  
single job, but is there a mechanism to do so across multiple jobs?


Thank you,

Shirley
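
[Editor's sketch] One way to script this, rather than tallying the web UI by hand, is to ask the JobTracker for per-task reports and sum their durations. The sketch below is only an illustration against the old org.apache.hadoop.mapred client API: it assumes the JobTracker still remembers the jobs in question and that your Hadoop version's TaskReport exposes getStartTime()/getFinishTime(), neither of which is guaranteed on every release.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.TaskReport;

// Hedged sketch: sum task running time across every job the JobTracker reports.
// Assumes TaskReport.getStartTime()/getFinishTime() exist in your Hadoop version.
public class AggregateTaskTime {
  public static void main(String[] args) throws Exception {
    JobClient jc = new JobClient(new JobConf());
    long totalMillis = 0;
    for (JobStatus status : jc.getAllJobs()) {
      String jobId = status.getJobId();
      for (TaskReport r : jc.getMapTaskReports(jobId)) {
        totalMillis += r.getFinishTime() - r.getStartTime();
      }
      for (TaskReport r : jc.getReduceTaskReports(jobId)) {
        totalMillis += r.getFinishTime() - r.getStartTime();
      }
    }
    System.out.println("Aggregate map+reduce task time: " + totalMillis + " ms");
  }
}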




dfs i/o stats

2008-09-29 Thread Shirley Cohen

Hi,

I would like to measure the disk i/o performance of our hadoop  
cluster. However, running iostat on 16 nodes is rather cumbersome.  
Does dfs keep track of any stats like the number of blocks or bytes  
read and written? From scanning the api, I found a class called  
org.apache.hadoop.fs.FileSystem.Statistics that could be relevant.  
Does anyone know if this is what I'm looking for?


Thanks,

Shirley


Re: dfs i/o stats

2008-09-29 Thread Shirley Cohen

Great!

Thanks very much,

Shirley

On Sep 29, 2008, at 7:37 PM, Elia Mazzawi wrote:


You can see those stats for each job in the job tracker web interface:
http://yourdfsmaster.com:50030/jobtracker.jsp
Click on the job link to get the stats.

Since it's in the web interface, there is probably a command to get it.

Shirley Cohen wrote:

Hi,

I would like to measure the disk i/o performance of our hadoop  
cluster. However, running iostat on 16 nodes is rather cumbersome.  
Does dfs keep track of any stats like the number of blocks or  
bytes read and written? From scanning the api, I found a class  
called org.apache.hadoop.fs.FileSystem.Statistics that could be  
relevant. Does anyone know if this is what I'm looking for?


Thanks,

Shirley




Re: dfs i/o stats

2008-09-29 Thread Shirley Cohen

Thanks for the helpful pointers!

Shirley

On Sep 29, 2008, at 8:02 PM, Konstantin Shvachko wrote:


We use TestDFSIO for measuring IO performance on our clusters.
It is called a test, but in fact it's a benchmark.
It runs a map-reduce job, which either writes to or reads from files
and collects statistics.

Another thing is that Hadoop automatically collects metrics,
like the number of creates, deletes, ls's, etc.
Here are some links:
http://wiki.apache.org/hadoop/GangliaMetrics
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNodeMetrics.html
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/FSNamesystemMetrics.html


Hope this is helpful.
--Konstantin

Shirley Cohen wrote:

Hi,
I would like to measure the disk i/o performance of our hadoop  
cluster. However, running iostat on 16 nodes is rather cumbersome.  
Does dfs keep track of any stats like the number of blocks or  
bytes read and written?  From scanning the api, I found a class  
called org.apache.hadoop.fs.FileSystem.Statistics that could be  
relevant. Does anyone know if this is what I'm looking for?

Thanks,
Shirley




job details

2008-09-26 Thread Shirley Cohen

Hi,

I'm trying to figure out which log files are used by the job  
tracker's web interface to display the following information:


Job Name: my job
Job File: hdfs://localhost:9000/tmp/hadoop-scohen/mapred/system/job_200809260816_0001/job.xml

Status: Succeeded
Started at: Fri Sep 26 08:18:04 CDT 2008
Finished at: Fri Sep 26 08:18:25 CDT 2008
Finished in: 20sec

What I would like to do is back up the log files that are needed to  
display this information so that we can look at it later if the need  
arises.  When I copy everything from the hadoop home/logs directory  
into another hadoop home/logs directory, the jobs show up in the  
history page. However, all I see are the job name and starting time,  
but not the completion time or the length of the job.


Does anyone have any suggestions?

Thanks,

Shirley


Re: output multiple values?

2008-09-10 Thread Shirley Cohen
Thanks Owen! I found the bug in my code: Doing collect twice does  
work now :))


Shirley

On Sep 9, 2008, at 4:19 PM, Owen O'Malley wrote:



On Sep 9, 2008, at 12:20 PM, Shirley Cohen wrote:

I have a simple reducer that computes the average by doing a sum/ 
count. But I want to output both the average and the count for a  
given key, not just the average. Is it possible to output both  
values from the same invocation of the reducer? Or do I need two  
reducer invocations? If I try to call output.collect() twice from  
the reducer and label the key with type="avg" or type="count", I  
get a bunch of garbage out. Please let me know if you have any  
suggestions.


I'd be tempted to define a type like:

class AverageAndCount implements Writable {
  private long sum;
  private long count;
  ...
  public String toString() {
    return "avg = " + (sum / (double) count) + ", count = " + count;
  }
}

Then you could use your reducer as both a combiner and reducer and  
you would get both values out if you use TextOutputFormat. That  
said, it should absolutely work to do collect twice.


-- Owen
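
[Editor's sketch] The "..." in Owen's snippet elides the Writable plumbing. A minimal completed version, assuming nothing beyond what his mail describes (the field names, serialization order, and the add() helper are illustrative guesses), might look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hedged completion of the AverageAndCount sketch above.
public class AverageAndCount implements Writable {
  private long sum;
  private long count;

  public AverageAndCount() {}                    // no-arg constructor required by the Writable machinery

  public AverageAndCount(long sum, long count) {
    this.sum = sum;
    this.count = count;
  }

  public void add(AverageAndCount other) {       // lets the same class serve as combiner state
    sum += other.sum;
    count += other.count;
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(sum);
    out.writeLong(count);
  }

  public void readFields(DataInput in) throws IOException {
    sum = in.readLong();
    count = in.readLong();
  }

  public String toString() {
    return "avg = " + (sum / (double) count) + ", count = " + count;
  }
}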




output multiple values?

2008-09-09 Thread Shirley Cohen
I have a simple reducer that computes the average by doing a sum/ 
count. But I want to output both the average and the count for a  
given key, not just the average. Is it possible to output both values  
from the same invocation of the reducer? Or do I need two reducer  
invocations? If I try to call output.collect() twice from the reducer  
and label the key with type="avg" or type="count", I get a bunch of  
garbage out. Please let me know if you have any suggestions.


Thanks,

Shirley


no output from job run on cluster

2008-09-04 Thread Shirley Cohen

Hi,

I'm running on hadoop-0.18.0. I have a m-r job that executes  
correctly in standalone mode. However, when run on a cluster, the  
same job produces zero output. It is very bizarre. I looked in the  
logs and couldn't find anything unusual. All I see are the usual  
deprecated filesystem name warnings. Has this ever happened to  
anyone? Do you have any suggestions on how I might go about  
diagnosing the problem?


Thanks,

Shirley


Re: no output from job run on cluster

2008-09-04 Thread Shirley Cohen

Hi Dmitry,

Thanks for your suggestion. I checked and the other systems on the  
cluster do seem to have java installed. I was also able to run the  
job in single mode on the cluster. However, as soon as I add the  
other 15 nodes to the slaves file and re-run the job, the problem  
appears (i.e. there is zero output).


I guess I was going to wait to see if anyone else has seen this  
problem before submitting a bug report.


Shirley

On Sep 4, 2008, at 1:37 PM, Dmitry Pushkarev wrote:


Hi,

I'd check the java version installed; that was the problem in my case, and
surprisingly there was no output from hadoop. If it helps - can you submit a bug
request? :)

-Original Message-
From: Shirley Cohen [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2008 10:07 AM
To: core-user@hadoop.apache.org
Subject: no output from job run on cluster

Hi,

I'm running on hadoop-0.18.0. I have a m-r job that executes
correctly in standalone mode. However, when run on a cluster, the
same job produces zero output. It is very bizarre. I looked in the
logs and couldn't find anything unusual. All I see are the usual
deprecated filesystem name warnings. Has this ever happened to
anyone? Do you have any suggestions on how I might go about
diagnosing the problem?

Thanks,

Shirley





Output directory already exists

2008-09-02 Thread Shirley Cohen

Hi,

I'm trying to write the output of two different map-reduce jobs into  
the same output directory. I'm using MultipleOutputFormats to set the  
filename dynamically, so there is no filename collision between the  
two jobs. However, I'm getting the error "output directory already exists".


Does the framework support this functionality? It seems silly to have  
to create a temp directory to store the output files from the second  
job and then have to copy them to the first job's output directory  
after the second job completes.


Thanks,

Shirley



Re: MultipleOutputFormat versus MultipleOutputs

2008-08-29 Thread Shirley Cohen

Thanks, Benjamin. Your example saved me a lot of time :))

Shirley

On Aug 28, 2008, at 8:03 AM, Benjamin Gufler wrote:


Hi Shirley,

On 2008-08-28 14:32, Shirley Cohen wrote:

Do you have an example that shows how to use MultipleOutputFormat?


Using MultipleOutputFormat is actually pretty easy. Derive a class from
it, overriding - if you want to base the destination file name on the
key and/or value - the method generateFileNameForKeyValue. I'm using
it this way:

protected String generateFileNameForKeyValue(K key, V value,
                                              String name) {
    return name + "-" + key.toString();
}

Take care not to generate too many different file names, however:
all the files are kept open until the Reducer terminates, and operating
systems usually impose a limit on the number of open files you can have.

Also, if you haven't done so yet, please upgrade to the latest release,
0.18, if you want to use MultipleOutputFormat. Up to 0.17.2, there was
some trouble with Reducers having more than one output file (see
HADOOP-3639 for the details).

Benjamin
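
[Editor's sketch] For a self-contained version of what Benjamin describes, here is a hedged sketch against the old-API org.apache.hadoop.mapred.lib.MultipleTextOutputFormat; the class name and the Text/Text key-value types are assumptions, not from his mail:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to a file named after its key, e.g. part-00000-<key>.
public class KeyNamedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    return name + "-" + key.toString();
  }
}

It would be wired in with conf.setOutputFormat(KeyNamedOutputFormat.class), and Benjamin's caution about the number of distinct file names applies unchanged.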




Re: MultipleOutputFormat versus MultipleOutputs

2008-08-28 Thread Shirley Cohen

Thanks, Alejandro!

Do you have an example that shows how to use MultipleOutputFormat?

Shirley

On Aug 28, 2008, at 2:41 AM, Alejandro Abdelnur wrote:


Besides the usage pattern, the key differences are:

Different MultipleOutputs outputs can have different outputformats and
key/value classes. With MultipleOutputFormat you can't.

(and if I'm not mistaken) If using MultipleOutputFormat in a map you
can't have a reduce phase. With MultipleOutputs you can.

A
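
[Editor's sketch] To make the comparison concrete, here is a hedged sketch of the MultipleOutputs side in the old mapred API (MultipleOutputs only shipped around the 0.19 line; the named output "summary" and the key/value types are illustrative assumptions):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Writes the normal job output plus one extra record per key to a named output.
public class SummaryReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private MultipleOutputs mos;

  public void configure(JobConf job) {
    mos = new MultipleOutputs(job);
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      out.collect(key, values.next());                                    // normal output
    }
    mos.getCollector("summary", reporter).collect(key, new Text("done")); // extra named output
  }

  public void close() throws IOException {
    mos.close();
  }
}

The named output would be declared in the driver, e.g. MultipleOutputs.addNamedOutput(conf, "summary", TextOutputFormat.class, Text.class, Text.class).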

On Thu, Aug 28, 2008 at 3:36 AM, Shirley Cohen  
[EMAIL PROTECTED] wrote:

Hi,

I would like the reducer to output to different files based upon the value
of the key. I understand that both MultipleOutputs and MultipleOutputFormat
can do this. Is that correct? However, I don't understand the differences
between these two classes. Can someone explain the differences and provide
an example to illustrate these differences? I found a snippet of code on how
to use MultipleOutputs in the documentation, but could not find an example
for using MultipleOutputFormat.

Thanks in advance,

Shirley







MultipleOutputFormat versus MultipleOutputs

2008-08-27 Thread Shirley Cohen

Hi,

I would like the reducer to output to different files based upon the  
value of the key. I understand that both MultipleOutputs and  
MultipleOutputFormat can do this. Is that correct? However, I don't  
understand the differences between these two classes. Can someone  
explain the differences and provide an example to illustrate these  
differences? I found a snippet of code on how to use MultipleOutputs  
in the documentation, but could not find an example for using  
MultipleOutputFormat.


Thanks in advance,

Shirley




distinct count

2008-08-26 Thread Shirley Cohen

Hi,

What is the best way to do a distinct count in m-r? Is there any way  
of doing it with one reduce instead of two?


Thanks,

Shirley
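
[Editor's sketch] No answer to this one appears in the archive, but one common single-pass approach is sketched below purely as an illustration: the mapper emits each value being counted as a key with a NullWritable value, so this reducer sees each distinct value exactly once and its output record count is the distinct count. All class names here are made up.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits one record per distinct value, so the job's "Reduce output records"
// counter gives the distinct count in a single map-reduce pass.
public class DistinctReducer extends MapReduceBase
    implements Reducer<Text, NullWritable, Text, NullWritable> {
  public void reduce(Text key, Iterator<NullWritable> values,
                     OutputCollector<Text, NullWritable> out, Reporter reporter)
      throws IOException {
    out.collect(key, NullWritable.get());   // one line per distinct value
  }
}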


data partitioning question

2008-08-04 Thread Shirley Cohen

Hi,

I want to implement some data partitioning logic where a mapper is  
assigned a specific range of values. Here is a concrete example of  
what I have in mind:


Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12, 2, 7)

What I want to do is assign mapper x all the tuples where the C  
attribute = 1, 3, 5, and 7.


1-Is it possible to write a smart InputFormat class that can assign a  
set of records to a specific mapper? If so, how?
2-How will this type of partitioning logic interact with HDFS data  
locality?



Thanks,

Shirley



Re: data partitioning question

2008-08-04 Thread Shirley Cohen
Thanks, Qin. It sounds like you're saying that this type of  
partitioning needs its own map-reduce set.


I was hoping it could be done in the InputFormat class :))

Shirley

On Aug 4, 2008, at 2:49 PM, Qin Gao wrote:


For the first question, I think it is better to do it at the reduce stage,
because the partitioner only considers the size of blocks in bytes. Instead,
you can output the intermediate key/value pair like this:

key: 1 if C = 1, 3, 5, or 7; 0 otherwise
value: the tuple

In the reducer you can then have one reducer deal with all the keys where
C = 1, 3, 5, or 7.
c=1,3,5,7.


On Mon, Aug 4, 2008 at 3:29 PM, Shirley Cohen  
[EMAIL PROTECTED] wrote:



Hi,

I want to implement some data partitioning logic where a mapper is assigned
a specific range of values. Here is a concrete example of what I have in
mind:

Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12,  2, 7)

What I want to do is assign mapper x all the tuples where the C attribute =
1, 3, 5, and 7.

1-Is it possible to write a smart InputFormat class that can assign a set
of records to a specific mapper? If so, how?
2-How will this type of partitioning logic interact with HDFS data
locality?


Thanks,

Shirley






Could not find any valid local directory for task

2008-08-03 Thread Shirley Cohen

Hi,

Does anyone know what the following error means?

hadoop-0.16.4/logs/userlogs/task_200808021906_0002_m_14_2]$ cat syslog
2008-08-02 20:28:00,443 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2008-08-02 20:28:00,684 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 15
2008-08-02 20:30:08,594 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for task_200808021906_0002_m_14_2/spill4.out


Please let me know if you need more information about my setup.

Thanks in advance,

Shirley


No locks available error

2008-08-01 Thread Shirley Cohen

Hi,

We're getting the following error when starting up hadoop on the  
cluster:


2008-08-01 14:42:37,334 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = node5.cube.disc.cias.utexas.edu/129.116.113.77
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.16.4
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.16 -r 652614; compiled by 'hadoopqa' on Fri May  2 00:18:12 UTC 2008
/
2008-08-01 14:43:37,572 INFO org.apache.hadoop.dfs.Storage: java.io.IOException: No locks available
    at sun.nio.ch.FileChannelImpl.lock0(Native Method)
    at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:822)
    at java.nio.channels.FileChannel.tryLock(FileChannel.java:967)
    at org.apache.hadoop.dfs.Storage$StorageDirectory.lock(Storage.java:393)
    at org.apache.hadoop.dfs.Storage$StorageDirectory.analyzeStorage(Storage.java:278)
    at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:103)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:236)
    at org.apache.hadoop.dfs.DataNode.init(DataNode.java:162)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2512)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2456)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2477)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:2673)

This error appears on every data node during startup.  We are running  
version 0.16.4 of hadoop and the hadoop dfs is NFS mounted on all the  
nodes in the cluster.


Does anyone know what this error means?

Thanks,

Shirley





iterative map-reduce

2008-07-29 Thread Shirley Cohen

Hi,

I want to call a map-reduce program recursively until some condition  
is met.  How do I do that?


Thanks,

Shirley



Re: iterative map-reduce

2008-07-29 Thread Shirley Cohen
Thanks... would the iterative script be run outside of Hadoop? I was  
actually trying to figure out if the framework could handle iterations.


Shirley

On Jul 29, 2008, at 9:10 AM, Qin Gao wrote:

If you are using java, just create the job configuration again and run it;
otherwise you just need to write an iterative script.
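
[Editor's sketch] As a sketch of the "just create the job configuration again" option, an ordinary Java driver can loop until a convergence test passes; buildJobConf() and converged() below are hypothetical hooks you would supply:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Resubmits the job with a fresh JobConf each pass until the stopping rule fires.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    int iteration = 0;
    boolean done = false;
    while (!done) {
      JobConf conf = buildJobConf(iteration);   // fresh configuration per pass
      RunningJob job = JobClient.runJob(conf);  // blocks until this pass finishes
      done = converged(job, iteration);
      iteration++;
    }
  }

  // Placeholder hooks: set input/output paths, mapper, reducer, etc. per iteration,
  // and decide when to stop (e.g. by inspecting job counters).
  static JobConf buildJobConf(int i) {
    JobConf conf = new JobConf();
    conf.setJobName("iteration-" + i);
    return conf;
  }

  static boolean converged(RunningJob job, int i) {
    return i >= 9;   // hypothetical stopping rule: ten passes
  }
}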

On Tue, Jul 29, 2008 at 9:57 AM, Shirley Cohen  
[EMAIL PROTECTED] wrote:



Hi,

I want to call a map-reduce program recursively until some  
condition is

met.  How do I do that?

Thanks,

Shirley






partitioning the inputs to the mapper

2008-07-27 Thread Shirley Cohen
How do I partition the inputs to the mapper, such that a mapper  
processes an entire file or files? What is happening now is that each  
mapper receives only portions of a file and I want them to receive an  
entire file. Is there a way to do that within the scope of the  
framework?


Thanks,

Shirley 
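
[Editor's sketch] The archive has no reply to this one; one common technique, offered purely as a hedged illustration, is an InputFormat that refuses to split files, so each mapper gets a whole file (old mapred API; the class name is made up):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// One split per file, hence one mapper per file; records still arrive line by line.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

It would be enabled with conf.setInputFormat(WholeFileTextInputFormat.class).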


Re: incremental re-execution

2008-04-21 Thread Shirley Cohen

Hi Ted,

Thanks for your example. It's very interesting to learn about  
specific map reduce applications.


It's non-obvious to me that it's a good idea to combine two map-reduce  
pairs by using the cross product of the intermediate states:  
you might wind up building an O(n^2) intermediate data structure  
instead of two O(n) ones. Even with parallelism this is not good. I'm  
wondering if in your example you're relying on the fact that the  
viewer-video matrix is sparse, so many of the pairs will have value  
0? Does the map phase emit intermediate results with 0-values?


Thanks,

Shirley



Take something like what we see in our logs of viewing.  We have several log
entries per view, each of which contains an identifier for the viewer and for
the video.  These events occur on start, on progress and on completion.  We
want to have total views per viewer and total views per video.  You can pass
over the logs twice to get this data or you can pass over the data once to
get total views per (viewer x video).  This last is a semi-aggregated form
that has no utility except that it is much smaller than the original data.
Reducing the semi-aggregated form to viewer counts and video counts results
in shorter total processing than processing the raw data twice.

If you start with a program that has two map-reduce passes over the same
data, it is likely very difficult to intuit that they could use the same
intermediate data.  Even with something like Pig, where you have a good
representation for internal optimizations, it is probably going to be
difficult to convert the two MR steps into one pre-aggregation and two final
aggregations.


On 4/20/08 7:39 AM, Shirley Cohen [EMAIL PROTECTED] wrote:


Hi Ted,

I'm confused about your second comment below: in the case where semi-
aggregated data is used to produce multiple low-level aggregates,
what sorts of detection did you have in mind which would be hard  
to do?


Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:



I re-use outputs of MR programs pretty often, but when I need to reuse the
map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.

It is common to have a semi-aggregated form that is much smaller than the
original data which in turn can be used to produce multiple low definition
aggregates.  I would find it very surprising if you could detect these
sorts of situations.


On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote:


Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley










Re: incremental re-execution

2008-04-20 Thread Shirley Cohen

Hi Ted,

I'm confused about your second comment below: in the case where semi- 
aggregated data is used to produce multiple low-level aggregates,  
what sorts of detection did you have in mind which would be hard to do?


Thanks,

Shirley

On Apr 16, 2008, at 7:30 PM, Ted Dunning wrote:



I re-use outputs of MR programs pretty often, but when I need to reuse the
map output, I just manually break the process apart into a
map+identity-reducer and the multiple reducers.  This is rare.

It is common to have a semi-aggregated form that is much smaller than the
original data which in turn can be used to produce multiple low definition
aggregates.  I would find it very surprising if you could detect these
sorts of situations.


On 4/16/08 5:26 PM, Shirley Cohen [EMAIL PROTECTED] wrote:


Dear Hadoop Users,

I'm writing to find out what you think about being able to
incrementally re-execute a map reduce job. My understanding is that
the current framework doesn't support it and I'd like to know
whether, in your opinion, having this capability could help to speed
up development and debugging.

My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable
to incrementally re-run it?

2) Would it be helpful to save the output from a whole bunch of
mappers and then try to detect whether this output can be re-used
when a new job is launched?

3) Would it be helpful to be able to use the output from a map job on
many reducers?

Please let me know what your thoughts are and what specific
applications you are working on.

Much appreciation,

Shirley






incremental re-execution

2008-04-16 Thread Shirley Cohen

Dear Hadoop Users,

I'm writing to find out what you think about being able to  
incrementally re-execute a map reduce job. My understanding is that  
the current framework doesn't support it and I'd like to know  
whether, in your opinion, having this capability could help to speed  
up development and debugging.


My specific questions are:

1) Do you have to re-run a job often enough that it would be valuable  
to incrementally re-run it?


2) Would it be helpful to save the output from a whole bunch of  
mappers and then try to detect whether this output can be re-used  
when a new job is launched?


3) Would it be helpful to be able to use the output from a map job on  
many reducers?


Please let me know what your thoughts are and what specific  
applications you are working on.


Much appreciation,

Shirley