Re: Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread Billy Pearson
The only way to do something like this is to get the mappers to use something 
like /dev/shm, which is 100% memory, as their storage folder.
Outside of that, everything is flushed, because the mapper exits when it is done; 
the tasktracker is the one that delivers the output to the reduce tasks.
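
As a rough sketch (just an illustration, not something I have tested at scale), 
you could point the tasktrackers' local directory at a tmpfs mount in 
hadoop-site.xml, assuming /dev/shm is big enough to hold the map spills and 
on-disk merges:

<property>
  <name>mapred.local.dir</name>
  <!-- assumes /dev/shm has enough room for map spills and merges -->
  <value>/dev/shm/mapred/local</value>
</property>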


Billy



"paula_ta"  wrote in 
message news:23617347.p...@talk.nabble.com...


Is it possible that some intermediate data produced by mappers and written to
the local file system resides in memory in the file system cache and is
never flushed to disk? Eventually reducers will retrieve this data via
HTTP - possibly without the data ever being written to disk?

thanks
Paula

--
View this message in context: 
http://www.nabble.com/Is-intermediate-data-produced-by-mappers-always-flushed-to-disk---tp23617347p23617347.html

Sent from the Hadoop core-user mailing list archive at Nabble.com.







Re: Hadoop & Python

2009-05-19 Thread Billy Pearson
I have used streaming and PHP before to process a data set of about 1TB 
without any problems at all.
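
For anyone curious what that looks like, here is a minimal sketch of a PHP 
streaming mapper (a word-count style example for illustration, not the actual 
job I ran). Streaming feeds lines on stdin and expects tab-separated key/value 
pairs on stdout:

#!/usr/bin/php
<?php
// Minimal streaming mapper: read lines from stdin,
// emit tab-separated key/value pairs on stdout.
while (($line = fgets(STDIN)) !== false) {
    foreach (preg_split('/\s+/', trim($line)) as $word) {
        if ($word !== '') {
            echo $word, "\t", 1, "\n";
        }
    }
}
?>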


Billy


"s d"  wrote in message 
news:24b53fa00905191035w41b115c1q94502ee82be43...@mail.gmail.com...

Thanks.
So in the overall scheme of things, what is the general feeling about using
python for this? I like the ease of deploying and reading python compared
with Java but want to make sure using python over hadoop is scalable & is
standard practice and not something done only for prototyping and small
scale tests.


On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard 
 wrote:


Streaming is slightly slower than native Java jobs.  Otherwise Python works
great in streaming.

Alex

On Tue, May 19, 2009 at 8:36 AM, s d 
 wrote:


> Hi,
> How robust is using hadoop with python over the streaming protocol? Any
> disadvantages (performance? flexibility?)? It just strikes me that python
> is so much more convenient when it comes to deploying and crunching text
> files.
> Thanks,
>








Re: Regarding Capacity Scheduler

2009-05-14 Thread Billy Pearson



I am seeing the same problem I posted on the list on the 11th and have not 
had any reply.


Billy


- Original Message - 
From: "Manish Katyal" 


Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: 
Sent: Wednesday, May 13, 2009 11:48 AM
Subject: Regarding Capacity Scheduler



I'm experimenting with the Capacity scheduler (0.19.0) in a multi-cluster
environment.
I noticed that unlike the mappers, the reducers are not pre-empted?

I have two queues (high and low) that are each running big jobs (70+ maps
each).  The scheduler splits the mappers as per the queue
guaranteed-capacity (5/8ths for the high and the rest for the low). However,
the reduce jobs are not interleaved -- the reduce job in the high queue is
blocked waiting for the reduce job in the low queue to complete.

Is this a bug or by design?

*Low queue:*
Guaranteed Capacity (%) : 37.5
Guaranteed Capacity Maps : 3
Guaranteed Capacity Reduces : *3*
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 3
Number of Running Reduces : *7*
Number of Waiting Maps : 131
Number of Waiting Reduces : 0
Priority Supported : NO

*High queue:*
Guaranteed Capacity (%) : 62.5
Guaranteed Capacity Maps : 5
Guaranteed Capacity Reduces : 5
User Limit : 100
Reclaim Time limit : 300
Number of Running Maps : 4
Number of Running Reduces : *0*
Number of Waiting Maps : 68
Number of Waiting Reduces : *7*
Priority Supported : NO






Capacity Scheduler?

2009-05-11 Thread Billy Pearson
Does the Capacity Scheduler not reclaim reduce tasks within the setting 
mapred.capacity-scheduler.queue.{name}.reclaim-time-limit?
In my test it only reclaims map tasks when a queue cannot get its full Guaranteed 
Capacity.
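
For reference, these queue settings go in conf/capacity-scheduler.xml, something 
like the following (the queue name "low" and the values are just for 
illustration; adjust for your install):

<property>
  <name>mapred.capacity-scheduler.queue.low.guaranteed-capacity</name>
  <value>37.5</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.low.reclaim-time-limit</name>
  <!-- seconds to wait before reclaiming capacity lent to other queues -->
  <value>300</value>
</property>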


Billy





Re: How to do load control of MapReduce

2009-05-11 Thread Billy Pearson
You might try setting the tasktrackers' Linux nice level to, say, 5 or 10, 
leaving the DFS and HBase daemons at 0.
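
One way to do that is via hadoop-env.sh on the worker nodes; hadoop-daemon.sh 
starts daemons under "nice -n $HADOOP_NICENESS" (at least in the versions I have 
looked at), so a sketch would be:

# in conf/hadoop-env.sh used to start the tasktracker
# (this applies to any daemon started with this environment,
#  so start the datanode/hbase with it unset or set to 0)
export HADOOP_NICENESS=10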


Billy
"zsongbo"  wrote in message 
news:fa03480d0905110549j7f09be13qd434ca41c9f84...@mail.gmail.com...

Hi all,
Now, if we have a large dataset to process with MapReduce, the MapReduce job
will take as many machine resources as possible.

So when one such big MapReduce job is running, the cluster becomes very busy
and almost cannot do anything else.

For example, we have an HDFS+MapReduce+HBase cluster.
There is a large dataset in HDFS to be processed by MapReduce periodically,
and the workload is CPU and I/O heavy. The cluster also provides other
services for queries (querying HBase and reading files in HDFS). So, when the
job is running, the query latency becomes very long.

Since the MapReduce job is not time sensitive, I want to control the load of
MapReduce. Do you have some advice?

Thanks in advance.
Schubert






Re: Logging in Hadoop Stream jobs

2009-05-09 Thread Billy Pearson
When I was looking to capture debugging data about my scripts, I would just 
write to the stderr stream in PHP, like:

fwrite(STDERR, "message you want here\n");

Then it gets captured in the task logs when you view the details of each task.
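
If your streaming version supports it, specially formatted stderr lines are also 
picked up as status and counter updates; a PHP sketch (the counter group and 
name here are made-up examples):

<?php
// Plain debug line: shows up in the task's stderr log in the web UI.
fwrite(STDERR, "debug: starting map task\n");

// Lines in these formats are parsed by streaming as status/counter updates:
fwrite(STDERR, "reporter:status:parsing input\n");
fwrite(STDERR, "reporter:counter:MyJob,BadRecords,1\n");
?>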

Billy


"Mayuran Yogarajah" 
 wrote in message 
news:4a049154.6070...@casalemedia.com...

How do people handle logging in a Hadoop stream job?

I'm currently looking at using syslog for this but would like to know of other
ways people are doing this currently.

thanks






Re: Sequence of Streaming Jobs

2009-05-02 Thread Billy Pearson
In PHP I run exec() with the job commands, and it has a parameter that 
stores the exit status code.
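
Roughly like this (a sketch; the jar path and job arguments are placeholders):

<?php
// Run a sequence of streaming jobs, stopping if one fails.
$jobs = array(
    "hadoop jar hadoop-streaming.jar -input /data/in    -output /data/step1 " .
        "-mapper map1.php -reducer reduce1.php",
    "hadoop jar hadoop-streaming.jar -input /data/step1 -output /data/step2 " .
        "-mapper map2.php -reducer reduce2.php",
);

foreach ($jobs as $cmd) {
    $output = array();              // exec() appends to $output, so reset it
    exec($cmd, $output, $status);   // $status holds the command's exit code
    if ($status != 0) {
        fwrite(STDERR, "job failed ($status): $cmd\n");
        exit(1);                    // do not start the next job
    }
}
?>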


Billy




"Mayuran Yogarajah" 
 wrote in message 
news:49fc975a.3030...@casalemedia.com...

Billy Pearson wrote:
I did this with an array of commands for the jobs in a PHP script, checking
the return of each job to tell if it failed or not.

Billy


I have this same issue. How do you check if a job failed or not? You 
mentioned checking the return code? How are you doing that?

thanks
"Dan Milstein"  wrote 
in
message 
news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com...



If I've got a sequence of streaming jobs, each of which depends on the
output of the previous one, is there a good way to launch that sequence?

Meaning, I want step B to only start once step A has finished.

From within Java JobClient code, I can do submitJob/runJob, but is there
any sort of clean way to do this for a sequence of streaming jobs?

Thanks,
-Dan Milstein














Re: Sequence of Streaming Jobs

2009-05-02 Thread Billy Pearson
I did this with an array of commands for the jobs in a PHP script, checking 
the return of each job to tell if it failed or not.


Billy

"Dan Milstein"  wrote in 
message news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com...
If I've got a sequence of streaming jobs, each of which depends on the 
output of the previous one, is there a good way to launch that sequence? 
Meaning, I want step B to only start once step A has finished.


From within Java JobClient code, I can do submitJob/runJob, but is  there 
any sort of clean way to do this for a sequence of streaming jobs?


Thanks,
-Dan Milstein






Re: How to run many jobs at the same time?

2009-04-21 Thread Billy Pearson
The only way I know of is to try using different scheduler queues for each 
group.
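
If I remember right, each job then picks its queue with the 
mapred.job.queue.name property, for example (the queue name is just an 
illustration):

<property>
  <name>mapred.job.queue.name</name>
  <value>groupA</value>
</property>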


Billy

"nguyenhuynh.mr"  
wrote in message news:49ee6e56.7080...@gmail.com...

Tom White wrote:


You need to start each JobControl in its own thread so they can run
concurrently. Something like:

Thread t = new Thread(jobControl);
t.start();

Then poll the jobControl.allFinished() method.

Tom

On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr
 wrote:


Hi all!


I have some jobs: job1, job2, job3, ... Each job works with a group. To
control the jobs, I have JobControllers; each JobController controls the
jobs of its specified group.


Example:

- Have 2 Group: g1 and g2

-> 2 JobController: jController1, jcontroller2

 + jController1 contains jobs: job1, job2, job3, ...

 + jController2 contains jobs: job1, job2, job3, ...


* To run the jobs, I use:

for (i=0; i<2; i++){

   jCtrl[i]= new jController(group i);

   jCtrl[i].run();

}


* I want jController1 and jController2 to run in parallel. But actually, when
jController1 finishes, jController2 begins to run.


Why?

Please help me!


* P/s: jController use org.apache.hadoop.mapred.jobcontrol.JobControl


Thanks,


cheer,

Nguyen.







Thanks for your response!

I have used a Thread to start the JobControl, something like:

public class JobController{

   public JobController(String g){
  .
   }

   public void run(){
  Job j1 = new Job(..);
  Job j2 = new Job(..);
  JobControl jc = new JobControl("group1");

  Thread t = new Thread(jc);
  t.start();

  while(! jc.allFinished()){
  // Display state 
  }
   }
}

* To run the code, something like:
JobController[] jController = new JobController[2];
for (int i=0; i<2; i++){
   jController[i] = new JobController(group[i]);
   jController[i].run();
}

* But they do not run in parallel :( !

Please help me!

Thanks,

Best regards,
Nguyen,







Re: NameNode resilency

2009-04-11 Thread Billy Pearson
Not 100% sure, but I think they plan on using ZooKeeper to help with namenode 
failover; that may have changed, though.


Billy

"Stas Oskin"  wrote in 
message news:77938bc20904110243u7a2baa6dw6d710e4e51ae0...@mail.gmail.com...

Hi.

I wonder, what does the Hadoop community use in order to make the NameNode
resilient to failures?

I mean, what High-Availability measures are taken to keep HDFS available
even in case of NameNode failure?

So far I read a possible solution using DRBD, and another one using carp.
Both of them had the downside of keeping a passive machine aside, taking the
IP of the NameNode.

Perhaps there is a way to keep only a passive NameNode service on another
machine (which does other tasks), taking the IP only when the main has
failed?

That of course until the human operator restores the main node to action.

Regards.






Re: Reduce task attempt retry strategy

2009-04-06 Thread Billy Pearson

I have seen the same thing happening on the 0.19 branch.

When a task fails on the reduce end, it always retries on the same node until 
it kills the job for too many failed tries on one reduce task.

I am running a cluster of 7 nodes.

Billy


"Stefan Will"  wrote in message 
news:c5ff7f91.18c09%stefan.w...@gmx.net...

Hi,

I had a flaky machine the other day that was still accepting jobs and
sending heartbeats, but caused all reduce task attempts to fail. This in
turn caused the whole job to fail because the same reduce task was retried 3
times on that particular machine.

Perhaps I'm confusing this with the block placement strategy in hdfs, but I
always thought that the framework would retry jobs on a different machine if
retries on the original machine keep failing. E.g. I would have expected to
retry once or twice on the same machine, but then switch to a different one
to minimize the likelihood of getting stuck on a bad machine.

What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
improving on this in the future ?

Thanks,
Stefan




Re: hdfs-doubt

2009-03-29 Thread Billy Pearson
Your client doesn't have to be on the namenode; it can be on any system that 
can access the namenode and the datanodes.
Hadoop uses 64MB blocks to store files, so file sizes >= 64MB should be as 
efficient as 128MB or 1GB file sizes.
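
For question 2: any machine with the Hadoop client installed and configured to 
point at the cluster can read the files, for example (host, port, and paths 
here are placeholders):

bin/hadoop fs -get hdfs://namenode-host:9000/images/img001.jpg /tmp/img001.jpg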


more reading and information here:
http://wiki.apache.org/hadoop/HadoopPresentations
http://wiki.apache.org/hadoop/HadoopArticles

Billy

"sree deepya"  wrote in 
message news:258814520903281055u540dc36do71aa26403dd03...@mail.gmail.com...

Hi sir/madam,

I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
cost-effective and scalable storage server. Our main goal for the project is
to be able to store images on a server, and the data can be up to petabytes.
For that we are using HDFS. I am new to Hadoop and am just learning about it.
   Can you please clarify some of the doubts I have.

At present we have configured one datanode and one namenode. The jobtracker is
running on the namenode and the tasktracker on the datanode. The namenode also
acts as the client; that is, we are writing programs on the namenode to store
or retrieve images. My doubts are:

1. Can we put the client and namenode on two separate systems?

2. Can we access the images from the datanodes of the Hadoop cluster from a
machine on which HDFS is not installed?

3. At present we may not have data up to petabytes, but it will be in
gigabytes. Is Hadoop still efficient in storing mega- and gigabytes of data?


Thanking you,

Yours sincerely,
SreeDeepya






Re: Reducer handing at 66%

2009-03-29 Thread Billy Pearson
66% is the start of the reduce function, so it is likely an endless loop there 
burning the CPU cycles.
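
The 600-second kill is the task timeout (mapred.task.timeout, in milliseconds). 
If the reduce genuinely needs to work that long without reporting progress you 
can raise it, but finding the loop is the better fix. For example:

<property>
  <name>mapred.task.timeout</name>
  <!-- milliseconds a task may go without reporting progress; 0 disables -->
  <value>1800000</value>
</property>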



"Amandeep Khurana"  wrote in 
message news:35a22e220903271631i25ff749bx5814348e66ff4...@mail.gmail.com...

I have a MR job running on approximately 15 lines of data in a text
file. The reducer hangs at 66% and at that moment the cpu usage is at 100%.
After 600 seconds, it kills the job... What could be going wrong and where
should I look for the problem?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz






Re: Typical hardware configurations

2009-03-29 Thread Billy Pearson
I run a 10-node cluster with 2 cores at 2.4GHz, 4GB RAM, and dual 250GB drives 
per node.
I run on used 32-bit servers, so I can only give HBase a 2GB heap, but I still 
have memory left for the tasktracker and datanode.
More files in Hadoop = more memory used on the namenode. The HBase master is 
lightly loaded, so I run mine on the same node as the namenode.

My personal opinion is that a large-memory 64-bit machine cannot be fully 
loaded with HBase at this time, but it will give you better performance.
Maybe if you have lots of MR jobs or need the better response times, then it 
would be worth it.
I think there are still some open issues on too many open file handles, etc., 
that can keep a larger server from being used to its full capacity.
Think in terms of Google: they stick with small (cheap to replace) hard drives, 
medium memory, and cost-effective CPUs, but have lots of them.


Billy

"Amandeep Khurana"  wrote in 
message news:35a22e220903272207s30f26310y3ecbec723b83e...@mail.gmail.com...

What are the typical hardware configs for a node that people are using for
Hadoop and HBase? I am setting up a new 10 node cluster which will have
HBase running as well that will be feeding my front end directly. Currently,
I had a 3 node cluster with 2 GB of RAM on the slaves and 4 GB of RAM on the
master. This didn't work very well due to the RAM being a little low.

I got some config details from the powered by page on the Hadoop wiki, but
nothing like that for Hbase.


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz






Re: reduce task failing after 24 hours waiting

2009-03-26 Thread Billy Pearson

mapred.jobtracker.retirejob.interval
is not in the default config.

Should this not be in the config?
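
For reference, setting the two values suggested in the quoted reply below 
explicitly in hadoop-site.xml would look something like this (the values are 
just an example of "higher than 24 hours"):

<property>
  <name>mapred.jobtracker.retirejob.interval</name>
  <!-- milliseconds before completed jobs are retired; the default is 24h -->
  <value>259200000</value>
</property>
<property>
  <name>mapred.userlog.retain.hours</name>
  <!-- hours to keep user task logs; the default is 24 -->
  <value>72</value>
</property>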

Billy



"Amar Kamat"  wrote in 
message news:49caff11.8070...@yahoo-inc.com...

Amar Kamat wrote:

Amareshwari Sriramadasu wrote:

Set mapred.jobtracker.retirejob.interval

This is used to retire completed jobs.

and mapred.userlog.retain.hours to higher value.

This is used to discard user logs.
As Amareshwari pointed out, this might be the cause. Can you increase this 
value and try?

Amar
By default, their values are 24 hours. These might be the reason for 
failure, though I'm not sure.


Thanks
Amareshwari

Billy Pearson wrote:
I am seeing on one of my long-running jobs (about 50-60 hours) that after
24 hours all active reduce tasks fail with the error message

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 min of 24 hours, they all fail at the same time.
It wastes a lot of resources downloading the map outputs and merging them
again.
What is the state of the reducer (copy or sort)? Check 
jobtracker/task-tracker logs to see what is the state of these reducers 
and whether it issued a kill signal. Either jobtracker/tasktracker is 
issuing a kill signal or the reducers are committing suicide. Were there 
any failures on the reducer side while pulling the map output? Also what 
is the nature of the job? How fast the maps finish?

Amar


Billy















reduce task failing after 24 hours waiting

2009-03-25 Thread Billy Pearson
I am seeing on one of my long-running jobs (about 50-60 hours) that after 24 
hours all active reduce tasks fail with the error message

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 min of 24 hours, they all fail at the same time.
It wastes a lot of resources downloading the map outputs and merging them again.

Billy




Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson

I opened an issue here, if you would like to comment on it:
https://issues.apache.org/jira/browse/HADOOP-5539

Billy

"Stefan Will"  wrote in message 
news:c5e7dc6d.1840d%stefan.w...@gmx.net...
I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not any intermediate files produced. The reducer
will decompress the map output files after copying them, and then compress
its own output only after it has finished.

I wonder if this is by design, or just an oversight.

-- Stefan


From: Billy Pearson 


Reply-To: 
Date: Wed, 18 Mar 2009 22:14:07 -0500
To: 
Subject: Re: intermediate results not getting compressed

I can run head on the map.out files and I get compressed garbage, but if I
run head on an intermediate file I can read the data in the file clearly, so
compression is not getting passed on, even though I am setting
CompressMapOutput to true by default in my hadoop-site.xml file.

Billy


"Billy Pearson" 
wrote in message news:gpscu3$66...@ger.gmane.org...

The intermediate.X files are not getting compressed for some reason, not
sure why.
I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it is not
working correctly; the compression is not passing from the map output files
to the on-disk merge of the intermediate.X files.

tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask:
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask:
In-memory merge complete: 3 files left.
2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: 
Keeping

3 segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: 
Merging

1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: 
Merging

0 segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 
1733

sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1306

See, the size of the files is about ~70GB (70359998581); these are
compressed, and at this point it has gone from 1733 files to 1306 left to
merge, yet the intermediate.X files are well over 200GB and we are not even
close to done. If compression were working we should not see tasks failing at
this point in the job because of lack of hard drive space, since as we merge
we delete the merged file from the output folder.

I only see this happening when there are too many files left that did not
get merged during the shuffle stage and it starts on-disk merging.
The tasks that complete the merges and keep it below the io.sort size, in my
case 30, skip the on-disk merge and complete using normal hard drive space.

Anyone care to take a look?
This job takes two or more days to get to this point, so it is getting kind
of a pain in the butt to run and watch the reduces fail and the job keep
failing no matter what.

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson

Open issue:
https://issues.apache.org/jira/browse/HADOOP-5539

Billy


"Billy Pearson"  
wrote in message news:cecf0598d9ca40a08e777568361de...@billypc...





How are you concluding that the intermediate output is compressed from 
the map, but not in the reduce? -C


my hadoop-site.xml


<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to compressed as SequenceFiles, how should
   they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>


from the job.xml

mapred.output.compress = false // final output
mapred.compress.map.output = true // map output

+ I can head the files from the command line and read the key/value in the 
reduce intermediate merges but not the map.out files.









Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson
If CompressMapOutput is set, then it should carry all the way to the reduce, 
including the map.out files and the intermediate files.
I added some logging to the Merger; I have to wait until some more jobs 
finish before I can rebuild and restart to see the logging,
but that will confirm whether or not the codec is null when it gets to line 
432 and the writer is created for the intermediate files.

If it is null I will open an issue.

Billy


"Stefan Will"  wrote in message 
news:c5e7dc6d.1840d%stefan.w...@gmx.net...
I noticed this too. I think the compression only applies to the final mapper
and reducer outputs, but not any intermediate files produced. The reducer
will decompress the map output files after copying them, and then compress
its own output only after it has finished.

I wonder if this is by design, or just an oversight.

-- Stefan


From: Billy Pearson 


Reply-To: 
Date: Wed, 18 Mar 2009 22:14:07 -0500
To: 
Subject: Re: intermediate results not getting compressed

I can run head on the map.out files and I get compressed garbage, but if I
run head on an intermediate file I can read the data in the file clearly, so
compression is not getting passed on, even though I am setting
CompressMapOutput to true by default in my hadoop-site.xml file.

Billy


"Billy Pearson" 
wrote in message news:gpscu3$66...@ger.gmane.org...

The intermediate.X files are not getting compressed for some reason, not
sure why.
I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it is not
working correctly; the compression is not passing from the map output files
to the on-disk merge of the intermediate.X files.

tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask:
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask:
In-memory merge complete: 3 files left.
2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: 
Keeping

3 segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: 
Merging

1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: 
Merging

0 segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 
1733

sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30
intermediate segments out of a total of 1306

See, the size of the files is about ~70GB (70359998581); these are
compressed, and at this point it has gone from 1733 files to 1306 left to
merge, yet the intermediate.X files are well over 200GB and we are not even
close to done. If compression were working we should not see tasks failing at
this point in the job because of lack of hard drive space, since as we merge
we delete the merged file from the output folder.

I only see this happening when there are too many files left that did not
get merged during the shuffle stage and it starts on-disk merging.
The tasks that complete the merges and keep it below the io.sort size, in my
case 30, skip the on-disk merge and complete using normal hard drive space.

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson





How are you concluding that the intermediate output is compressed from 
the map, but not in the reduce? -C


my hadoop-site.xml


<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to compressed as SequenceFiles, how should
   they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>



from the job.xml

mapred.output.compress = false // final output
mapred.compress.map.output = true // map output

+ I can head the files from the command line and read the key/value in the 
reduce intermediate merges but not the map.out files.





Re: intermediate results not getting compressed

2009-03-18 Thread Billy Pearson
I can run head on the map.out files and I get compressed garbage, but if I run 
head on an intermediate file I can read the data in the file clearly, so 
compression is not getting passed on, even though I am setting CompressMapOutput 
to true by default in my hadoop-site.xml file.


Billy


"Billy Pearson"  
wrote in message news:gpscu3$66...@ger.gmane.org...
The intermediate.X files are not getting compressed for some reason, not 
sure why.

I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it is not 
working correctly; the compression is not passing from the map output files 
to the on-disk merge of the intermediate.X files.


tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: 
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: 
In-memory merge complete: 3 files left.
2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 
3 segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 
1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 
0 segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 
sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1306


See, the size of the files is about ~70GB (70359998581); these are 
compressed, and at this point it has gone from 1733 files to 1306 left to 
merge, yet the intermediate.X files are well over 200GB and we are not even 
close to done. If compression were working we should not see tasks failing at 
this point in the job because of lack of hard drive space, since as we merge 
we delete the merged file from the output folder.

I only see this happening when there are too many files left that did not 
get merged during the shuffle stage and it starts on-disk merging.
The tasks that complete the merges and keep it below the io.sort size, in my 
case 30, skip the on-disk merge and complete using normal hard drive space.

Anyone care to take a look?
This job takes two or more days to get to this point, so it is getting kind of 
a pain in the butt to run and watch the reduces fail and the job keep 
failing no matter what.

I can post the tail of this task log when it fails to show you how far it 
gets before it runs out of space. Before the reduce on-disk merge starts, the 
disks are about 35-40% used on 500GB drives, with two tasks running at the 
same time.


Billy Pearson







Re: intermediate results not getting compressed

2009-03-18 Thread Billy Pearson
The intermediate.X files are not getting compressed for some reason, not 
sure why.

I downloaded and built the latest branch for 0.19.

o.a.h.mapred.Merger.class line 432
new Writer(conf, fs, outputFile, keyClass, valueClass, codec);

This seems to use the codec defined above, but for some reason it is not 
working correctly; the compression is not passing from the map output files 
to the on-disk merge of the intermediate.X files.


tail task report from one server:

2009-03-18 19:19:02,643 INFO org.apache.hadoop.mapred.ReduceTask: 
Interleaved on-disk merge complete: 1730 files left.
2009-03-18 19:19:02,645 INFO org.apache.hadoop.mapred.ReduceTask: In-memory 
merge complete: 3 files left.
2009-03-18 19:19:02,650 INFO org.apache.hadoop.mapred.ReduceTask: Keeping 3 
segments, 39835369 bytes in memory for intermediate, on-disk merge
2009-03-18 19:19:03,878 INFO org.apache.hadoop.mapred.ReduceTask: Merging 
1730 files, 70359998581 bytes from disk
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 
segments, 0 bytes from memory into reduce
2009-03-18 19:19:03,909 INFO org.apache.hadoop.mapred.Merger: Merging 1733 
sorted segments
2009-03-18 19:19:04,161 INFO org.apache.hadoop.mapred.Merger: Merging 22 
intermediate segments out of a total of 1733
2009-03-18 19:21:43,693 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1712
2009-03-18 19:27:07,033 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1683
2009-03-18 19:33:27,669 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1654
2009-03-18 19:40:38,243 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1625
2009-03-18 19:48:08,151 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1596
2009-03-18 19:57:16,300 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1567
2009-03-18 20:07:34,224 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1538
2009-03-18 20:17:54,715 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1509
2009-03-18 20:28:49,273 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1480
2009-03-18 20:39:28,830 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1451
2009-03-18 20:50:23,706 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1422
2009-03-18 21:01:36,818 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1393
2009-03-18 21:13:09,509 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1364
2009-03-18 21:25:17,304 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1335
2009-03-18 21:36:48,536 INFO org.apache.hadoop.mapred.Merger: Merging 30 
intermediate segments out of a total of 1306


See, the size of the files is about ~70GB (70359998581); these are compressed, 
and at this point it has gone from 1733 files to 1306 left to merge, yet the 
intermediate.X files are well over 200GB and we are not even close to done. 
If compression were working we should not see tasks failing at this point in 
the job because of lack of hard drive space, since as we merge we delete the 
merged file from the output folder.

I only see this happening when there are too many files left that did not get 
merged during the shuffle stage and it starts on-disk merging.
The tasks that complete the merges and keep it below the io.sort size, in my 
case 30, skip the on-disk merge and complete using normal hard drive space.

Anyone care to take a look?
This job takes two or more days to get to this point, so it is getting kind of 
a pain in the butt to run and watch the reduces fail and the job keep failing 
no matter what.

I can post the tail of this task log when it fails to show you how far it 
gets before it runs out of space. Before the reduce on-disk merge starts, the 
disks are about 35-40% used on 500GB drives, with two tasks running at the same 
time.


Billy Pearson 





Re: intermediate results not getting compressed

2009-03-17 Thread Billy Pearson
Watching a second job with more reduce tasks running, it looks like the 
in-memory merges are working correctly with compression.

The task I was watching failed and was running again; it shuffled all the map 
output files and then started the merge after everything was copied, so none 
was merged in memory; it was closed before the merging started.
If it helps, the name of the output files is intermediate.X, stored in the 
folder mapred/local/job-taskname/intermediate.X,
while the in-memory merges are stored in 
mapred/local/taskTracker/jobcache/job-name/taskname/

The non-compressed ones are the intermediate.X files above.

Billy


"Chris Douglas"  wrote in 
message news:9bb78c3a-efab-45c3-8cc3-25aab60df...@yahoo-inc.com...
My problem is the output from merging the intermediate map output files 
is not compressed, so I lose all the benefit of compressing the map file 
output to save disk space, because the merged map output files are no 
longer compressed.


It should still be compressed, unless there's some bizarre regression. 
More segments will be around simultaneously (since the segments not  yet 
merged are still on disk), which clearly puts pressure on  intermediate 
storage, but if the map outputs are compressed, then the  merged map 
outputs at the reduce must also be compressed. There's no  place in the 
intermediate format to store compression metadata, so  either all are or 
none are. Intermediate merges should also follow the  compression spec of 
the initiating merger, too (o.a.h.mapred.Merger: 447).


How are you concluding that the intermediate output is compressed from 
the map, but not in the reduce? -C




- Original Message - From: "Chris Douglas" 

Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: 


Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed


I am running 0.19.1-dev, r744282. I have searched the issues but 
found nothing about the compression.


AFAIK, there are no open issues that prevent intermediate  compression 
from working. The following might be useful:


http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

Should the intermediate results not be compressed also if the map 
output files are set to be compressed?


These are controlled by separate options.

FileOutputFormat::setCompressOutput enables/disables compression  on 
the final output
JobConf::setCompressMapOutput enables/disables compression of the 
intermediate output


If not then why do we have the map compression option just to save 
network traffic?


That's part of it. Also to save on disk bandwidth and intermediate 
space. -C











Re: intermediate results not getting compressed

2009-03-17 Thread Billy Pearson



I understand that I have CompressMapOutput set and it works: the map outputs 
are compressed. But on the reduce end it downloads X files and then merges 
those X files into one intermediate file to keep the number of files to a 
minimum <= io.sort.factor.

My problem is that the output from merging the intermediate map output files 
is not compressed, so I lose all the benefit of compressing the map file output 
to save disk space, because the merged map output files are no longer 
compressed.

Note there are two different types of intermediate files: the map outputs, and 
then the one the reduce merges the map outputs into to meet the set 
io.sort.factor.


Billy



- Original Message - 
From: "Chris Douglas" 

Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: 
Sent: Tuesday, March 17, 2009 12:33 AM
Subject: Re: intermediate results not getting compressed


I am running 0.19.1-dev, r744282. I have searched the issues but  found 
nothing about the compression.


AFAIK, there are no open issues that prevent intermediate compression 
from working. The following might be useful:


http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

Should the intermediate results not be compressed also if the map  output 
files are set to be compressed?


These are controlled by separate options.

FileOutputFormat::setCompressOutput enables/disables compression on  the 
final output
JobConf::setCompressMapOutput enables/disables compression of the 
intermediate output


If not then why do we have the map compression option just to save 
network traffic?


That's part of it. Also to save on disk bandwidth and intermediate 
 space. -C







intermediate results not getting compressed

2009-03-16 Thread Billy Pearson
I am running a large streaming job that processes about 3TB of data. I am 
seeing large jumps in hard drive space usage in the reduce part of the jobs, 
and I tracked the problem down. The job is set to compress map outputs, but 
looking at the intermediate files on the local drives, the intermediate files 
are not getting compressed during/after merges. I am going from having, say, 
2GB of mapfile.out files to having one intermediate.X file 100-350% larger 
than the map files. I have looked at one of the files and confirmed that it is 
not getting compressed, as I can read the data in it. If it were only one merge 
then it would not be a problem, but when you are merging 70-100 of these you 
use tons of GBs, and my tasks are starting to die as they run out of hard drive 
space, which in the end kills the job.

I am running 0.19.1-dev, r744282. I have searched the issues but found 
nothing about the compression.
Should the intermediate results not be compressed also if the map output 
files are set to be compressed?
If not, then why do we have the map compression option, just to save network 
traffic?





Re: Hadoop job using multiple input files

2009-02-06 Thread Billy Pearson

If it were me, I would prefix the map output values with a: and n:,
a: for the address and n: for the name.
Then in the reduce you could test the value to see if it is the address or the 
name with if statements. There is no need to worry about which one comes first; 
just make sure they both have been set before outputting in the reduce (see the 
sketch below).
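
To make that concrete, here is a rough streaming-style sketch of such a reducer 
in PHP (the original discussion is about a Java job; this just illustrates the 
value tagging, assuming tab-separated "number<TAB>tagged value" input sorted by 
number):

<?php
// Reducer sketch: values were tagged in the mappers as "n:<name>" or
// "a:<address>"; emit "name<TAB>address" once both are seen for a key.
$curKey = null; $name = null; $addr = null;

function flush_record($name, $addr) {
    if ($name !== null && $addr !== null) {
        echo $name, "\t", $addr, "\n";
    }
}

while (($line = fgets(STDIN)) !== false) {
    list($key, $val) = explode("\t", rtrim($line, "\n"), 2);
    if ($key !== $curKey) {
        flush_record($name, $addr);       // emit the previous key's record
        $curKey = $key; $name = null; $addr = null;
    }
    if (substr($val, 0, 2) == "n:") {      // name record
        $name = substr($val, 2);
    } elseif (substr($val, 0, 2) == "a:") { // address record
        $addr = substr($val, 2);
    }
}
flush_record($name, $addr);
?>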


Billy

"Amandeep Khurana"  wrote in 
message news:35a22e220902061646m941a545o554b189ed5bdb...@mail.gmail.com...

Ok. I was able to get this to run but have a slight problem.

*File 1*
1   10
2   20
3   30
3   35
4   40
4   45
4   49
5   50

*File 2*

a   10   123
b   20   21321
c   45   2131
d   40   213

I want to join the above two based on the second column of file 1. Here's
what I am getting as the output.

*Output*
1   a   123
b   21321   2
3
3
4   d   213
c   2131   4
4
5

The ones in red are in the format I want it. The ones in blue have their
order reversed. How can I get them to be in the correct order too?
Basically, the order in which the iterator iterates over the values is not
consistent. How can I get this to be consistent?

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 2:58 PM, Amandeep Khurana 
 wrote:



Ok. Got it.

Now, how would my reducer know whether the name is coming first or the
address? Is it going to be in the same order in the iterator as the files
are read (alphabetically) in the mapper?


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Feb 6, 2009 at 5:22 AM, Jeff Hammerbacher 
wrote:



You put the files into a common directory, and use that as your input to the
MapReduce job. You write a single Mapper class that has an "if" statement
examining the map.input.file property, outputting "number" as the key for
both files, but "address" for one and "name" for the other. By using a
common key ("number"), you'll ensure that the name and address make it to
the same reducer after the shuffle. In the reducer, you'll then have the
relevant information (in the values) you need to create the name, address
pair.

On Fri, Feb 6, 2009 at 2:17 AM, Amandeep Khurana 


wrote:

> Thanks Jeff...
> I am not 100% clear about the first solution you have given. How do I get
> the multiple files to be read and then fed into a single reducer? Should I
> have multiple mappers in the same class and have different job configs for
> them, run two separate jobs with one outputting the key as (name, number)
> and the other outputting the value as (number, address) into the reducer?
> Not clear what I'll be doing with the map.input.file here...
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Fri, Feb 6, 2009 at 1:55 AM, Jeff Hammerbacher 
> 
> >wrote:
>
> > Hey Amandeep,
> >
> > You can get the file name for a task via the "map.input.file" property. For
> > the join you're doing, you could inspect this property and output (number,
> > name) and (number, address) as your (key, value) pairs, depending on the
> > file you're working with. Then you can do the combination in your reducer.
> >
> > You could also check out the join package in contrib/utils (
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/contrib/utils/join/package-summary.html
> > ),
> > but I'd say your job is simple enough that you'll get it done faster with
> > the above method.
> >
> > This task would be a simple join in Hive, so you could consider using Hive
> > to manage the data and perform the join.
> >
> > Later,
> > Jeff
> >
> > On Fri, Feb 6, 2009 at 1:34 AM, Amandeep Khurana 
> > wrote:
> >
> > > Is it possible to write a map reduce job using multiple input files?
> > >
> > > For example:
> > > File 1 has data like - Name, Number
> > > File 2 has data like - Number, Address
> > >
> > > Using these, I want to create a third file which has something like -
> > > Name, Address
> > >
> > > How can a map reduce job be written to do this?
> > >
> > > Amandeep
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> >
>











Re: hadoop balancing data

2009-01-24 Thread Billy Pearson

I did not think about that; good points.
I found a way to keep it from happening:
I set dfs.datanode.du.reserved in the config file.
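
Something like this in hadoop-site.xml (the amount is just an example):

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes per volume the datanode always leaves free for non-DFS use -->
  <value>10737418240</value>
</property>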


"Hairong Kuang"  wrote in 
message news:c59f9164.ed09%hair...@yahoo-inc.com...
%Remaining fluctuates much more than %dfs used. This is because dfs shares
the disks with mapred, and mapred tasks may use a lot of disk temporarily. So
trying to keep the same %free is impossible most of the time.

Hairong


On 1/19/09 10:28 PM, "Billy Pearson" 
 wrote:



Why do we not use the Remaining % in place of the Used % when we are
selecting datanodes for new data and when running the balancer?
From what I can tell, we are using the % used and we do not factor in
non-DFS used at all.
I see a datanode with only a 60GB hard drive fill up completely (100%) before
the other servers that have 130+GB hard drives get half full.
Seems like trying to keep the same % free on the drives in the cluster would
be more optimal in production.
I know this still may not be perfect, but it would be nice if we tried.

Billy










hadoop balancing data

2009-01-19 Thread Billy Pearson
Why do we not use the Remaining % in place of the Used % when we are 
selecting datanodes for new data and when running the balancer?
From what I can tell, we are using the % used and we do not factor in non-DFS 
used at all.
I see a datanode with only a 60GB hard drive fill up completely (100%) before 
the other servers that have 130+GB hard drives get half full.
Seems like trying to keep the same % free on the drives in the cluster would 
be more optimal in production.

I know this still may not be perfect, but it would be nice if we tried.

Billy




Re: Namenode BlocksMap on Disk

2008-11-26 Thread Billy Pearson

Doug:
If we use the heap as a cache and you have a large cluster, then you will 
have the memory on the NN to handle keeping the whole namespace in memory.
We are also looking for a way to support smaller clusters that might overrun 
their heap size, causing the cluster to crash.
So if the NN has the room to cache the whole namespace, then the larger 
clusters will not see any disk hits once the namespace is fully loaded into 
memory.


Billy



"Doug Cutting" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]

Dennis Kubes wrote:
2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


I think the assumption is that it would be considerably more than slight 
degradation.  I've seen the namenode benchmarked at over 50,000 opens per 
second.  If file data is on disk, and the namespace is considerably bigger 
than RAM, then a seek would be required per access.  At 10MS/seek, that 
would give only 100 opens per second, or 500x slower. Flash storage today 
peaks at around 5k seeks/second.


For smaller clusters the namenode might not need to be able to perform 50k 
opens/second, but for larger clusters we do not want the namenode to 
become a bottleneck.


Doug






Re: Namenode BlocksMap on Disk

2008-11-26 Thread Billy Pearson
I would like to see something like this also. I run 32-bit servers, so I am 
limited in how much memory I can use for the heap. Besides just storing to 
disk, I would like to see some sort of cache, like a block cache, that will 
cache parts of the BlocksMap; this would help reduce the hits to disk for 
lookups and still give us the ability to lower the memory requirement for the 
namenode.


Billy


"Dennis Kubes" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]
From time to time a message pops up on the mailing list about OOM errors 
for the namenode because of too many files.  Most recently there was a 1.7 
million file installation that was failing.  I know the simple solution to 
this is to have a larger java heap for the namenode.  But the non-simple 
way would be to convert the BlocksMap for the NameNode to be stored on 
disk and then queried and updated for operations.  This would eliminate 
memory problems for large file installations but also might degrade 
performance slightly.  Questions:


1) Is there any current work to allow the namenode to store on disk versus 
in memory?  This could be a configurable option.


2) Besides possible slight degradation in performance, is there a reason 
why the BlocksMap shouldn't or couldn't be stored on disk?


I am willing to put forth the work to make this happen.  Just want to make 
sure I am not going down the wrong path to begin with.


Dennis






Re: The Case of a Long Running Hadoop System

2008-11-15 Thread Billy Pearson
If I understand correctly, the secondary namenode merges the edits log into 
the fsimage and reduces the edit log size.
That is likely the root of your problems: 8.5G seems large and is likely 
putting a strain on your master server's memory and I/O bandwidth.

Why do you not have a secondary namenode?

If you do not have the memory on the master, I would look into stopping a 
datanode/tasktracker on one server and loading the secondary namenode on it.

Let it run for a while and watch your log for the secondary namenode; you 
should see your edit log get smaller.

I am not an expert, but that would be my first action.

Billy



"Abhijit Bagri" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]

Hi,

This is a long mail as I have tried to put in as much details as might
help any of the Hadoop dev/users to help us out. The gist is this:

We have a long running Hadoop system (masters not restarted for about
3 months). We have recently started seeing the DFS responding very
slowly which has resulted in failures on a system which depends on
Hadoop. Further, the DFS seems to be in an unstable state (i.e. if fsck is
a good representation, which I believe it is). The edits file

These are the details (skip/return here later and jump to the
questions at the end of the mail for a quicker read) :

Hadoop Version: 0.15.3 on 32 bit systems.

Number of slaves: 12
Slaves heap size: 1G
Namenode heap: 2G
Jobtracker heap: 2G

The namenode and jobtrackers have not been restarted for about 3
months. We did restart slaves (all of them within a few hours) a few
times for some maintenance in between though. We do not have a
secondary namenode in place.

There is another system X which talks to this hadoop cluster. X writes
to the Hadoop DFS and submits jobs to the Jobtracker. The number of
jobs submitted to Hadoop so far is over 650,000 ( I am using the job
id for jobs for this), each job may read/write to multiple files and
has several dependent libraries which it loads from Distributed Cache.

Recently, we started seeing that there were several timeouts happening
while X tries to read/write to the DFS. This in turn results in DFS
becoming very slow in response. The writes are especially slow. The
trace we get in the logs are:

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.net.SocketInputStream.read(SocketInputStream.java:182)
at java.io.DataInputStream.readShort(DataInputStream.java:284)
at org.apache.hadoop.dfs.DFSClient
$DFSOutputStream.endBlock(DFSClient.java:1660)
at org.apache.hadoop.dfs.DFSClient
$DFSOutputStream.close(DFSClient.java:1733)
at org.apache.hadoop.fs.FSDataOutputStream
$PositionCache.close(FSDataOutputStream.java:49)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:
64)
...

Also, datanode logs show a lot of traces like these:

2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode:
DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is
valid, and cannot be written to.
at
org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
at org.apache.hadoop.dfs.DataNode
$BlockReceiver.<init>(DataNode.java:1257)
at org.apache.hadoop.dfs.DataNode
$DataXceiver.writeBlock(DataNode.java:901)
at org.apache.hadoop.dfs.DataNode
$DataXceiver.run(DataNode.java:804)
at java.lang.Thread.run(Thread.java:595)

and these

2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode:
java.io.IOException: Error in deleting blocks.
at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:
719)
at org.apache.hadoop.

The first one seems to be the same as HADOOP-1845. We have been seeing
this for a long time, but as HADOOP-1845 says, it hasn't been creating
any problems. We have thus ignored it for a while, but these recent
problems have raised eyebrows again about whether this really is a non-issue.

We also saw a spike in the number of TCP connections on system X.
Also, lsof on system X shows a lot of connections to datanodes
in the CLOSE_WAIT state.

While investigating the issue, the fsck output started saying that DFS
is corrupt. The output is:

Total size:152873519485 B
Total blocks:  1963655 (avg. block size 77851 B)
Total dirs:1390820
Total files:   1962680
Over-replicated blocks:111 (0.0061619785 %)
Under-replicated blocks:   13 (6.6203077E-4 %)
Target replication factor: 3
Real replication factor:   3.503


The filesystem under path '/' is CORRUPT

After running fsck -move, the system went back to a healthy state.
However, after some time it again started showing corrupt, and fsck -move
again restored it to a healthy state.

We are considering restarting the masters as a possible solution. I
have earlier had issues restarting the Master with a 2G heap and an
edits file size of about 2.5G. So, we decided to look up the size of the

Re: Any Way to Skip Mapping?

2008-11-03 Thread Billy Pearson
I need the reduce to sort so I can merge the records and output them in sorted 
order.
I do not need to join any data, just merge rows together, so I do not think 
the join will be any help.


I am storing the data as <key, value> records with a sorted map as the value, 
and on the merge I need to take all the rows that have the same key, 
merge all the sorted maps together, and output one row that has all the data 
for that key.

something like what HBase is doing, but without the in-memory indexes.

Maybe it will become an option later down the road to skip the maps and let 
the reduce shuffle directly from the input splits.


Billy




"Owen O'Malley" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]
If you don't need a sort, which is what it sounds like, Hadoop supports 
that by turning off the reduce. That is done by setting the number of 
reduces to 0. This typically is much faster than if you need the sort. It 
also sounds like you may need/want the library that does map-side joins.

http://tinyurl.com/43j5pp

-- Owen
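
For illustration, a minimal sketch of the reduce-free setup Owen describes, written against the org.apache.hadoop.mapred API (0.17 or later); the IdentityMapper and the command-line input/output paths are assumptions for the example, not something from the thread:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Zero reduces turns off the sort/shuffle entirely; each map task
    // writes its output straight into the job's output directory on HDFS.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);

    // With the default TextInputFormat the identity map emits <offset, line>.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}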






Any Way to Skip Mapping?

2008-11-01 Thread Billy Pearson
I have a job that merges multiple output directories of MR jobs that run over 
time.


Their outputs are all in the same format, and the MR job that merges them uses a 
mapper that just outputs the same key/value it is given, so basically the 
same as the IdentityMapper.


The problem I am seeing is that as I add more and more data to the records, I 
see longer and longer map times.


Is there a way to skip the mapping of the input splits and just let the 
reduces copy directly from HDFS?


Here is an example from my job:

20 reduce tasks run at once, but the job has 60 reduces.

The first round of reduces finishes the shuffle in 2hrs, 29mins, 25secs 
(this includes the map times).
The second and third rounds of reduces have the map outputs ready to copy, so 
there is no wait (13mins, 30secs).


So right now I am wasting 2hrs, 10mins for the IdentityMapper to run on all 
the inputs, and this will get longer and longer as the amount of data 
increases.
I need the reduce to merge and sort the inputs; does anyone know of a better 
way to do a merge like this in an MR job?





Re: Improving locality of table access...

2008-10-22 Thread Billy Pearson

generate a patch and post it here
https://issues.apache.org/jira/browse/HBASE-675

Billy

"Arthur van Hoff" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]

Hi,

Below is some code for improving the read performance of large tables by
processing each region on the host holding that region. We measured 50-60%
lower network bandwidth.

To use this class instead of the 
org.apache.hadoop.hbase.mapred.TableInputFormat class, use:

   jobconf.setInputFormat(ellerdale.mapreduce.TableInputFormatFix.class);

Please send me feedback if you can think of better ways to do this.

--
Arthur van Hoff - Grand Master of Alphabetical Order
The Ellerdale Project, Menlo Park, CA
[EMAIL PROTECTED], 650-283-0842


-- TableInputFormatFix.java --

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements.  See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership.  The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License.  You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Author: Arthur van Hoff, [EMAIL PROTECTED]

package ellerdale.mapreduce;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;

import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.util.*;

//
// Attempt to fix the localized nature of table segments.
// Compute table splits so that they are processed locally.
// Combine multiple splits to avoid the number of splits exceeding numSplits.
// Sort the resulting splits so that the shortest ones are processed last.
// The resulting savings in network bandwidth are significant (we measured 60%).
//
public class TableInputFormatFix extends TableInputFormat
{
   public static final int ORIGINAL= 0;
   public static final int LOCALIZED= 1;
   public static final int OPTIMIZED= 2;// not yet functional

   //
   // A table split with a location.
   //
    static class LocationTableSplit extends TableSplit implements Comparable
    {
   String location;

   public LocationTableSplit()
   {
   }
    public LocationTableSplit(byte [] tableName, byte [] startRow, byte [] endRow, String location)
   {
   super(tableName, startRow, endRow);
   this.location = location;
   }
   public String[] getLocations()
   {
   return new String[] {location};
   }
   public void readFields(DataInput in) throws IOException
   {
   super.readFields(in);
   this.location = Bytes.toString(Bytes.readByteArray(in));
   }
   public void write(DataOutput out) throws IOException
   {
   super.write(out);
   Bytes.writeByteArray(out, Bytes.toBytes(location));
   }
   public int compareTo(Object other)
   {
   LocationTableSplit otherSplit = (LocationTableSplit)other;
    int result = Bytes.compareTo(getStartRow(), otherSplit.getStartRow());
   return result;
   }
   public String toString()
   {
    return location.substring(0, location.indexOf('.')) + ": " + Bytes.toString(getStartRow()) + "-" + Bytes.toString(getEndRow());
   }
   }

   //
   // A table split with a location that covers multiple regions.
   //
   static class MultiRegionTableSplit extends LocationTableSplit
   {
   byte[][] regions;

   public MultiRegionTableSplit()
   {
   }
    public MultiRegionTableSplit(byte[] tableName, String location, byte[][] regions) throws IOException
   {
   super(tableName, regions[0], regions[regions.length-1], location);
   this.location = location;
   this.regions = regions;
   }
   public void readFields(DataInput in) throws IOException
   {
   super.readFields(in);
   int n = in.readInt();
   regions = new byte[n][];
   for (int i = 0 ; i < n ; i++) {
   regions[i] = Bytes.readByteArray(in);
   }
   }
   public void write(DataOutput out) throws IOException
   {
   super.write(out);
   out.writeInt(regions.length);
   for (int i = 0 ; i < regions.length ; i++) {
   Bytes.writeByteArray(out, regions[i]);
   }
   }
   public String toString()
   {
   String str = location.substring(0, location.indexOf('.')) + ": ";
   for (int i = 0 ; i < regions.length ; i += 2) {
   if (i > 0) {
   str += ", ";
   }
   str += Bytes.toString(regi

Re: Maps running after reducers complete successfully?

2008-10-03 Thread Billy Pearson

Do we not have an option to store the map results in hdfs?

Billy

"Owen O'Malley" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]
It isn't optimal, but it is the expected behavior. In general, when we 
lose a TaskTracker, we want the map outputs regenerated so that any 
reduces that need to re-run (including speculative execution) can fetch 
them. We could handle it as a special case if:

   1. We didn't lose any running reduces.
   2. All of the reduces (including speculative tasks) are done with shuffling.
   3. We don't plan on launching any more speculative reduces.

If all 3 hold, we don't need to re-run the map tasks. Actually doing so 
would be a pretty involved patch to the JobTracker/Schedulers.


-- Owen






Re: Can hadoop sort by values rather than keys?

2008-09-28 Thread Billy Pearson

You might be able to use InverseMapper to help flip the key/value to value/key.

Billy
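
For illustration, a minimal sketch of the InverseMapper pattern Billy is pointing at (a second, sort-only pass), written against the old org.apache.hadoop.mapred API; the paths and the assumption that an earlier job produced <Text, LongWritable> SequenceFiles are not from the thread:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InverseMapper;

public class SortByValue {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortByValue.class);
    conf.setJobName("sort-by-value");

    // Assumes an earlier job wrote SequenceFiles of <Text, LongWritable>
    // pairs (e.g. word -> count).
    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("counts"));
    FileOutputFormat.setOutputPath(conf, new Path("counts-sorted"));

    // InverseMapper flips <Text, LongWritable> into <LongWritable, Text>,
    // so the framework's key sort effectively orders records by value.
    conf.setMapperClass(InverseMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setNumReduceTasks(1);   // a single reducer gives one totally sorted file

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}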


"Jeremy Chow" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]

Hi list,
  The default way Hadoop does its sorting is by keys; can it sort by
values rather than keys?

Regards,
Jeremy
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com






Re: adding nodes while computing

2008-09-28 Thread Billy Pearson
You should be able to add nodes to the cluster while jobs are running; the 
jobtracker should start assigning tasks to the new tasktrackers, and DFS should 
start using the nodes for storage.


But map output files are stored locally on the slaves and copied to the reduce tasks, 
so if a node goes down during an MR job, its maps will have to be run again.


Billy

"Francois Berenger" <[EMAIL PROTECTED]> wrote 
in message news:[EMAIL PROTECTED]

Hello,

Is it possible to add slaves whose IP addresses are not known in advance
to a Hadoop cluster while a computation is going on?

And the reverse capability: is it possible to cleanly and permanently
remove a slave node from the Hadoop cluster?

Thank you,
François.






Re: Is Hadoop the thing for us ?

2008-06-27 Thread Billy Pearson
I do not totally understand the job you are running, but if each simulation 
can run independently of the others, then you could run a MapReduce job that 
spreads the simulations over many servers so each one runs one or 
more at a time. This gives you a level of protection against servers 
going down and takes care of spreading the work across servers. It should 
also be able to handle more than the 100K-simulation mark you 
stated you would like to run. You would just need to write the input code 
to split the simulations into splits that the MR framework can 
work with.


Billy

"Igor Nikolic" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]

Thank you for your comment, it did confirm my suspicions.

You framed the problem correctly. I will probably invest a bit of time 
studying the framework anyway, to see if a rewrite is interesting, since 
we hit scaling limitations on our agent scheduler framework. Our main 
computational load is the massive amount of agent reasoning (think 
JBoss Rules) and inter-agent communication (they need to sell and buy 
stuff from each other), so I am not sure if it is at all possible to break 
it down into small tasks, especially if this needs to happen across CPUs; 
the latency is going to kill us.


Thanks
igor

John Martyniak wrote:

I am new to Hadoop, so take this information with a grain of salt.
But the power of Hadoop is breaking down big problems into small pieces and
spreading them across many (thousands of) machines, in effect creating a
massively parallel processing engine.

But in order to take advantage of that functionality you must write your
application to take advantage of it, using the Hadoop frameworks.

So if I understand your dilemma correctly, I do not think that Hadoop is
for you, unless you want to re-write your app to take advantage of it. And
I suspect that if you have access to a traditional cluster, that will be a
better alternative for you.

Hope that this helps some.

-John


On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic 
<[EMAIL PROTECTED]> wrote:




Hello list

We will be getting access to a cluster soon, and I was wondering whether
I should use Hadoop, or whether I am better off with the usual batch
schedulers such as ProActive, etc. I am not a CS/CE person, and from reading
the website I cannot get a sense of whether Hadoop is for me.

A little background:
We have a relatively large agent-based simulation (20+ MB jar) that needs
to be swept across very large parameter spaces. Agents communicate only
within the simulation, so there is no interprocess communication. The
parameter vector is at most 20 long, the simulation may take 5-10 minutes on a
normal desktop, and it might return a few MB of raw data. We need 10k-100K
runs, more if possible.



Thanks for advice, even a short yes/no is welcome

Greetings
Igor

--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl










--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl







Re: how to write data to one file on HDFS by some clients Synchronized?

2008-06-24 Thread Billy Pearson

https://issues.apache.org/jira/browse/HADOOP-1700


"过佳" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
Does HDFS support it? I need it to be synchronized, e.g. I call many clients
to write a lot of IntWritables to one file.


Best.
Jarvis.






MR input Format Type mismatch

2008-06-21 Thread Billy Pearson
2008-06-21 20:30:18,928 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Type mismatch in key from map: expected 
org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:419)

at com.compspy.mapred.RecordImport$MapClass.map(RecordImport.java:96)
at com.compspy.mapred.RecordImport$MapClass.map(RecordImport.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

I am trying to run an MR job that imports a file made with PHP. I have 
the lines formatted like this:


Number\tText\r\n

So each row has a line number from 0 - X, a tab, then the Text, and then a \r\n 
newline for each record.
I do not need the line number, but I have been getting errors like the one above, 
so I reformatted my file to include one while trying to get this to work.


Is there something I am missing? Should there be "" around the text or <> 
around the numbers, etc.?


The Text is going to be used to make HBase updates/inserts in the reduce, so 
I want to output Text,RowResult pairs.
If I do not get the type mismatch, I get a casting error with the different 
things I have tried.

I can reformat my input file if needed; I just need to get my mapper to work.
How can I feed my file into an MR job?
Below is the mapper I am trying to use:

 public static class MapClass extends MapReduceBase
   implements Mapper<LongWritable, Text, Text, RowResult> {

    public void map(LongWritable key, Text value,
    OutputCollector<Text, RowResult> output,
    Reporter reporter) throws IOException {

    // do some work here to build rowkey and newvals from the input line

  output.collect(rowkey, newvals);
    }
  }
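
For what it's worth, in the old mapred API the collector checks the emitted keys against the classes declared on the JobConf, and the map output key class defaults to LongWritable, which is exactly the "expected LongWritable, received Text" error above; declaring the types the mapper really emits removes it. A hedged driver sketch follows; the class name RecordImportDriver, the command-line paths, and the use of HBase's RowResult as the value type (the import path may differ by HBase version) are assumptions for illustration:

package com.compspy.mapred;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class RecordImportDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RecordImportDriver.class);
    conf.setJobName("record-import");

    // TextInputFormat hands the mapper <line offset, line text>, which fits
    // the Number\tText lines described above; the offset key can be ignored.
    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(RecordImport.MapClass.class);

    // This is what removes the "Type mismatch in key from map" error:
    // declare the types the mapper actually emits.
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(RowResult.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(RowResult.class);

    JobClient.runJob(conf);
  }
}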




Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
My second question is about the EC2 machines: has anyone solved the hostname 
problem in an automated way?


For example, if I launch an EC2 server to run a tasktracker, the hostname reported 
back to my local cluster is its internal address,
and the local reduce tasks cannot access the map files on the EC2 machine 
with that default hostname.

I get an error:
WARN org.apache.hadoop.mapred.ReduceTask: java.net.UnknownHostException: 
domU-12-31-39-00-A4-05.compute-1.internal



Is there an automated way to start a tasktracker on an EC2 machine using the 
public hostname, so the local tasks can get the maps from the EC2 
machines?

For example, something like:
bin/hadoop-daemon.sh start tasktracker 
host=ec2-xx-xx-xx-xx.z-2.compute-1.amazonaws.com


that I can run to start just the tasktracker with the correct hostname.


What I am trying to do is build a custom AMI image that I can launch 
whenever I need to add extra CPU power to my cluster, and have it automatically 
start the tasktracker via a shell script that runs at startup.

Billy


"Billy Pearson" <[EMAIL PROTECTED]> 
wrote in message news:[EMAIL PROTECTED]
I have a question someone may have answered here before but I can not find 
the answer.


Assuming I have a cluster of servers hosting a large amount of data
I want to run a large job that the maps take a lot of cpu power to run and 
the reduces only take a small amount cpu to run.
I want to run the maps on a group of EC2 servers and run the reduces on 
the local cluster of 10 machines.


The problem I am seeing is the map outputs, if I run the maps on EC2 they 
are stored local on the instance
What I am looking to do is have the map output files stored in hdfs so I 
can kill the EC2 instances sense I do not need them for the reduces.


The only way I can thank to do this is run two jobs one maper and store 
the output on hdfs and then run a second job to run the reduces

from the map outputs store on the hfds.

Is there away to make the mappers store the final output in hdfs?







Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson

I understand how to run it as two jobs; my only question is:
is there a way to make the mappers store their final output in HDFS,

so I can kill the EC2 machines without waiting for the reduce stage to end?

Billy



"Chris K Wensel" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]
Well, to answer your last question first: just set the # of reducers to 
zero.


But you can't just run reducers without mappers (as far as I know, having 
never tried), so your local job will need to run identity mappers in 
order to feed your reducers.

http://hadoop.apache.org/core/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

ckw

On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote:

I have a question someone may have answered here before but I can  not 
find the answer.


Assuming I have a cluster of servers hosting a large amount of data
I want to run a large job that the maps take a lot of cpu power to  run 
and the reduces only take a small amount cpu to run.
I want to run the maps on a group of EC2 servers and run the reduces  on 
the local cluster of 10 machines.


The problem I am seeing is the map outputs, if I run the maps on EC2 
they are stored local on the instance
What I am looking to do is have the map output files stored in hdfs  so I 
can kill the EC2 instances sense I do not need them for the  reduces.


The only way I can thank to do this is run two jobs one maper and  store 
the output on hdfs and then run a second job to run the reduces

from the map outputs store on the hfds.

Is there away to make the mappers store the final output in hdfs?



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
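
For illustration, a hedged sketch of the two-job arrangement discussed above: a first, map-only job (the one that would run on the EC2 tasktrackers) whose output lands in HDFS, then a second job whose identity maps just feed the shuffle for the reduces on the local cluster. The identity classes, paths, types, and reduce count are placeholders; in practice the first job's mapper would be the real CPU-heavy one and the second job's reducer would be the real (cheap) one:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoStageJob {
  public static void main(String[] args) throws Exception {
    Path input = new Path("input");
    Path mapOut = new Path("map-output");      // intermediate data kept in HDFS
    Path finalOut = new Path("final-output");

    // Stage 1: the CPU-heavy maps, with zero reduces so each map writes its
    // output directly to HDFS and the EC2 nodes can be shut down afterwards.
    JobConf stage1 = new JobConf(TwoStageJob.class);
    stage1.setJobName("heavy-maps");
    FileInputFormat.setInputPaths(stage1, input);
    FileOutputFormat.setOutputPath(stage1, mapOut);
    stage1.setMapperClass(IdentityMapper.class);   // stand-in for the real mapper
    stage1.setNumReduceTasks(0);
    stage1.setOutputFormat(SequenceFileOutputFormat.class);
    stage1.setOutputKeyClass(LongWritable.class);  // identity map on text input
    stage1.setOutputValueClass(Text.class);
    JobClient.runJob(stage1);

    // Stage 2: identity maps on the local cluster re-read the stage 1 output
    // from HDFS and feed the shuffle; the reduces run locally.
    JobConf stage2 = new JobConf(TwoStageJob.class);
    stage2.setJobName("local-reduces");
    stage2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(stage2, mapOut);
    FileOutputFormat.setOutputPath(stage2, finalOut);
    stage2.setMapperClass(IdentityMapper.class);
    stage2.setReducerClass(IdentityReducer.class); // stand-in for the real reducer
    stage2.setNumReduceTasks(20);
    stage2.setOutputKeyClass(LongWritable.class);
    stage2.setOutputValueClass(Text.class);
    JobClient.runJob(stage2);
  }
}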











Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
I have a question someone may have answered here before, but I cannot find 
the answer.


Assume I have a cluster of servers hosting a large amount of data.
I want to run a large job where the maps take a lot of CPU power to run and 
the reduces only take a small amount of CPU.
I want to run the maps on a group of EC2 servers and run the reduces on the 
local cluster of 10 machines.


The problem I am seeing is the map outputs: if I run the maps on EC2, they 
are stored locally on the instance.
What I am looking to do is have the map output files stored in HDFS, so I can 
kill the EC2 instances since I do not need them for the reduces.


The only way I can think to do this is to run two jobs: one map-only job that 
stores its output on HDFS, and then a second job that runs the reduces

from the map outputs stored on HDFS.

Is there a way to make the mappers store their final output in HDFS?





Re: Streaming --counters question

2008-06-10 Thread Billy Pearson
Streaming works on stdin and stdout, so unless there were a way to capture the 
stdout as counters, I do not see any other way to report them to the 
jobtracker, unless there were a URL the task could call on the jobtracker to 
update counters.


Billy
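
For contrast, a minimal sketch of how counters work in a native Java job with the old mapred API, where the Reporter handle updates them and the client can read them back once the job finishes; the counter enum, the mapper, and the command-line paths here are made up for illustration and are not from the thread:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class CounterExample {
  // A job-specific counter; the framework sums it across all tasks.
  public enum MyCounters { BLANK_LINES }

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      if (value.toString().trim().length() == 0) {
        // Counter updates travel back to the jobtracker with the task status.
        reporter.incrCounter(MyCounters.BLANK_LINES, 1);
        return;
      }
      output.collect(value, key);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterExample.class);
    conf.setJobName("counter-example");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MapClass.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
    long blanks = job.getCounters().getCounter(MyCounters.BLANK_LINES);
    System.out.println("blank lines seen: " + blanks);
  }
}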



"Miles Osborne" <[EMAIL PROTECTED]> wrote in 
message news:[EMAIL PROTECTED]
Is there support for counters in streaming? In particular, it would be nice
to be able to access these after a job has run.

Thanks!

Miles


--
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.