Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread Scott Carey
Furthermore, if for some reason it is required to dispose of any objects after 
others are GC'd, weak references and a weak reference queue will perform 
significantly better in throughput and latency - orders of magnitude better - 
than finalizers.
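
As a rough illustration (the class and method names here are hypothetical, not
part of any Hadoop API), the pattern is: register each scarce resource against a
WeakReference to its owning object, and close the resource when that reference
shows up on the queue:

import java.io.Closeable;
import java.io.IOException;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ResourceReaper {

    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
    // Maps each weak reference to the resource that must be closed once the
    // referent (the owning object) has been garbage collected.
    private final Map<Reference<?>, Closeable> pending =
            new ConcurrentHashMap<Reference<?>, Closeable>();

    /** Register an owner object and the resource to close when it is GC'd. */
    public void track(Object owner, Closeable resource) {
        pending.put(new WeakReference<Object>(owner, queue), resource);
    }

    /** Poll the queue and close resources whose owners have been collected.
     *  Call this periodically, or run it in a dedicated thread that blocks
     *  on queue.remove() instead of polling. */
    public void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Closeable resource = pending.remove(ref);
            if (resource != null) {
                try {
                    resource.close();
                } catch (IOException ignored) {
                    // best-effort cleanup
                }
            }
        }
    }
}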

On 6/21/09 9:32 AM, brian.lev...@nokia.com brian.lev...@nokia.com wrote:

IMHO, you should never rely on finalizers to release scarce resources since you 
don't know when the finalizer will get called, if ever.

-brian



-Original Message-
From: ext jason hadoop [mailto:jason.had...@gmail.com]
Sent: Sunday, June 21, 2009 11:19 AM
To: core-user@hadoop.apache.org
Subject: Re: Too many open files error, which gets resolved after some time

HDFS/DFS client uses quite a few file descriptors for each open file.

Many application developers (but not the hadoop core) rely on the JVM
finalizer methods to close open files.

This combination, especially when many HDFS files are open, can result in
very large demands for file descriptors for Hadoop clients.
As a general rule we never run a cluster with nofile less than 64k, and for
larger clusters with demanding applications we have had it set 10x higher. I
also believe there was a set of JVM versions that leaked file descriptors
used for NIO in the HDFS core. I do not recall the exact details.

On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 After tracing some more with the lsof utility, I managed to stop the
 growth on the DataNode process, but I still have issues with my DFS client.

 It seems that my DFS client opens hundreds of pipes and eventpolls. Here is
 a small part of the lsof output:

 java  10508  root  387w  FIFO  0,6    6142565  pipe
 java  10508  root  388r  FIFO  0,6    6142565  pipe
 java  10508  root  389u         0,100  6142566  eventpoll
 java  10508  root  390u  FIFO  0,6    6135311  pipe
 java  10508  root  391r  FIFO  0,6    6135311  pipe
 java  10508  root  392u         0,100  6135312  eventpoll
 java  10508  root  393r  FIFO  0,6    6148234  pipe
 java  10508  root  394w  FIFO  0,6    6142570  pipe
 java  10508  root  395r  FIFO  0,6    6135857  pipe
 java  10508  root  396r  FIFO  0,6    6142570  pipe
 java  10508  root  397r         0,100  6142571  eventpoll
 java  10508  root  398u  FIFO  0,6    6135319  pipe
 java  10508  root  399w  FIFO  0,6    6135319  pipe

 I'm using FSDataInputStream and FSDataOutputStream, so this might be
 related
 to pipes?

 So, my questions are:

 1) What causes these pipes/epolls to appear?

 2) More importantly, how can I prevent their accumulation and growth?

 Thanks in advance!

 2009/6/21 Stas Oskin stas.os...@gmail.com

  Hi.
 
  I have HDFS client and HDFS datanode running on same machine.
 
  When I try to access a dozen files at once from the client, several
  times in a row, I start to receive the following errors on the client and
  in the HDFS browse function.
 
  HDFS Client: Could not get block locations. Aborting...
  HDFS browse: Too many open files
 
  I can increase the maximum number of files that can be opened, as I have it
  set to the default 1024, but I would like to solve the problem first, as a
  larger value just means it would run out of files again later on.
 
  So my questions are:
 
  1) Does the HDFS datanode keep any files open, even after the HDFS
  client has already closed them?
 
  2) Is it possible to find out who keeps the files open - the datanode or
  the client (so I could pinpoint the source of the problem)?
 
  Thanks in advance!
 




--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals



Re: :!!

2009-06-19 Thread Scott Carey
Yes, any machine that has network access to the cluster can read/write to hdfs. 
 It does not need to be part of the cluster or running any hadoop daemons.

Such a client just needs to have hadoop set up on it and the configuration 
details for contacting the namenode.
If using the hadoop command line, this means that the hadoop xml config files 
have to be set up.  If you embed the hadoop jars in your own app, you have to 
provide the config information via files or programmatically.

Essentially, the client only needs to know how to contact the namenode.   The 
namenode will automatically tell the hdfs client how to communicate with each
datanode for storing or getting data.
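
As a minimal sketch (the namenode host, port, and path below are made-up
examples; adjust them to your cluster, or put the setting in a hadoop-site.xml
on the client's classpath), an embedded client needs little more than this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the client where the namenode is (example address only).
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; the namenode tells the client which datanodes
        // to stream the block data to.
        Path path = new Path("/tmp/hello.txt");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("hello from a machine outside the cluster");
        out.close();

        fs.close();
    }
}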


On 6/14/09 8:54 PM, Sugandha Naolekar sugandha@gmail.com wrote:

Hello!

I want to execute all my code on a machine that is remote (not a part of the
hadoop cluster).
This code includes file transfers between any nodes (remote, within the
hadoop cluster, or within the same LAN) and HDFS. I would simply have to
write code for this.

Is it possible?

Thanks,
Regards-

--
Regards!
Sugandha



Re: Hadoop scheduling question

2009-06-05 Thread Scott Carey
Even more general context:
Cascading does something similar, but I am not sure if it uses Hadoop's
JobControl or manages dependencies itself.  It definitely runs multiple jobs
in parallel when the dependencies allow it.
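
For reference, a minimal sketch of declaring such a graph against the
org.apache.hadoop.mapred.jobcontrol API of that era (the three JobConf objects
are assumed to be configured elsewhere). JobControl submits every job whose
declared dependencies are satisfied, which is what determines whether the
independent jobs can run concurrently:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ParallelJobGraph {
    public static void runGraph(JobConf confA, JobConf confB, JobConf confJoin)
            throws Exception {
        Job jobA = new Job(confA);       // no dependencies
        Job jobB = new Job(confB);       // no dependencies
        Job join = new Job(confJoin);    // depends on both of the above
        join.addDependingJob(jobA);
        join.addDependingJob(jobB);

        JobControl control = new JobControl("example-graph");
        control.addJob(jobA);
        control.addJob(jobB);
        control.addJob(join);

        // JobControl is a Runnable; it submits jobs as their dependencies
        // complete, so jobA and jobB are both eligible to run right away.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}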



On 6/5/09 11:44 AM, Alan Gates ga...@yahoo-inc.com wrote:

 To add a little context, Pig uses Hadoop's JobControl to schedule its
 jobs.  Pig defines the dependencies between jobs in JobControl, and
 then submits the entire graph of jobs.  So, using JobControl, does
 Hadoop schedule jobs serially or in parallel (assuming no dependencies)?
 
 Alan.
 
 On Jun 5, 2009, at 10:50 AM, Kristi Morton wrote:
 
 Hi Pankil,
 
 Sorry about having to send my question email twice to the list...
 the first time I sent it I had forgotten to subscribe to the list.
 I resent it after subscribing, and your response to the first email
 I sent did not make it into my inbox.  I saw your response on the
 archives list.
 
 So, to recap, you said:
 
  We are not able to carry out all joins in a single job. We also
  tried our hadoop code using
  Pig scripts and found that for each join in the PIG script a new job is
  used. So basically I think it is a sequential process to handle these types
  of join, where the output of one job is required as an input to the other
  one.
 
 
 I, too, have seen this sequential behavior with joins.  However, it
 seems like it could be possible for there to be two jobs executing
 in parallel whose output is the input to the subsequent job.  Is
 this possible or are all jobs scheduled sequentially?
 
 Thanks,
 Kristi
 
 
 



Re: question about when shuffle/sort start working

2009-06-02 Thread Scott Carey

On 6/2/09 1:22 AM, Chuck Lam chuck@gmail.com wrote:

 
 Counters are integers and most convergence criteria are floating point, so
 you'll have to scale your numbers and round them to integers to approximate
 things. (Like I said, it's a bit of a hack.)
 

No rounding or loss of data is necessary for this sort of operation.
For packing floats into ints, or doubles into longs in Java, there are some
convenient (but not well known) methods to just get the raw bits and assign
as the other type, without a cast conversion.
One can use Float.floatToIntBits (and Double.doubleToLongBits) to
extract the raw bits of a float or double and write them as an int or long. The
raw bits of an IEEE float also sort in the same order as the raw bits of the
same sized int (for non-negative values), which is useful.

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Float.html#floatToIntBits(float)
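
A self-contained sketch of the round trip (getting the resulting long into a
counter or a LongWritable is left to the job code):

public class RawBitsExample {
    public static void main(String[] args) {
        double residual = 0.0031415926;

        long asBits = Double.doubleToLongBits(residual);  // pack
        double back = Double.longBitsToDouble(asBits);    // unpack, exact

        System.out.println(back == residual);             // prints true

        // The same pattern works for float <-> int:
        int floatBits = Float.floatToIntBits(2.5f);
        System.out.println(Float.intBitsToFloat(floatBits)); // 2.5
    }
}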



Re: SequenceFile and streaming

2009-05-29 Thread Scott Carey
Well, I don't know much about the tar tool at all.  But bz2 is a VERY slow
compression scheme (though quite fascinating to read about how it works).  A
plain tar, or tar.gz will be faster if it is supported.
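
On the point below about just appending files and headers: a minimal sketch of
packing a local directory into a SequenceFile, with the file name as the key and
the raw bytes as the value (paths and error handling are illustrative only, and
it assumes files that fit in memory):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class FilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/archive.seq"),
                Text.class, BytesWritable.class);
        try {
            for (File f : new File(args[0]).listFiles()) {
                byte[] data = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    in.read(data);  // small files assumed; loop for large ones
                } finally {
                    in.close();
                }
                // One record per file: no directory hierarchy to worry about.
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}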


On 5/28/09 10:10 PM, walter steffe ste...@tiscali.it wrote:

 Hi Tom,
 
   I have seen the tar-to-seq tool, but the person who made it says it is
 very slow:
 It took about an hour and a half to convert a 615MB tar.bz2 file to an
  868MB sequence file. To me that is not acceptable.
  Normally, generating a tar file from 615MB of data takes less than
  one minute. And, in my view, generating a sequence file should be
  even simpler: you just have to append files and headers, without worrying
  about hierarchy.
 
 Regarding the SequenceFileAsTextInputFormat I am not sure it will do the
 job I am looking for.
 The hadoop documentation says: SequenceFileAsTextInputFormat generates
 SequenceFileAsTextRecordReader which converts the input keys and values
 to their String forms by calling toString() method.
  Let us suppose that the keys and values were generated using tar-to-seq
  on a tar archive. Each value is a byte array that stores the content of a
  file, which can be any kind of data (for example a jpeg picture). It
 doesn't make sense to convert this data into a string.
 
  What is needed is a tool to simply extract a file, as with
  tar -xf archive.tar filename. The hadoop framework can be used to
  extract it with a Java class, but you have to do that within a java program.
  The streaming package is meant to be used in a unix shell without the need
  for java programming. But I think it is not very useful if the
  sequence file (which is the principal data structure of hadoop) is not
  accessible from a shell command.
 
 
 Walter
 
 
 



Re: Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread Scott Carey
Yes and no.  Most OSs/filesystems will get file data to disk within 5
seconds if the files are small.  But if it is written, read, and deleted
quickly it may not ever hit disk.  Applications may request that data is
flushed to disk earlier.

In a Hadoop environment, smaller or medium sized files will most likely get
to disk, but be read from the page cache in RAM rather than from disk.

You can tune the OS to cache more in RAM, for longer, before flushing to
disk if you wish.  For linux look up /proc/sys/vm  (dirty_ratio,
dirty_background_ratio, and related).

On 5/19/09 7:19 AM, paula_ta paula...@yahoo.com wrote:

 
 
 Is it possible that some intermediate data produced by mappers and written to
 the local file system resides in memory in the file system cache and is
 never flushed to disk ?  Eventually reducers will retrieve this data via
 HTTP - possibly without the data ever being written to disk ?
 
 thanks
 Paula
 
 --
 View this message in context:
  http://www.nabble.com/Is-intermediate-data-produced-by-mappers-always-flushed-to-disk---tp23617347p23617347.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-11 Thread Scott Carey
Before Java 1.6, Jrockit was almost always faster than Sun, and often by a
lot (especially during the 1.4 days).  Now, it's much more use-case
dependent.  Some apps are faster on one than another, and vice-versa.

I have tested many other applications with both in the past (and IBM's VM on
AIX, and HP's VM on HP-UX), but not Hadoop.  I suppose it just may be a use
case that Sun has done a bit better.

The Jrockit settings that remain that may be of use are the TLA settings.
You can use Mission Control to do a memory profile and see if the average
object sizes are large enough to warrant increasing the thread local object
size thresholds.  That's the only major tuning knob I recall that I don't
see below.  If Hadoop is creating a lot of medium sized (~1000 byte to
32kbyte) objects, Jrockit isn't as well optimized for that by default.

You should consider sending the data to the Jrockit team.  They are
generally on the lookout for example use-cases where they do poorly relative
to Sun.  However, now that they are all under the Oracle-Larry-Umbrella it
wouldn't shock me if that changes.

On 5/7/09 6:34 PM, Grace syso...@gmail.com wrote:

 Thanks, all, for your replies.
 
 I have run several times with different Java options for Map/Reduce
 tasks. However, there is not much difference.
 
 Following is the example of my test setting:
 Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages
 -XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch
 Test B: -Xmx1024m
 Test C: -Xmx1024m -XXaggressive
 
 Is there any tricky or special setting for Jrockit vm on Hadoop?
 
 In the Hadoop Quick Start guide, it says Java(TM) 1.6.x, preferably
 from Sun. Is there any concern about a Jrockit performance issue?
 
 I'd highly appreciate for your time and consideration.
 
 
 On Fri, May 8, 2009 at 7:36 AM, JQ Hadoop jq.had...@gmail.com wrote:
 
 There are a lot of tuning knobs for the JRockit JVM when it comes to
 performance; that tuning can make a huge difference. I'm very interested
 if
 there are some tuning tips for Hadoop.
 
 Grace, what are the parameters that you used in your testing?
 
 Thanks,
 JQ
 
 On Thu, May 7, 2009 at 11:35 PM, Steve Loughran ste...@apache.org wrote:
 
 Chris Collins wrote:
 
  a couple of years back we did a lot of experimentation between sun's vm
  and JRockit.  We had initially assumed that JRockit was going to scream
  since that's what the press were saying.  In short, what we discovered was
  that certain jdk library usage was a little bit faster with JRockit, but for
  core vm performance such as synchronization and primitive operations the
  sun vm outperformed it.  We were not taking account of startup time, just
  raw code execution.  As I said, this was a couple of years back so things
  may have changed.
 
 C
 
 
  I run JRockit as it's what some of our key customers use, and we need to
 test things. One lovely feature is tests time out before the stack runs
 out
 on a recursive operation; clearly different stack management at work.
 Another: no PermGenHeapSpace to fiddle with.
 
  * I have to turn debug logging off in hadoop test runs, or there are
 problems.
 
 * It uses short pointers (32 bits long) for near memory on a 64 bit JVM.
 So
 your memory footprint on sub-4GB VM images is better. Java7 promises
 this,
 and with the merger, who knows what we will see. This is unimportant  on
 32-bit boxes
 
  * debug single stepping doesn't work. That's ok, I use functional tests
 instead :)
 
  I haven't looked at outright performance.
 
 /
 
 
 



Re: PIG and Hive

2009-05-07 Thread Scott Carey
The work was done 3 months ago, and the exact query I used may not have been 
the one below - it was functionally the same: two sources, with arithmetic 
aggregation on each, inner-joined by a small set of values.  We wrote a 
hand-coded map reduce, a Pig script, and a Hive query against the same data 
and performance tested them.

At that time, even SELECT count(a.z) FROM a group by a.z took 3 phases (not 
sure how many were fetch versus M/R).  Since then, we abandoned Hive for 
reassessment at a later date.  All releases of Hive since then 
http://hadoop.apache.org/hive/docs/r0.3.0/changes.html don't have anything 
under optimizations and few of the enhancements listed suggest that there has 
been much change on the performance front (yet).

Can Hive not yet detect an implicit inner join in a WHERE clause?

Our use case would have less optimization-savvy people querying data ad-hoc, so 
being able to detect implicit joins and collapse subselects, etc is a 
requirement.  I'm not going to go sitting over the shoulder of everyone who 
wants to do some ad-hoc data analysis and tell them how to re-write their 
queries to perform better.
That is a big weakness of SQL that affects everything that uses it - there are 
so many equivalent or near-equivalent forms of expression that often lead to 
implementation specific performance preferences.

I'm sure Hive will get over that hump but it takes time.  I'm certainly 
interested in it and will have a deeper look again in the second half of this 
year.

On 5/7/09 10:12 AM, Namit Jain nj...@facebook.com wrote:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

If you do an explain on the above query, you will see that you are performing a 
Cartesian product followed by the filter.

It would be better to rewrite the query as:


SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON( a.x = b.x and a.y = 
b.y)
group by a.x, a.y;

The explain should have 2 map-reduce jobs and a fetch task (which is not a 
map-reduce job).
Can you send me the exact Hive query that you are trying along with the schema 
of tables 'a' and 'b'.

In order to see the plan, you can do:

Explain
QUERY



Thanks,
-namit



-- Forwarded Message
From: Ricky Ho r...@adobe.com
Reply-To: core-user@hadoop.apache.org
Date: Wed, 6 May 2009 21:11:43 -0700
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks for Olga example and Scott's comment.

My goal is to pick a higher level parallel programming language (as an algorithm 
design / prototyping tool) to express my parallel algorithms in a concise way.  
The deeper I look into these, the stronger my feeling that PIG and HIVE are 
competitors rather than complements of each other.  I think a large set of 
problems can be done in either way, without much difference in terms of 
skillset requirements.

At this moment, I am focused on the richness of the language model rather than 
the implementation optimization.  Supporting collections as well as the 
flatten operation in the language model seems to make PIG more powerful.  Yes, 
you can achieve the same thing in Hive but then it starts to look odd.  Am I 
missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many
fewer map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and
in the future they won't differ much.  But for now, Pig optimizes much
better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5
map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to
make that one pass, but those sorts of performance optimizations aren't there
yet.  That is expected, it is a younger project.

It would be useful if more of these higher level tools shared work on the
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?)
could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, Olga Natkovich ol...@yahoo-inc.com wrote:

 Hi Ricky,

 This is how the code will look in Pig.

 A = load 'textdoc' using TextLoader() as (sentence: chararray);
 B = foreach A generate flatten(TOKENIZE(sentence)) as word;
 C = group B by word;
 D = foreach C generate group, COUNT(B);
 store D into 'wordcount';

 Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
 explains how the example above works.

 Let me know if you have further questions.

 Olga


 -Original Message-
 From: Ricky Ho [mailto:r...@adobe.com]
 Sent: Wednesday, May 06, 2009 3:56 PM
 To: core-user@hadoop.apache.org
 Subject

Re: Master crashed

2009-04-30 Thread Scott Carey

On 4/30/09 10:18 AM, Mayuran Yogarajah mayuran.yogara...@casalemedia.com
wrote:

 Alex Loddengaard wrote:
 I'm confused.  Why are you trying to stop things when you're bringing the
 name node back up?  Try running start-all.sh instead.
 
 Alex
 
  
 Won't that try to start the daemons on the slave nodes again? They're
 already running.
 

That doesn't matter, start-all.sh detects already running processes and does
not bring up duplicates. You can run it 100x in a row without a stop if you
wanted:

namenode running as process 12621. Stop it first.
datanode running as process 28540. Stop it first.
jobtracker running as process 12814. Stop it first.
tasktracker running as process 28763. Stop it first.



 M
 On Tue, Apr 28, 2009 at 4:00 PM, Mayuran Yogarajah 
 mayuran.yogara...@casalemedia.com wrote:
 
  
 The master in my cluster crashed, the dfs/mapred java processes are
 still running on the slaves.  What should I do next? I brought the master
 back up and ran stop-mapred.sh and stop-dfs.sh and it said this:
 
 slave1.test.com: no tasktracker to stop
 slave1.test.com: no datanode to stop
 
 Not sure what happened here, please advise.
 
 thanks,
 M
 

 
 



Re: Typical hardware configurations

2009-03-30 Thread Scott Carey
On 3/30/09 4:41 AM, Steve Loughran ste...@apache.org wrote:

 Ryan Rawson wrote:
 
 You should also be getting 64-bit systems and running a 64 bit distro on it
 and a jvm that has -d64 available.
 
 For the namenode yes. For the others, you will take a fairly big memory
 hit (1.5X object size) due to the longer pointers. JRockit has special
 compressed pointers, so will JDK 7, apparently.
 

Sun Java 6 update 14 has "Ordinary Object Pointer" compression as well:
-XX:+UseCompressedOops.  I've been testing out the pre-release of that with
great success.

Jrockit has virtually no 64 bit overhead up to 4GB, Sun Java 6u14 has small
overhead up to 32GB with the new compression scheme.  IBM's VM also has some
sort of pointer compression but I don't have experience with it myself.

http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/tag/compressed-oops/
 
With pointer compression, there may be gains to be had with running 64 bit
JVMs smaller than 4GB on x86 since then the runtime has access to native 64
bit integer operations and registers (as well as 2x the register count).  It
will be highly use-case dependent.



Re: Problem : data distribution is non uniform between two different disks on datanode.

2009-03-17 Thread Scott Carey

Are you stopping and starting data nodes often? Are your files small on
average?  What Hadoop version?

It looks like, on startup, the datanode chooses the first volume to use for
the first block it writes and round-robins from there.

Are you simply adding the extra disk and changing the config?  Or were both
mounts there from the start?  It should not fail until both are full either
way.


The only improvements I see in the trunk (inner class FSVolumeSet in
FSDataset.java) are:
* Initialize the current volume to a random index in the constructor rather
than the first one.
* Rather than choose by round-robin, weight the choice by free space
available.  This does not have to check all disks' free space each time; it
can remember the values of all volumes and only update the free space of the
current one under consideration during the check it currently does.  (A rough
sketch of this idea follows below.)
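
A rough standalone sketch of that weighted choice (this is not the actual
FSVolumeSet code, just an illustration of the idea):

import java.util.Random;

public class WeightedVolumeChooser {
    private final long[] freeSpace;   // cached free bytes per volume
    private final Random random = new Random();

    public WeightedVolumeChooser(int numVolumes) {
        this.freeSpace = new long[numVolumes];
    }

    /** Refresh the cached free space of one volume (e.g. the one just used). */
    public void updateFreeSpace(int volume, long bytesFree) {
        freeSpace[volume] = bytesFree;
    }

    /** Choose a volume index, with probability proportional to cached free space. */
    public int chooseVolume() {
        long total = 0;
        for (long f : freeSpace) {
            total += f;
        }
        if (total <= 0) {
            return random.nextInt(freeSpace.length);  // nothing cached yet
        }
        long target = (long) (random.nextDouble() * total);
        for (int i = 0; i < freeSpace.length; i++) {
            target -= freeSpace[i];
            if (target < 0) {
                return i;
            }
        }
        return freeSpace.length - 1;
    }
}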




On 3/16/09 5:19 AM, Vaibhav J vaibh...@rediff.co.in wrote:

 
 
 
 
   _ 
 
 From: Vaibhav J [mailto:vaibh...@rediff.co.in]
 Sent: Monday, March 16, 2009 5:46 PM
 To: 'nutch-...@lucene.apache.org'; 'nutch-u...@lucene.apache.org'
 Subject: Problem : data distribution is non uniform between two different
 disks on datanode.
 
  We have 27 datanodes and a replication factor of 1. (data size is ~6.75 TB)
 
  We have specified two different disks for the dfs data directory on each
  datanode by using the
  property dfs.data.dir in the hadoop-site.xml file of the conf directory.
 
 (value of property dfs.data.dir : /mnt/hadoop-dfs/data,
 /mnt2/hadoop-dfs/data)
 
 
 
  When we set the replication factor to 2, the data distribution is biased
  toward the first disk: more data is copied onto /mnt/hadoop-dfs/data, and
  after copying some data the first disk becomes full and shows no available
  space, while we still have enough space on the second disk
  (/mnt2/hadoop-dfs/data).
  
  So it is difficult to achieve replication factor 2.
 
 
 
  Data traffic is going to the second disk as well (/mnt2/hadoop-dfs/data),
  but it looks like more data is copied onto the first disk
  (/mnt/hadoop-dfs/data).
 
 
 
 
 
  What should we do to get uniform data distribution between the two different
  disks on each datanode, to achieve replication factor 2?
 
 
 
 
 
 Regards
 
 Vaibhav J.
 
 



Re: tuning performance

2009-03-16 Thread Scott Carey
Yes, I am referring to HDFS taking multiple mount points and automatically 
round-robining block allocation across them.
A single file block will only exist on a single disk, but the extra speed you 
can get with raid-0 within a block can't be used effectively by almost any 
mapper or reducer anyway.  Perhaps an identity mapper can read faster than a 
single disk - but certainly not if the content is compressed.

RAID-0 may be more useful for local temp space.

In effect, you can say that HDFS data nodes already do RAID-0, but with a very 
large block size, and where failure of a disk reduces the redundancy minimally 
and temporarily.

For reference, today's Intel / AMD CPUs can usually decompress a gzip stream at 
less than 30MB/sec of compressed input (50MB to 100MB of uncompressed data 
output per sec).


On 3/14/09 1:53 AM, Vadim Zaliva kroko...@gmail.com wrote:

Scott,

Thanks for interesting information. By JBOD, I assume you mean just listing
multiple partition mount points in hadoop config?

Vadim

On Fri, Mar 13, 2009 at 12:48, Scott Carey sc...@richrelevance.com wrote:
 On 3/13/09 11:56 AM, Allen Wittenauer a...@yahoo-inc.com wrote:

 On 3/13/09 11:25 AM, Vadim Zaliva kroko...@gmail.com wrote:

When you stripe you automatically make every disk in the system have the
 same speed as the slowest disk.  In our experience, systems are more likely
 to have a 'slow' disk than a dead one and detecting that is really
 really hard.  In a distributed system, that multiplier effect can have
 significant consequences on the whole grids performance.

 All disk are the same, so there is no speed difference.

There will be when they start to fail. :)



 This has been discussed before:
 http://www.nabble.com/RAID-vs.-JBOD-td21404366.html

 JBOD is going to be better, the only benefit of RAID-0 is slightly easier 
 management in hadoop config, but harder to manage at the OS level.
 When a single JBOD drive dies, you only lose that set of data.  The datanode 
 goes down but a restart brings back up the parts that still exist.  Then you 
 can leave it be while the replacement is procured... With RAID-0 the whole 
 node is down until you get the new drive and recreate the RAID.

 With JBOD, don't forget to set the linux readahead for the drives to a decent 
 level  (you'll gain up to 25% more sequential read throughput depending on 
 your kernel version).  (blockdev --setra 8192 /dev/device).  I also see good 
 gains by using xfs instead of ext3.  For a big shocker check out the 
 difference in time to delete a bunch of large files with ext3 (long time) 
 versus xfs (almost instant).

 For the newer drives, they can do about 120MB/sec at the front of the drive 
 when tuned (xfs, readahead 4096) and the back of the drive is 60MB/sec.  If 
 you are not going to use 100% of the drive for HDFS, use this knowledge and 
 place the partitions appropriately.  The last 20% or so of the drive is a lot 
 slower than the front 60%.  Here is a typical sequential transfer rate chart 
 for a SATA drive as a function of LBA:
 http://www.tomshardware.com/reviews/Seagate-Barracuda-1.5-TB,2032-5.html
  (graphs are about 3/4 of the way down the page, before the comments).




Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Scott Carey
That is a fascinating question.  I would also love to know the reason behind 
this.

If I were to guess I would have thought that smaller keys and heavier values 
would slightly outperform, rather than significantly underperform.  (assuming 
total pair count at each phase is the same).   Perhaps there is room for 
optimization here?



On 3/10/09 6:44 PM, Gyanit gya...@gmail.com wrote:



I have a large number of key,value pairs. I don't actually care if the data
goes in the value or the key. Let me be more exact.
The (k,v) pair count after the combiner is about 1 million. I have approx 1kb
of data for each pair. I can put it in keys or values.
I have experimented with both options, (heavy key, light value) vs (light
key, heavy value). It turns out that the (hk,lv) option is much, much better
than (lk,hv).
Has someone else also noticed this?
Is there a way to make things faster in the light key, heavy value option, as
some applications will need that as well?
Remember, in both cases we are talking about at least a dozen or so million
pairs.
There is a difference in the time spent in the shuffle phase, which is weird,
as the amount of data transferred is the same.

-gyanit
--
View this message in context: 
http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22447877.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




RE: MapReduce jobs with expensive initialization

2009-03-01 Thread Scott Carey
You could create a singleton class and reference the dictionary stuff in that.  
You would probably want this separate from other classes, so as to control exactly 
what data is held on to for a long time and what is not.

class Singleton {

    // Initialized once per classloader (or JVM), when the class is first loaded.
    private static final Singleton _instance = new Singleton();

    private Singleton() {
        // ... initialize (load dictionaries) here; only ever called once per
        // classloader or JVM
    }

    public static Singleton getSingleton() {
        return _instance;
    }
}

in mapper:

Singleton dictionary = Singleton.getSingleton();
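
A rough sketch of the mapper side against the old MapReduceBase API (the
dictionary type and the input/output types are placeholders):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictionaryMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Singleton dictionary;

    @Override
    public void configure(JobConf job) {
        // The first task in this JVM pays the initialization cost; later
        // tasks reuse the already-loaded dictionaries (given JVM reuse).
        dictionary = Singleton.getSingleton();
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... look things up in dictionary and emit results ...
    }
}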

This assumes that each mapper doesn't live in its own classloader space (which 
would make even static singletons not shareable), and has the drawback that 
once initialized, that memory associated with the singleton won't go away until 
the JVM or classloader that hosts it dies. 

I have not tried this myself, and do not know the exact classloader semantics 
used in the new 'persistent' task JVMs.  They could have a classloader per job, 
and dispose of those when the job is complete -- though then it is impossible 
to persist data across jobs but only within them.  Or there could be one 
permanent persisted classloader, or one per task.   All will behave differently 
with respect to statics like the above example.


From: Stuart White [stuart.whi...@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm(-1) to
force mapreduce to reuse JVMs when executing tasks.

How I've been performing my initialization is, in my mapper, I
override MapReduceBase#configure, read my parms from the JobConf, and
load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my main class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?

Thanks!