Text file character encoding

2008-06-02 Thread NOMURA Yoshihide
Hello,
I'm using Hadoop 0.17.0 to analyze a large number of CSV files.

I need to read these files in character encodings other than UTF-8,
but I don't think TextInputFormat supports other encodings.

I think the LineRecordReader class or the Text class should support an encoding
setting, something like this:
 conf.set("io.file.defaultEncoding", "MS932");

Are there any plans to support different character encodings in
TextInputFormat?
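
In the meantime, my workaround idea is to let TextInputFormat keep splitting
lines as raw bytes and decode them in the mapper. A rough sketch, assuming the
encoding keeps '\n' as the line delimiter (as MS932 does); the property name
above is only my proposal, not an existing one:

  import java.nio.ByteBuffer;
  import java.nio.charset.Charset;
  import org.apache.hadoop.io.Text;

  public class EncodedLineDecoder {
    private final Charset charset;

    public EncodedLineDecoder(String encoding) {
      this.charset = Charset.forName(encoding);  // e.g. "MS932"
    }

    /** Interpret the raw bytes held by a Text value in the given charset. */
    public String decode(Text line) {
      return charset.decode(
          ByteBuffer.wrap(line.getBytes(), 0, line.getLength())).toString();
    }
  }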

Regards,
-- 
NOMURA Yoshihide:
Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
Tel: 044-754-2675 (Ext: 7112-6358)
Fax: 044-754-2570 (Ext: 7112-3834)
E-Mail: [EMAIL PROTECTED]



Using hadoop to store large backups

2008-06-02 Thread Greg Connor
I'm starting to use Hadoop as a simple storage pool to store backups of large 
things (currently Oracle database backups).  My Hadoop usage is at a pretty 
primitive level so far and I am really only scratching the surface of what it 
can do.  I haven't used map/reduce at all--so far it's just been big file goes 
in, big file comes out.

My first concern is to increase the throughput so that I can put files in 
faster.  Right now I'm doing the equivalent of
  tar -cf - /volume1/data | hadoop fs -put   - /backups/vol1.tar
and I've got 4 of those running in parallel.  Throughput is acceptable, but not 
great... I am able to utilize about 0.5Gbits of my 10G network connection.  
Here's what I've observed so far...
* hadoop fs -put processes have a large virtual size (1000m) but only use a 
small amount of resident RAM (around 50M).  This is smaller than the 
configured block size (64m).
* The tar processes also have a larger virtual size (50m) than their resident 
size (800k).  Not sure what that means... maybe I could force tar's output and 
hadoop's input to use larger buffers? (see the sketch below)
* I don't think CPU is constrained... there are 4 cpus and they are 60% idle
* Network is not really breaking a sweat, I have a 10G connection on the sender 
and 16 hadoop nodes have 1G connections for receiving.
* I don't think disk IO is limiting me either; a quick test piping cat 
/dev/zero into hadoop fs -put instead of tar gives about the same throughput

I haven't changed much over the fresh-install defaults so I'm hoping some 
tuning will help boost the throughput.  I'll probably be trying various things 
but some pointers on where to look first would be welcome.
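
On the buffer question above: one knob I'm guessing might matter (untested on 
my end, so treat this as a sketch, not a measured fix) is io.file.buffer.size, 
which defaults to 4KB.  It can go in hadoop-site.xml or in a client-side 
Configuration, something like:

  import org.apache.hadoop.conf.Configuration;

  public class PutBufferTuning {
    public static Configuration biggerCopyBuffer() {
      Configuration conf = new Configuration();
      // Bump the I/O buffer used during file copies from the 4 KB default.
      // 128 KB is an arbitrary starting point, not a recommendation.
      conf.setInt("io.file.buffer.size", 128 * 1024);
      return conf;
    }
  }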

As I get further down this road, I can imagine I'd like to do some cooler 
things with my cluster than just tar in, tar out.  Eventually I would like to 
do things like:
* dump uncompressed data in and have hadoop compress the stream later using its 
nodes, maybe depending on age
* record checksum/md5/sha1 on the original files, then later ask map/reduce to 
give me a file list with checksums from the saved file
* regularly read the files back again to make sure the checksums are still 
matching
If I'm understanding correctly, these tasks are probably difficult/impossible 
using tar and gzip but would probably be quite easy using a 
padded/splittable archive format (e.g. zip?) and chunkable bzip compression.  
Has this been done before, and/or is anyone working on something similar?
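
For the checksum piece, this is roughly the kind of map-only job I have in
mind, assuming the backups were stored as a SequenceFile of (file name, file
bytes) pairs; the class and layout here are just my guesses, not working code
from anywhere:

  import java.io.IOException;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Emit (file name, hex MD5 of the stored bytes) for each record.
  public class Md5Mapper extends MapReduceBase
      implements Mapper<Text, BytesWritable, Text, Text> {

    public void map(Text name, BytesWritable data,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      try {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Only the first getSize() bytes of the backing array are valid.
        md5.update(data.get(), 0, data.getSize());
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
          hex.append(String.format("%02x", b));
        }
        out.collect(name, new Text(hex.toString()));
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("MD5 not available: " + e);
      }
    }
  }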

Right now I'm only using tar because the directory I'm reading from has 
symlinks in it--otherwise I would probably just crawl the directory and put one 
file in hadoop for each file on the source.  This would probably take longer 
to write, though maybe not much.  Maybe I'll switch later to putting files in 
as-is (along with a tarfile of symlinks, device special files, or whatever else 
tar handles and hadoop doesn't).  At least then I can do stuff like compress 
and checksum in distributed fashion.  In this case the files aren't small, but 
I might have another application later to store backups of JPG files, and in 
that case I would definitely need some sort of many-to-one step, sequencing 
many input files into one output file (see the sketch below).
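
That many-to-one step is basically packing files into a SequenceFile of
(name, bytes) records. A sketch of what I mean (paths are placeholders and I
haven't run this; it's only to show the shape of the idea):

  import java.io.DataInputStream;
  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Pack many local files into one SequenceFile of (file name, file bytes).
  public class PackFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path("/backups/jpgs.seq"),
          Text.class, BytesWritable.class);
      try {
        for (File f : new File("/volume1/jpgs").listFiles()) {
          if (!f.isFile()) {
            continue;  // skip directories; symlinks etc. handled elsewhere
          }
          byte[] bytes = new byte[(int) f.length()];
          DataInputStream in = new DataInputStream(new FileInputStream(f));
          try {
            in.readFully(bytes);
          } finally {
            in.close();
          }
          writer.append(new Text(f.getPath()), new BytesWritable(bytes));
        }
      } finally {
        writer.close();
      }
    }
  }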

Though this second part of the message is less important right now, I'd be 
interested in any comments/feedback here, so I can start to read up and get an 
idea of what's ahead of me as I head down this path.

Thanks
gregc


MetricsIntValue/MetricsLongValue publish once

2008-06-02 Thread Ion Badita

Hi,

In the javadoc for MetricsIntValue and MetricsLongValue it is written: Each 
time its value is set, it is published only *once* at the next update 
call. Looking at those classes, it is true that they push the data into 
the MetricsRecord only once, but digging deeper into 
AbstractMetricsContext, the MetricsRecord data is published on every 
update, because it is not cleared once it is published (I think it should 
be cleared, because a MetricsRecord only records instructions for changing 
the bufferedData in the MetricsContext; maybe I'm wrong, but that's what 
the code does).


Ion


DataNode often self-stopped

2008-06-02 Thread smallufo
Hi

I am simulating a 4-DataNode environment using VMWare.
I found that some data nodes often stop themselves after receiving a large file
(or block).
In fact, not that large; it is just smaller than 10MB.

These are the error messages:

2008-05-27 16:40:54,727 INFO org.apache.hadoop.dfs.DataNode: Received block
blk_3604066788791074317 of size 16777216 from /192.168.10.4
2008-05-27 16:40:54,727 INFO org.apache.hadoop.dfs.DataNode: PacketResponder
0 for block blk_3604066788791074317 terminating
2008-05-27 16:40:54,743 WARN org.apache.hadoop.dfs.DataNode: DataNode is
shutting down: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node
192.168.10.7:50010 is attempting to report storage ID
DS-1812686469-192.168.10.5-50010-1211793342121. Node 192.168.10.6:50010 is
expected to serve this storage.
at
org.apache.hadoop.dfs.FSNamesystem.getDatanode(FSNamesystem.java:3594)
at
org.apache.hadoop.dfs.FSNamesystem.blockReceived(FSNamesystem.java:3102)
at org.apache.hadoop.dfs.NameNode.blockReceived(NameNode.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy4.blockReceived(Unknown Source)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:652)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2667)
at java.lang.Thread.run(Thread.java:619)

2008-05-27 16:40:54,745 INFO org.mortbay.util.ThreadedServer: Stopping
Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50075]
2008-05-27 16:40:54,753 INFO org.mortbay.http.SocketListener: Stopped
SocketListener on 0.0.0.0:50075
2008-05-27 16:40:54,914 INFO org.mortbay.util.Container: Stopped
HttpContext[/static,/static]
2008-05-27 16:40:54,950 INFO org.mortbay.util.Container: Stopped
HttpContext[/logs,/logs]
2008-05-27 16:40:54,951 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-05-27 16:40:55,044 INFO org.mortbay.util.Container: Stopped
WebApplicationContext[/,/]
2008-05-27 16:40:55,044 INFO org.mortbay.util.Container: Stopped
[EMAIL PROTECTED]
2008-05-27 16:40:55,044 INFO org.apache.hadoop.dfs.DataNode: Waiting for
threa
2008-05-27 16:40:55,046 INFO org.apache.hadoop.dfs.DataNode:
192.168.10.7:5001
2008-05-27 16:40:55,284 INFO org.apache.hadoop.dfs.DataBlockScanner: Exiting
D
2008-05-27 16:40:56,047 INFO org.apache.hadoop.dfs.DataNode: Waiting for
threa
2008-05-27 16:40:56,050 INFO org.apache.hadoop.dfs.DataNode:
192.168.10.7:5001
2008-05-27 16:40:56,113 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:


What is the problem?


Re: Realtime Map Reduce = Supercomputing for the Masses?

2008-06-02 Thread Steve Loughran

Christophe Taton wrote:

Actually Hadoop could be made more friendly to such realtime Map/Reduce
jobs.
For instance, we could consider running all tasks inside the task tracker
jvm as separate threads, which could be implemented as another personality
of the TaskRunner.
I have been looking into this a couple of weeks ago...
Would you be interested in such a feature?


Why does that have benefits? So that you can share stuff via local data 
structures? Because you'd better be sharing classloaders if you are 
going to play that game. And that is very hard to get right (to the 
extent that I don't think any Apache project other than Felix does it well).


Matrix Multiplication Problem

2008-06-02 Thread Hadoop

I downloaded the Matrix Multiplication code from:
http://code.google.com/p/hama/source/browse/trunk/src/java/org/apache/hama/
but I do not know how to run it the right way.

Could you please give steps how to run the code?

-- 
View this message in context: 
http://www.nabble.com/Matrix-Multiplication-Problem-tp17599862p17599862.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Realtime Map Reduce = Supercomputing for the Masses?

2008-06-02 Thread Christophe Taton
Hi Steve,

On Mon, Jun 2, 2008 at 12:23 PM, Steve Loughran [EMAIL PROTECTED] wrote:

 Christophe Taton wrote:

 Actually Hadoop could be made more friendly to such realtime Map/Reduce
 jobs.
 For instance, we could consider running all tasks inside the task tracker
 jvm as separate threads, which could be implemented as another personality
 of the TaskRunner.
 I have been looking into this a couple of weeks ago...
 Would you be interested in such a feature?


 Why does that have benefits? So that you can share stuff via local data
 structures? Because you'd better be sharing classloaders if you are going to
 play that game. And that is very hard to get right (to the extent that I
 don't think any Apache project other than Felix does it well)


The most obvious improvement to my mind concerns the memory footprint of the
infrastructure. Running jobs leads to at least 3 JVMs per machine (the data
node, the task tracker and the task), even if you forget parallelism and accept
running only one task per node at a time. This is problematic if you have
machines with low memory capacities.

That said, I agree with your concerns about classloading.
I have actually been thinking that we might try to rely on OSGi to do the
job, and package the Hadoop daemons, jobs and tasks as OSGi bundles and
services; but I faced many tricky issues in doing that (the last one being
the resolution of configuration files by the classloaders).
To my mind, one short-term and minimal way of achieving this would be to use
a URLClassLoader in conjunction with the hdfs URLStreamHandler, to let the
task tracker run tasks directly...
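
Something along these lines is what I am picturing (only a sketch: the jar
path and class name are made up, and I am assuming the hdfs stream handler
factory in org.apache.hadoop.fs, if I have the class name right):

  import java.net.URL;
  import java.net.URLClassLoader;
  import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

  public class HdfsTaskClassLoading {
    static {
      // Teach java.net.URL about hdfs:// URLs. This can only be done once
      // per JVM, so it would live in the task tracker's startup code.
      URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static Class<?> loadTaskClass() throws Exception {
      URLClassLoader taskLoader = new URLClassLoader(
          new URL[] { new URL("hdfs://namenode:9000/jobs/job_0001/job.jar") },
          HdfsTaskClassLoading.class.getClassLoader());
      // The classloader (and every class it loaded) must be dropped when
      // the task finishes, which is exactly the hard part discussed here.
      return taskLoader.loadClass("org.example.MyMapper");
    }
  }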

Cheers,
Christophe T.


Re: Realtime Map Reduce = Supercomputing for the Masses?

2008-06-02 Thread Alejandro Abdelnur
Yes you would have to do it with classloaders (not 'hello world' but not
'rocket science' either).

You'll be limited on using native libraries, even if you use classloaders
properly as native libs can be loaded only once.

You will have to ensure you get rid of the task classloader once the task is
over (thus removing all singleton stuff that may be in it).

You will have to put in place a security manager for the code running out of
the task classloader.

You'll end up doing something similar to the servlet containers' webapp
classloading model, with the extra burden of hot-loading for each task run.
In the end this may have an overhead similar to bootstrapping a JVM for the
task; it should be measured to see what the time delta is and whether it is
worth the effort.

A


On Mon, Jun 2, 2008 at 3:53 PM, Steve Loughran [EMAIL PROTECTED] wrote:

 Christophe Taton wrote:

 Actually Hadoop could be made more friendly to such realtime Map/Reduce
 jobs.
 For instance, we could consider running all tasks inside the task tracker
 jvm as separate threads, which could be implemented as another personality
 of the TaskRunner.
 I have been looking into this a couple of weeks ago...
 Would you be interested in such a feature?


 Why does that have benefits? So that you can share stuff via local data
 structures? Because you'd better be sharing classloaders if you are going to
 play that game. And that is very hard to get right (to the extent that I
 don't think any Apache project other than Felix does it well)



Re: About Metrics update

2008-06-02 Thread lohit
In MetricsIntValue, incrMetric() was being called from pushMetric() instead of 
setMetric(). This used to cause the values to keep getting incremented on every 
periodic update.
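
Roughly, the fixed pushMetric looks like this (from memory, so treat it as a 
sketch rather than the exact committed code):

  public synchronized void pushMetric(final MetricsRecord mr) {
    if (changed) {
      // Publish the absolute value instead of re-adding it on every push.
      mr.setMetric(name, value);
    }
    changed = false;
  }
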
Thanks,
Lohit

- Original Message 
From: Ion Badita [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Saturday, May 31, 2008 4:11:00 AM
Subject: Re: About Metrics update

Hi,

I'm using version 0.17.
Do you know how it was fixed?

Thanks
Ion


lohit wrote:
 Hi Ion,

 Which version of Hadoop are you using? The problem you reported, about 
 safeModeTime and fsImageLoadTime growing over time, was fixed in 0.18 (or trunk).

 Thanks,
 Lohit

 - Original Message 
 From: Ion Badita [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Friday, May 30, 2008 8:10:52 AM
 Subject: Re: About Metrics update

 Hi,

 I found (because of the Metrics behavior reported in the previous 
 e-mail) some errors in the metrics reported by NameNodeMetrics: 
 safeModeTime and fsImageLoadTime keep growing (they should stay the same 
 over time).  These metrics use MetricsIntValue for their values; in 
 MetricsIntValue.pushMetric(), if the changed field is true, the value is 
 published into the MetricsRecord, otherwise the method does nothing.

 public synchronized void pushMetric(final MetricsRecord mr) {
   if (changed)
     mr.incrMetric(name, value);
   changed = false;
 }

 The problem is in the AbstractMetricsContext.update() method, because the 
 metricUpdates are not cleared after being merged into the record's 
 internal data.

 Ion



 Ion Badita wrote:
  
 Hi,

 I looked over the class 
 org.apache.hadoop.metrics.spi.AbstractMetricsContext and I have a 
 question:
 why, in update(MetricsRecordImpl record), is the metricUpdates map not 
 cleared after the updates are merged into metricMap?  Because of this, on 
 every update() the old increments are merged into metricMap again.  Is this 
 the right behavior?
 If I want to increment only one metric in the record, the current 
 implementation makes that impossible without also modifying other metrics 
 that are incremented rarely.


 Thanks
 Ion



Re: Realtime Map Reduce = Supercomputing for the Masses?

2008-06-02 Thread Steve Loughran

Alejandro Abdelnur wrote:

Yes you would have to do it with classloaders (not 'hello world' but not
'rocket science' either).


That's where we differ.

I do actually think that classloaders are incredibly hard to get right, 
and I say that as someone who has single-stepped through the Axis2 code 
in terror, and helped soak-test Ant so that it doesn't leak even over 
extended builds, but even there we draw the line at saying this lets 
you run forever. In fact, I think Apache should only allow people to 
write classloader code who have passed some special malicious 
classloader competence test devised by all the teams.



You'll be limited on using native libraries, even if you use classloaders
properly as native libs can be loaded only once.


Plus there's that mess that is called endorsed libraries, and you have 
to worry about native library leakage.



You will have to ensure you get rid of the task classloader once the task is
over (thus removing all singleton stuff that may be in it).


Which you can only do by getting rid of every single instance of every 
single class... very, very hard to do.




You will have to put in place a security manager for the code running out
the task classloader.

You'll end up doing something similar to the servlet containers' webapp
classloading model, with the extra burden of hot-loading for each task run.
In the end this may have an overhead similar to bootstrapping a JVM for the
task; it should be measured to see what the time delta is and whether it is
worth the effort.



It really comes down to start time and total memory footprint. If you 
can afford the startup delay and the memory, then separate processes 
give you best isolation and robustness.


FWIW, in SmartFrog, we let you deploy components (as of last week, 
hadoop server components) in their own processes, by declaring the 
process name to use. These are pooled; you can deploy lots of things 
into a single process; once they are all terminated that process halts 
itself and all is well...there's a single root process on every box that 
can take others. Bringing up a nearly-empty child process is good for 
waiting for a faster deployment of other stuff. One issue here is always 
JVM options: should child processes have different parameters (like max 
heap size) from the root process?


For long lived apps, deploying things into child processes is the most 
robust; it keeps the root process lighter. Deploying into the root 
process is better for debugging (you just start that one process), and 
for simplifying liveness. When the root process dies, everything inside 
is guaranteed 100% dead. But the child processes have to wait to 
discover they have lost any links to a parent on that process (if they 
care about such things) and start timing out themselves.


-steve


Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel

if you use the new scripts in 0.17.0, just run

 hadoop-ec2 proxy cluster-name

this starts an SSH tunnel to your cluster.

Installing FoxyProxy in Firefox gives you visibility into the whole cluster.

Obviously this isn't the best solution if you need to let many semi-trusted 
users browse your cluster.


On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote:


Hi!

I just wondered: what do other people use to access the Hadoop web servers
when running on EC2?

Ideas that I had:
1.) Opening ports 50030 and so on = not good, data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain HTTP.

2.) Some kind of tunneling solution. The problem here is that
each of my cluster nodes is in a different subnet, plus there is the dualism
between the internal and external addresses of the nodes.

Any hints? TIA,

Andreas


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: hadoop on EC2

2008-06-02 Thread Chris K Wensel


obviously this isn't the best solution if you need to let many semi-trusted 
users browse your cluster.



Actually, it would be much more secure if the tunnel service ran on a  
trusted server letting your users connect remotely via SOCKS and then  
browse the cluster. These users wouldn't need any AWS keys etc.



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






RE: Stack Overflow When Running Job

2008-06-02 Thread Devaraj Das
Hi, do you have a testcase that we can run to reproduce this? Thanks!

 -Original Message-
 From: jkupferman [mailto:[EMAIL PROTECTED] 
 Sent: Monday, June 02, 2008 9:22 AM
 To: core-user@hadoop.apache.org
 Subject: Stack Overflow When Running Job
 
 
 Hi everyone,
 I have a job running that keeps failing with Stack Overflows 
 and I really don't see how that is happening.
 The job runs for about 20-30 minutes before one task errors, 
 then a few more error out and it fails.
 I am running hadoop-0.17 and I've tried lowering these settings 
 to no avail:
 io.sort.factor = 50
 io.seqfile.sorter.recordlimit = 50
 
 java.io.IOException: Spill failed
   at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
 MapTask.java:594)
   at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(
 MapTask.java:576)
   at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
   at Group.write(Group.java:68)
   at GroupPair.write(GroupPair.java:67)
   at
 org.apache.hadoop.io.serializer.WritableSerialization$Writable
Serializer.serialize(WritableSerialization.java:90)
   at
 org.apache.hadoop.io.serializer.WritableSerialization$Writable
Serializer.serialize(WritableSerialization.java:77)
   at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTa
 sk.java:434)
   at MyMapper.map(MyMapper.java:27)
   at MyMapper.map(MyMapper.java:10)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
 Caused by: java.lang.StackOverflowError
   at java.io.DataInputStream.readInt(DataInputStream.java:370)
   at Group.readFields(Group.java:62)
   at GroupPair.readFields(GroupPair.java:60)
   at
 org.apache.hadoop.io.WritableComparator.compare(WritableCompar
 ator.java:91)
   at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTa
 sk.java:494)
   at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
 the above line repeated 200x
 
 I defined a WritableComparable called GroupPair which simply 
 holds two Group objects, each of which contains two integers. 
 I fail to see how QuickSort could recurse 200+ times, since 
 that would require an insanely large number of entries, far 
 more than the 500 million that had been output at that point. 
 
 How is this even possible? And what can be done to fix this?
 --
 View this message in context: 
 http://www.nabble.com/Stack-Overflow-When-Running-Job-tp175935
 94p17593594.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: Hadoop installation folders in multiple nodes

2008-06-02 Thread Michael Di Domenico
Depending on your windows version, there is a dos command called subst
which you could use to virtualize a drive letter on your third machine

On Fri, May 30, 2008 at 4:35 AM, Sridhar Raman [EMAIL PROTECTED]
wrote:

 Should the installation paths be the same on all the nodes?  Most
 documentation seems to suggest that it is _*recommended*_ to have the
 _*same*_ paths on all the nodes.  But what is the workaround if, for some
 reason, one isn't able to have the same path?

 That's the problem we are facing right now.  After making Hadoop work
 perfectly in a 2-node cluster, when we tried to accommodate a 3rd machine,
 we realised that this machine doesn't have an E: drive, which is where the
 hadoop installation is on the other 2 nodes.  All our machines are
 Windows machines.  The possible solutions are:
 1) Move the installations on M1 and M2 to a drive that is present in M3.  We
 will keep this as the last option.
 2) Map a folder on M3's D: to E:.  We used the subst command to do this.
 But when we tried to start DFS, it wasn't able to find the hadoop
 installation.  Just to verify, we tried an ssh to localhost and were
 unable to find the mapped drive.  It's only visible as a folder of D:,
 whereas in the basic cygwin prompt we are able to view E:.
 3) Partition M3's D: drive and create an E:.  This carries the risk of
 data loss.

 So, what should we do?  Is there any way we can specify in the NameNode the
 installation paths of hadoop in each of the remaining nodes?  Or is there
 some environment variable that can be set, which can make the hadoop
 installation path specific to each machine?

 Thanks,
 Sridhar



Re: Hadoop installation folders in multiple nodes

2008-06-02 Thread Michael Di Domenico
Oops, missed the part where you already tried that.

On Mon, Jun 2, 2008 at 3:23 PM, Michael Di Domenico [EMAIL PROTECTED]
wrote:

 Depending on your windows version, there is a dos command called subst
 which you could use to virtualize a drive letter on your third machine


 On Fri, May 30, 2008 at 4:35 AM, Sridhar Raman [EMAIL PROTECTED]
 wrote:

 Should the installation paths be the same in all the nodes?  Most
 documentation seems to suggest that it is _*recommended*_ to have the
 _*same
 *_ paths in all the nodes.  But what is the workaround, if, because of
 some
 reason, one isn't able to have the same path?

 That's the problem we are facing right now.  After making Hadoop work
 perfectly in a 2-node cluster, when we tried to accommodate a 3rd machine,
 we realised that this machine doesn't have a E:, which is where the
 installation of hadoop is in the other 2 nodes.  All our machines are
 Windows machines.  The possible solutions are:
 1) Move the installations in M1  M2 to a drive that is present in M3.  We
 will keep this as the last option.
 2) Map a folder in M3's D: to E:.  We used the subst command to do this.
 But when we tried to start DFS, it wasn't able to find the hadoop
 installation.  Just to verify, we tried a ssh to the localhost, and were
 unable to find the mapped drive.  It's only visible as a folder of D:.
 Whereas, in the basic cygwin prompt, we are able to view E:.
 3) Partition M3's D drive and create an E.  This carries the risk of loss
 of
 data.

 So, what should we do?  Is there any way we can specify in the NameNode
 the
 installation paths of hadoop in each of the remaining nodes?  Or is there
 some environment variable that can be set, which can make the hadoop
 installation path specific to each machine?

 Thanks,
 Sridhar





Re: DataNode often self-stopped

2008-06-02 Thread smallufo
2008/6/3 Konstantin Shvachko [EMAIL PROTECTED]:

 Is it possible that your different data-nodes point to the same storage
 directory on
 the hard drive? If so one of the data-nodes will be shut down.


 In general this is impossible because storage directories are locked once
 one of the nodes
 claims them under its authority. But I don't know whether this works in a
 VMWare environment.


No, they are in different storage directories.

Could it be a network problem?
The VMWare-simulated network sometimes times out on pings between nodes because
of the host server's high load.
Could the network being temporarily unavailable (for some seconds) cause the
data nodes to shut themselves down?


create less than 10G data/host with RandomWrite

2008-06-02 Thread Richard Zhang
Hello Hadoopers:
I am running the RandomWrite on a 8 nodes cluster. Because the default
setting is creating 1G/mapper, 10mappers/host. Considering replications, it
is essentially creating 30G/host. Because each node in the cluster has at
most 30G. So my cluster is full and can not execute further command. I
create a new application configuration file specifying 1G/mapper. But it
seems it is still creating 30G data and still running out of each node's
disk. Is this the right way to generate the less than 10G data with
RandomWriter? Below is the command and application configuration file I
used.

 bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter rand  -conf
randConfig.xml

Below is the application configuration file I am using:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>test.randomwriter.maps_per_host</name>
  <value>10</value>
  <description></description>
</property>

<property>
  <name>test.randomwrite.bytes_per_map</name>
  <value>103741824</value>
  <description></description>
</property>

<property>
  <name>test.randomwrite.min_key</name>
  <value>10</value>
  <description></description>
</property>

<property>
  <name>test.randomwrite.max_key</name>
  <value>1000</value>
  <description></description>
</property>

<property>
  <name>test.randomwrite.min_value</name>
  <value>0</value>
  <description>Default block replication.</description>
</property>

<property>
  <name>test.randomwrite.max_value</name>
  <value>2</value>
  <description>Default block replication.</description>
</property>
</configuration>