Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Taeho Kang
Yes, the data will be kept on the machine from which you issue the dfs -put
command, if that machine has a datanode running. Otherwise, a random datanode
will be chosen to store the data blocks.
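A quick way to check where the blocks of a given file actually landed is fsck
with block locations (the path below is just a placeholder):

    bin/hadoop fsck /user/someuser/somefile -files -blocks -locations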


On Fri, Jun 19, 2009 at 10:41 AM, Rajeev Gupta graj...@in.ibm.com wrote:

 If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets.
 Does that mean that if the replication factor is 1, the whole file will be kept on
 one node only?

 Thanks and regards.
 -Rajeev Gupta




 From: Aaron Kimball <aa...@cloudera.com>
 To: core-user@hadoop.apache.org
 Date: 06/19/2009 01:56 AM
 Subject: Re: HDFS is not loading evenly across all nodes.

 Did you run the dfs put commands from the master node?  If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets. For more balanced loading,
 you should use an off-cluster machine as the point of origin.

 If you experience uneven block distribution, you should also rebalance your
 cluster periodically by running bin/start-balancer.sh. It will work in the
 background to move blocks from heavily-laden nodes to underutilized ones.

 - Aaron

 On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
 qiming...@openresearchinc.com wrote:

 
  Hi all
 
  I dfs put a large dataset onto a 10-node cluster.
 
   When I observe the Hadoop progress (via web:50070) and each local file
   system (via df -k), I notice that my master node is hit 5-10 times harder
   than the others, so its hard drive fills up quicker. During last night's
   load, it actually crashed when the hard drive was full.
  
   To my understanding, the data should be spread across all nodes evenly
   (in a round-robin fashion, using 64 MB blocks as the unit).
  
   Is this the expected behavior of Hadoop? Can anyone suggest a good way to
   troubleshoot it?
 
  Thanks
 
 
  --
  View this message in context:
 

 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html

  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 





Re: input/output error while setting up superblock

2009-05-22 Thread Taeho Kang
I don't think HDFS is a good place to store your Xen image file, as it will
likely be updated/appended frequently in small blocks. Given the way HDFS is
designed, you can't quite use it like a regular filesystem (i.e. one that
supports frequent small-block appends/updates within files). My suggestion
is to use shared storage such as a NAS or SAN instead.

/Taeho

2009/5/22 신승엽 mikas...@naver.com

 Hi, I have a problem to use hdfs.

 I mounted hdfs using fuse-dfs.

 I created a dummy file for 'Xen' in HDFS and then formatted the dummy file
 using 'mke2fs'.

 But the operation failed with an error. The error message is as follows.

 [r...@localhost hdfs]# mke2fs -j -F ./file_dumy
 mke2fs 1.40.2 (12-Jul-2007)
 ./file_dumy: Input/output error while setting up superblock
 Also, I copied a Xen image file to HDFS, but Xen couldn't read the image
 file from HDFS.

 r...@localhost hdfs]# fdisk -l fedora6_demo.img
 last_lba(): I don't know how to handle files with mode 81a4
 You must set cylinders.
 You can do this from the extra functions menu.

 Disk fedora6_demo.img: 0 MB, 0 bytes
 255 heads, 63 sectors/track, 0 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot  Start End  Blocks   Id  System
 fedora6_demo.img1   *   1 156 1253038+  83  Linux

 Could you tell me anything about this problem?

 Thank you.



Reduce won't start until Map stage reaches 100%?

2009-02-08 Thread Taeho Kang
Dear All,

With Hadoop 0.19.0, the Reduce stage does not start until the Map stage
reaches 100% completion.
Has anyone faced a similar situation?

 ... ...
 -  map 90% reduce 0%
-  map 91% reduce 0%
-  map 92% reduce 0%
-  map 93% reduce 0%
-  map 94% reduce 0%
-  map 95% reduce 0%
-  map 96% reduce 0%
-  map 97% reduce 0%
-  map 98% reduce 0%
-  map 99% reduce 0%
-  map 100% reduce 0%
-  map 100% reduce 1%
-  map 100% reduce 2%
-  map 100% reduce 3%
-  map 100% reduce 4%
-  map 100% reduce 5%
-  map 100% reduce 6%
-  map 100% reduce 7%
-  map 100% reduce 8%
-  map 100% reduce 9%

Thank you all in advance,

/Taeho


Re: Transferring data between different Hadoop clusters

2009-02-02 Thread Taeho Kang
Thanks for your prompt reply.

When using the command
./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path

- Should this command be run on cluster1?
- What does port 50070 refer to? Is it the one in fs.default.name, or
dfs.http.address?

/Taeho
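(For reference, a sketch of the cross-version copy with placeholder host names:
50070 is the namenode's HTTP port, i.e. the value of dfs.http.address rather than
fs.default.name, and the command is normally launched on the destination cluster
since hftp is a read-only interface.)

./bin/hadoop distcp hftp://cluster1-namenode:50070/src/path hdfs://cluster2-namenode/dst/path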



On Mon, Feb 2, 2009 at 12:40 PM, Mark Chadwick mchadw...@invitemedia.com wrote:

 Taeho,

 The distcp command is perfect for this.  If you're copying between two
 clusters running the same version of Hadoop, you can do something like:

 ./bin/hadoop distcp hdfs://cluster1/path hdfs://cluster2/path

 If you're copying between 0.18 and 0.19, the command will look like:

 ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path

 Hope that helps,
 -Mark

 On Sun, Feb 1, 2009 at 9:48 PM, Taeho Kang tka...@gmail.com wrote:

  Dear all,
 
  There have been times when I needed to transfer some big data from one
  version of a Hadoop cluster to another
  (e.g. from a Hadoop 0.18 cluster to a Hadoop 0.19 cluster).
 
  Other than copying the files from one cluster to a local file system and
  uploading them to the other, is there a tool that does this?
 
  Thanks in advance,
  Regards,
 
  /Taeho
 



Datanode log for errors

2008-11-25 Thread Taeho Kang
Hi,

I have encountered some IOExceptions in a Datanode while some
intermediate/temporary map-reduce data was being written to HDFS.

2008-11-25 18:27:08,070 INFO org.apache.hadoop.dfs.DataNode: writeBlock
blk_-460494523413678075 received exception java.io.IOException: Block
blk_-460494523413678075 is valid, and cannot be written to.
2008-11-25 18:27:08,070 ERROR org.apache.hadoop.dfs.DataNode:
10.31.xx.xxx:50010:DataXceiver: java.io.IOException: Block
blk_-460494523413678075 is valid, and cannot be written to.
        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:616)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1995)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
        at java.lang.Thread.run(Thread.java:619)
It looks like one of the HDD partitions has a problem with being written to,
but the log doesn't show which partition.
Is there a way to find it out?

(Or it could be a new feature for the next version...)
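In the meantime, one way to locate the block file by hand is to search each data
directory for it; the paths below are placeholders for whatever dfs.data.dir
points at:

    find /data*/dfs/data -name 'blk_-460494523413678075*'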

Thanks in advance,

/Taeho


Re: Question on opening file info from namenode in DFSClient

2008-11-07 Thread Taeho Kang
Hi, thanks for your reply Dhruba,

One of my co-workers is writing a BigTable-like application that could be
used for online, near-real-time services. Since the application could be
hooked into online services, there would be times when a large number of
users (e.g. 1,000) request access to a few files within a very short time.

Of course, this is a rare case in a batch-processing job, but for online
services it's quite common.
I think HBase developers would have run into similar issues as well.

Is this enough explanation?

Thanks in advance,

Taeho



On Tue, Nov 4, 2008 at 3:12 AM, Dhruba Borthakur [EMAIL PROTECTED] wrote:

 In the current code, details about block locations of a file are
 cached on the client when the file is opened. This cache remains with
 the client until the file is closed. If the same file is re-opened by
 the same DFSClient, it re-contacts the namenode and refetches the
 block locations. This works ok for most map-reduce apps because it is
 rare that the same DFSClient re-opens the same file again.

 Can you please explain your use-case?

 thanks,
 dhruba


 On Sun, Nov 2, 2008 at 10:57 PM, Taeho Kang [EMAIL PROTECTED] wrote:
  Dear Hadoop Users and Developers,
 
  I was wondering if there's a plan to add a file info cache to DFSClient?
 
  It could eliminate the network round-trip cost of contacting the Namenode,
  and I think it would greatly improve DFSClient's performance.
  The code I was looking at is this:
 
  ---
  DFSClient.java

    /**
     * Grab the open-file info from namenode
     */
    synchronized void openInfo() throws IOException {
      /* Maybe, we could add a file info cache here! */
      LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
      if (newInfo == null) {
        throw new IOException("Cannot open filename " + src);
      }
      if (locatedBlocks != null) {
        Iterator<LocatedBlock> oldIter =
            locatedBlocks.getLocatedBlocks().iterator();
        Iterator<LocatedBlock> newIter =
            newInfo.getLocatedBlocks().iterator();
        while (oldIter.hasNext() && newIter.hasNext()) {
          if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
            throw new IOException("Blocklist for " + src + " has changed!");
          }
        }
      }
      this.locatedBlocks = newInfo;
      this.currentNode = null;
    }
  ---
 
  Does anybody have an opinion on this matter?
 
  Thank you in advance,
 
  Taeho
 



Question on opening file info from namenode in DFSClient

2008-11-02 Thread Taeho Kang
Dear Hadoop Users and Developers,

I was wondering if there's a plan to add a file info cache to DFSClient?

It could eliminate the network round-trip cost of contacting the Namenode,
and I think it would greatly improve DFSClient's performance.
The code I was looking at is this:

---
DFSClient.java

  /**
   * Grab the open-file info from namenode
   */
  synchronized void openInfo() throws IOException {
    /* Maybe, we could add a file info cache here! */
    LocatedBlocks newInfo = callGetBlockLocations(src, 0, prefetchSize);
    if (newInfo == null) {
      throw new IOException("Cannot open filename " + src);
    }
    if (locatedBlocks != null) {
      Iterator<LocatedBlock> oldIter =
          locatedBlocks.getLocatedBlocks().iterator();
      Iterator<LocatedBlock> newIter =
          newInfo.getLocatedBlocks().iterator();
      while (oldIter.hasNext() && newIter.hasNext()) {
        if (!oldIter.next().getBlock().equals(newIter.next().getBlock())) {
          throw new IOException("Blocklist for " + src + " has changed!");
        }
      }
    }
    this.locatedBlocks = newInfo;
    this.currentNode = null;
  }
---

Does anybody have an opinion on this matter?

Thank you in advance,

Taeho
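As an illustration of the idea above, here is a minimal sketch of the kind of
client-side cache being proposed, assuming a simple time-to-live policy; the
class and its names are purely illustrative and are not part of DFSClient:

import java.util.HashMap;
import java.util.Map;

// Sketch: cache the block-location info fetched from the namenode for a
// short TTL so that rapid re-opens of the same path do not each trigger
// a getBlockLocations() RPC.
public class OpenInfoCache<V> {

  private static class Entry<V> {
    final V value;
    final long expiresAt;
    Entry(V value, long expiresAt) {
      this.value = value;
      this.expiresAt = expiresAt;
    }
  }

  private final Map<String, Entry<V>> cache = new HashMap<String, Entry<V>>();
  private final long ttlMillis;

  public OpenInfoCache(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  // Returns the cached value, or null if absent or expired; on null the
  // caller would fall back to callGetBlockLocations() and then put().
  public synchronized V get(String src) {
    Entry<V> e = cache.get(src);
    if (e == null || e.expiresAt < System.currentTimeMillis()) {
      cache.remove(src);
      return null;
    }
    return e.value;
  }

  public synchronized void put(String src, V blockInfo) {
    cache.put(src, new Entry<V>(blockInfo, System.currentTimeMillis() + ttlMillis));
  }
}

The TTL matters because cached locations can go stale when blocks are
re-replicated or the file changes, which is presumably why DFSClient refetches
on every open today.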


Re: Add new data directory during runtime

2008-10-16 Thread Taeho Kang
Since the configuration file is only loaded when a datanode starts up,
it's not possible to have a change to dfs.data.dir applied at runtime.

Please let me know if I'm wrong.




On Fri, Oct 17, 2008 at 10:08 AM, Jinyeon Lee [EMAIL PROTECTED] wrote:

 Is it possible to add more data directories by changing the
 configuration `dfs.data.dir' during runtime?

 Regards,
 Lee, Jin Yeon



Re: dual core configuration

2008-10-08 Thread Taeho Kang
First of all, mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum are both set to 2 in
hadoop-default.xml file; this file is read before hadoop-site.xml file so
any properties that aren't set in hadoop-site.xml will follow the values set
in hadoop-default.xml.
As for the question of why only one core is utilized...
I think it really depends on the process scheduling of the underlying OS.
It's not that two tasks (two JVM subprocesses spawned by the tasktracker)
will always run on separate cores, as there are other processes competing
for the cores as well.

By the way, what tools did you use to find out which tasks (or processes)
use which cores?

/Taeho


On Wed, Oct 8, 2008 at 1:01 PM, Alex Loddengaard
[EMAIL PROTECTED] wrote:

 Taeho, I was going to suggest this change as well, but it's documented that
 mapred.tasktracker.map.tasks.maximum defaults to 2.  Can you explain why
 Elia is only having one core utilized when this config option is set to 2?

 Here is the documentation I'm referring to:
 http://hadoop.apache.org/core/docs/r0.18.1/cluster_setup.html

 Alex

 On Tue, Oct 7, 2008 at 8:27 PM, Taeho Kang [EMAIL PROTECTED] wrote:

  You can have your node (tasktracker) running more than 1 task
  simultaneously.
  You may set mapred.tasktracker.map.tasks.maximum and
  mapred.tasktracker.reduce.tasks.maximum properties found in
  hadoop-site.xml file. You should change hadoop-site.xml file on all your
  slave nodes depending on how many cores each slave has. For example, you
  don't really want to have 8 tasks running at once on a 2 core machine.
 
  /Taeho
 
  On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi
   [EMAIL PROTECTED] wrote:
 
   hello,
  
    I have some dual-core nodes, and I've noticed Hadoop is only running 1
    instance, and so is only using 1 of the CPUs on each node.
    Is there a configuration to tell it to run more than one?
    Or do I need to turn each machine into 2 nodes?
  
   Thanks.
  
 



Re: dual core configuration

2008-10-07 Thread Taeho Kang
You can have your node (tasktracker) running more than 1 task
simultaneously.
You may set mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum properties found in
hadoop-site.xml file. You should change hadoop-site.xml file on all your
slave nodes depending on how many cores each slave has. For example, you
don't really want to have 8 tasks running at once on a 2 core machine.
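For example, a 4-core slave could carry something like the following in its
hadoop-site.xml (the values are illustrative, not a recommendation), and the
tasktracker has to be restarted to pick the change up:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>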

/Taeho

On Wed, Oct 8, 2008 at 5:53 AM, Elia Mazzawi
[EMAIL PROTECTED] wrote:

 hello,

 I have some dual-core nodes, and I've noticed Hadoop is only running 1
 instance, and so is only using 1 of the CPUs on each node.
 Is there a configuration to tell it to run more than one?
 Or do I need to turn each machine into 2 nodes?

 Thanks.



Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Taeho Kang
Adding your jar files to the $HADOOP_HOME/lib folder works, but you would
have to restart all your tasktrackers to have the jar files loaded.

If you repackage your map-reduce jar file (e.g. hadoop-0.18.0-examples.jar)
with your jar file and run your job with the newly repackaged jar file, it
would work, too.
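For the -libjars error discussed below, a sketch of the client-side workaround
Mahadev describes (paths are placeholders): -libjars ships the jar to the tasks,
while HADOOP_CLASSPATH makes it visible to the job client JVM itself.

export HADOOP_CLASSPATH=/path/to/jdom.jar
$HADOOP_HOME/bin/hadoop jar myApp.jar -conf $MY_CONF_FILE -libjars /path/to/jdom.jar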

On Tue, Oct 7, 2008 at 6:55 AM, Tarandeep Singh [EMAIL PROTECTED] wrote:

 thanks Mahadev for the reply.
 So that means I have to copy my jar file in the $HADOOP_HOME/lib folder on
 all slave machines like before.

 One more question - I am adding a conf file (just like hadoop-site.xml) via
 the -conf option and I am able to query parameters in the mappers/reducers. But is
 there a way I can query the parameters in my job driver class -

 public class jobDriver extends Configured
 {
    someMethod( )
    {
       ToolRunner.run( new MyJob( ), commandLineArgs);
       // I want to query parameters present in my conf file here
    }
 }

 public class MyJob extends Configured implements Tool
 {
 }

 Thanks,
 Taran

 On Mon, Oct 6, 2008 at 2:46 PM, Mahadev Konar [EMAIL PROTECTED]
 wrote:

  Hi Tarandeep,
   The -libjars option does not add the jar on the client side. There is an
  open JIRA for that (I don't remember which one)...
 
  You have to add the jar to HADOOP_CLASSPATH on the client side so that it
  gets picked up on the client side as well.
 
 
  mahadev
 
 
  On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
 
   Hi,
  
   I want to add a jar file (that is required by mappers and reducers) to
  the
   classpath. Initially I had copied the jar file to all the slave nodes
 in
  the
   $HADOOP_HOME/lib directory and it was working fine.
  
   However when I tried the libjars option to add jar files -
  
   $HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars
  jdom.jar
  
  
   I got this error-
  
   java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder
  
   Can someone please tell me what needs to be fixed here ?
  
   Thanks,
   Taran
 
 



Re: nagios to monitor hadoop datanodes!

2008-10-06 Thread Taeho Kang
The easiest approach I can think of is to write a simple Nagios plugin that
checks whether the datanode JVM process is alive. Or you may write a Nagios
plugin that checks for error or warning messages in the datanode logs. (I am
sure you can find quite a few log-checking Nagios plugins on nagiosplugin.org.)
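As a minimal sketch of the process check (the grep pattern is an assumption;
jps prints the main class name of each running Java process):

    jps | grep DataNode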

If you are unsure of how to write a Nagios plugin, I suggest reading
"Leverage Nagios with plug-ins you write"
(http://www.ibm.com/developerworks/aix/library/au-nagios/), as it has good
explanations and examples of how to write one.

Or if you've got time to burn, you might want to read Nagios documentation,
too.

Let me know if you need help on this matter.

/Taeho



On Tue, Oct 7, 2008 at 2:05 AM, Gerardo Velez [EMAIL PROTECTED] wrote:

 Hi Everyone!


 I would like to implement Nagios health monitoring of a Hadoop grid.

 Some of you may have experience here; do you have any approach or advice I
 could use?

 So far I've only been playing with the JSP pages that Hadoop has built in,
 so I'm not sure whether it's a good idea to have Nagios request info from
 these JSPs.


 Thanks in advance!


 -- Gerardo



Questions on dfs.datanode.du.reserved

2008-10-01 Thread Taeho Kang
Dear All,
I have a few questions about the dfs.datanode.du.reserved property in the
hadoop-site.xml configuration...

Assume that I have dfs.datanode.du.reserved = 10GB and the partition
assigned to HDFS has already been filled to its capacity
(in this case, total disk size minus 10GB).
What happens if I change the dfs.datanode.du.reserved value to something
greater than 10GB, like 20GB?
Will HDFS remove or move blocks to meet that setting?

Also, is it possible to set dfs.datanode.du.reserved separately for
each partition?
(e.g. reserve 30GB for the /data1 partition and 100GB for the /data2 partition)

Many thanks,

Taeho


Re: How to order all the output file if I use more than one reduce node?

2008-08-06 Thread Taeho Kang
You may want to write a partitioner that partitions the output from the
mappers in a way that fits your definition of sorted data (e.g. all keys in
part-00001 are greater than those in part-00000). Once you've done that, just
merging all the reduce outputs from 0 to N will give you a sorted result
file.
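A minimal sketch of such a partitioner using the old (0.18-era) mapred API;
the Text key/value types and the split points are assumptions and should be
derived from your own key distribution:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of a range partitioner: keys starting with 'a'..'h' go to the first
// partition, 'i'..'p' to the second, everything else to the last one, so the
// concatenation part-00000, part-00001, ... is globally sorted.
public class AlphabetRangePartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    char first = (k.length() == 0) ? 'a' : Character.toLowerCase(k.charAt(0));
    if (numPartitions == 1 || first <= 'h') {
      return 0;
    } else if (first <= 'p') {
      return Math.min(1, numPartitions - 1);
    }
    return numPartitions - 1;
  }
}

It would then be registered with JobConf.setPartitionerClass() and run with the
number of reduces set to match the number of key ranges.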


On Thu, Aug 7, 2008 at 10:26 AM, Kevin [EMAIL PROTECTED] wrote:

 I suppose you meant to sort the result globally across files. AFAIK,
 this is not currently supported unless you have only one reducer. It
 is said that version 0.19 will introduce such a capability.

 -Kevin



 On Wed, Aug 6, 2008 at 6:01 PM, Xing [EMAIL PROTECTED] wrote:
  If I use one node for reduce, hadoop can sort the result.
  If I use 30 nodes for reduce, the results are part-00000 ~ part-00029.
  How do I make all 30 parts sorted globally, so that all the keys in
  part-00001 are greater than those in part-00000?
  Thanks a lot
 
  Xing
 



Re: Are lines broken in dfs and/or in InputSplit

2008-08-06 Thread Taeho Kang
I guess a quick way to find an answer to your question is to look at the sizes
of the data block files stored on the datanodes.

If they are all the same (e.g. 64MB), then you could say lines are NOT
preserved at the block level, as DFS simply cuts the original file into exact
64MB pieces.

They are almost all the same, by the way, except for a few blocks representing
files smaller than 64MB or the last block of a file.

/Taeho


On Thu, Aug 7, 2008 at 9:23 AM, Kevin [EMAIL PROTECTED] wrote:

 Hi,

 I guess this thread is old, but I eventually need to raise the
 question again as I am more into DFS now. Would a line be broken
 between adjacent blocks in DFS? Can lines be preserved at the block level?

 -Kevin



 On Wed, Jul 16, 2008 at 4:57 PM, Chris Douglas [EMAIL PROTECTED]
 wrote:
  InputFormats don't have a concept of blocks; each FileSplit contains a
  list of locations that advise the framework where it should prefer to
  schedule the map, i.e. on the node that contains most of the data (in
  practice, IIRC, this is the location of the first byte of the block,
  which may not actually contain the bulk of the data). For LineRecordReader,
  this means that it will open a stream, seek to its start position, and read
  to the first record delimiter (opening up a connection to the node that
  contains that block, with luck a local read), then return lines as Text
  records to the map until the end of that split precedes the start offset at
  the beginning of a read (i.e. the end of split A and the start of split B
  will likely be in the middle of a record, so A will emit that record and B
  will start from the end of that record).

  I think it's fair to say that blocks and records are orthogonal abstractions
  to HDFS and map/reduce. -C
  
  On Jul 15, 2008, at 5:07 PM, Kevin wrote:
 
  Hi,
 
   I was trying to parse text input with line-based information in the mapper,
   and this problem became an issue. I wonder if lines are preserved or
   broken when a file is cut into blocks by DFS. Also, it looks like
   although TextInputFormat breaks the file into line records, the
   InputSplit passed to the InputFormat may not preserve lines. If this is
   the case, is it possible to restore the lines for the mapper input, or do I
   have to drop broken lines? Thank you.
 
  Best,
  -Kevin
 
 



Re: java.io.IOException: Cannot allocate memory

2008-07-31 Thread Taeho Kang
Are you using HadoopStreaming?

If so, then a subprocess created by a Hadoop Streaming job can take as much
memory as it needs. In that case, the system can run out of memory and other
processes (e.g. the TaskTracker) may not be able to run properly, or may even
be killed by the OS.

/Taeho

On Fri, Aug 1, 2008 at 2:24 AM, Xavier Stevens [EMAIL PROTECTED] wrote:

 We're currently running jobs on machines with around 16GB of memory with
 8 map tasks per machine.  We used to run with max heap set to 2048m.
 Since we started using version 0.17.1 we've been getting a lot of these
 errors:

 task_200807251330_0042_m_000146_0: Caused by: java.io.IOException:
 java.io.IOException: Cannot allocate memory
 task_200807251330_0042_m_000146_0:  at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
 task_200807251330_0042_m_000146_0:  at java.lang.ProcessImpl.start(ProcessImpl.java:65)
 task_200807251330_0042_m_000146_0:  at java.lang.ProcessBuilder.start(ProcessBuilder.java:451)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.util.Shell.run(Shell.java:134)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:734)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
 task_200807251330_0042_m_000146_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

 We haven't changed our heapsizes at all.  Has anyone else experienced
 this?  Is there a way around it other than reducing heap sizes
 excessively low?  I've tried all the way down to 1024m max heap and I
 still get this error.


 -Xavier




Re: Name node heap space problem

2008-07-24 Thread Taeho Kang
Check how much memory is allocated to the JVM running the namenode.

In the file HADOOP_INSTALL/conf/hadoop-env.sh,
you should change the line that starts with "export HADOOP_HEAPSIZE=1000".

It's set to 1GB by default.


On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer [EMAIL PROTECTED]
wrote:

 Update on this one...

 I put some more memory in the machine running the name node. Now fsck is
 running. Unfortunately ls fails with a time-out.

 I identified one directory that causes the trouble. I can run fsck on it
 but not ls.

 What could be the problem?

 Gert

 Gert Pfeifer schrieb:

 Hi,
 I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
 and one secondary name node.

 I have 1788874 files and directories, 1465394 blocks = 3254268 total.
 Heap Size max is 3.47 GB.

 My problem is that I produce many small files. Therefore I have a cron
 job which just runs daily across the new files and copies them into
 bigger files and deletes the small files.

 Apart from this program, even an fsck kills the cluster.

 The problem is that, as soon as I start this program, the heap space of
 the name node reaches 100%.

 What could be the problem? There are not many small files right now and
 still it doesn't work. I guess we have had this problem since the upgrade
 to 0.17.

 Here is some additional data about the DFS:
 Capacity :   2 TB
 DFS Remaining   :   1.19 TB
 DFS Used:   719.35 GB
 DFS Used%   :   35.16 %

 Thanks for hints,
 Gert





Re: more than one reducer?

2008-07-21 Thread Taeho Kang
I don't know if there is any in-place mechanism for what you're looking for.


However, you could write a partitioner that distributes the data so that
lower keys go to lower-numbered reduces and higher keys go to higher-numbered
reduces (e.g. keys starting with 'A'~'D' go to part-00000, 'E'~'H' go to
part-00001, and so on). If you know beforehand how the keys are distributed,
you can also spread the data quite evenly across the reducers.

When you are done, simply download the result files and merge them together,
and you have sorted output.



On Tue, Jul 22, 2008 at 9:08 AM, Mori Bellamy [EMAIL PROTECTED] wrote:

 Hey all,
 I was wondering if it's possible to split up the reduce task amongst more
 than one machine. I figured it might be possible for the map output to be
 copied to multiple machines; then each reducer could sort its keys and then
 combine them into one big sorted output (a la mergesort). Does anybody know
 if there is an in-place mechanism for this?



Re: Timeouts when running balancer

2008-07-20 Thread Taeho Kang
By setting dfs.balance.bandwidthPerSec to 1GB/sec, each datanode is allowed
to use up to 1GB/sec for block balancing. That seems too high, as even
gigabit Ethernet can't move that much data per second (1Gb/sec is only about
125MB/sec).

When you get timeouts, it probably means your network is saturated. Maybe
you were running a big map-reduce job that required a lot of data transfer
among the nodes at the time?

Try setting it to 10~30MB/sec and see what happens.
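For reference, a sketch of the corresponding hadoop-site.xml entry; the value
is given in bytes per second, so 10 MB/sec would be:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>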

On Sat, Jul 19, 2008 at 1:56 AM, David J. O'Dell [EMAIL PROTECTED]
wrote:

 I'm trying to re-balance my cluster as I've added more nodes.
 When I run balancer with the default threshold I am seeing timeouts in
 the logs:

 2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Decided to
 move block -8432927406854991437 with a length of 128 MB bytes from
 10.11.6.234:50010 to 10.11.6.235:50010 using proxy source
 10.11.6.234:50010
 2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Starting
 Block mover for -8432927406854991437 from 10.11.6.234:50010 to
 10.11.6.235:50010
 2008-07-18 09:52:46,826 WARN org.apache.hadoop.dfs.Balancer: Timeout
 moving block -8432927406854991437 from 10.11.6.234:50010 to
 10.11.6.235:50010 through 10.11.6.234:50010

 I read in the balancer guide
 (http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2)
 that the default transfer rate is 1MB/sec.
 I tried increasing this to 1GB/sec but I'm still seeing the timeouts.
 All of the nodes have GigE NICs and are on the same switch.


 --
 David O'Dell
 Director, Operations
 e: [EMAIL PROTECTED]
 t:  (415) 738-5152
 180 Townsend St., Third Floor
 San Francisco, CA 94107




Re: newbie in streaming: How to execute a single executable

2008-07-13 Thread Taeho Kang
1. You will have to modify your C++ binary (or any other binary) so that
it takes its input from stdin
and writes its output to stdout.

2. If you run your job as a mapper-only job, you'll have as many result
files as the number of mappers created.
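For example, a mapper-only streaming run could look something like the
following; the streaming jar path, HDFS paths, and executable name are
placeholders, -file ships the binary to the task nodes, and
-jobconf mapred.reduce.tasks=0 makes the job map-only:

bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input /user/charan/points -output /user/charan/clusters \
    -mapper cluster_points -file cluster_points \
    -jobconf mapred.reduce.tasks=0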

On Fri, Jul 11, 2008 at 4:14 AM, Charan Thota [EMAIL PROTECTED]
wrote:


 Hi,

  I'm a newbie to streaming in Hadoop. I want to know how to execute a
 single C++ executable.
 Should it be a mapper-only job? The executable clusters a set of points
 present in a file,
 so it cannot really be said to be a mapper or a reducer. Also, there is no
 source code present, except for the executable.

  Please tell me how to execute this on Hadoop. Is there any other way
 (apart from streaming) to do this?

 Thank you

 Charan T.





MapReduce with multi-languages

2008-07-08 Thread Taeho Kang
Dear Hadoop User Group,

What are elegant ways to run map-reduce jobs on text-based data encoded with
something other than UTF-8?

It looks like Hadoop assumes the text data is always in UTF-8 and handles
the data that way, encoding with UTF-8 and decoding with UTF-8.
Whenever the data is not UTF-8 encoded, problems arise.

Here is what I'm thinking of to resolve the situation... please correct and
advise me if my approaches look bad!

(1) Re-encode the original data with UTF-8?
(2) Replace the part of source code where UTF-8 encoder and decoder are
used?
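As an illustration of the kind of handling option (2) implies at the
application level, a mapper can decode the raw bytes of each Text line with
the known charset itself; this is only a sketch, and the charset name and the
identity output are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: decode each input line with a non-UTF-8 charset, then re-emit it
// (Text re-encodes as UTF-8, so this effectively transcodes the data).
public class NonUtf8LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Text.getBytes() may return a backing array longer than the data,
    // so only the first getLength() bytes are decoded.
    String line = new String(value.getBytes(), 0, value.getLength(), "EUC-KR");
    output.collect(new Text(line), new Text(""));
  }
}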

Or have any of you had trouble running map-reduce jobs on data in
multiple languages?

Any suggestions/advices are welcome and appreciated!

Regards,

Taeho


Re: Inconsistency in namenode's and datanode's namespaceID

2008-07-02 Thread Taeho Kang
No, I don't think it's a bug.

Your datanodes' data partition/directory was probably used in another HDFS
setup and thus had a different namespaceID.

Alternatively, you could use a different partition/directory for your new HDFS
setup by setting a different value for dfs.data.dir on your datanode, but in
that case you can't access your old HDFS's data.
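The two IDs can be compared directly; the paths below are placeholders for
whatever dfs.name.dir and dfs.data.dir point at:

    grep namespaceID /path/to/dfs/name/current/VERSION /path/to/dfs/data/current/VERSION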


On Thu, Jul 3, 2008 at 4:21 AM, Xuan Dzung Doan [EMAIL PROTECTED]
wrote:

 I was following the quickstart guide to run pseudo-distributed operations
 with Hadoop 0.16.4. I got it to work successfully the first time, but I
 failed to repeat the steps (I tried to re-do everything from re-formatting
 the HDFS). Then, by looking at the log files of the daemons, I found out the
 datanode failed to start because its namespaceID didn't match the
 namenode's. I then found that the namespaceID is stored in the text
 file VERSION under dfs/data/current and dfs/name/current for the datanode
 and the namenode, respectively. The reformatting step changes the
 namespaceID of the namenode, but not of the datanode, and that's the cause
 of the inconsistency. So after reformatting, if I manually update the
 namespaceID of the datanode, things work totally fine again.

 I guess there are probably others who had this same experience. Is it a bug
 in Hadoop 0.16.4? If so, has it been taken care of in later versions?

 Thanks,
 David.






Question on HadoopStreaming and Memory Usage

2008-06-15 Thread Taeho Kang
Dear All,

I've got a question about Hadoop Streaming and its memory management.
Does Hadoop Streaming have a mechanism to prevent over-usage of memory by
its subprocesses (the Map or Reduce function)?

Say a binary used for the reduce phase allocates itself lots and lots of
memory, to the point that it starves other important processes like the
Datanode or TaskTracker. Does Hadoop Streaming prevent such cases?

Thank you in advance,

Taeho


Re: Questions on how to use DistributedCache

2008-05-23 Thread Taeho Kang
Thank you for your clarification!

One more question here,

The API doc says...
DistributedCache is a facility provided by the Map-Reduce framework to
cache files (text, archives, jars etc.) needed by applications.

My question is...
Is it also possible to distribute binary files (to be executed on the
slave nodes in a MapReduce job)?

P.S. I have tried it and it hasn't been successful. Is this normal?

/Taeho
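For the binary question, a sketch of the job-setup side; the HDFS path and the
link name are assumptions, and createSymlink() is what makes the cached file
appear under that name in each task's working directory:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupSketch {
  // Sketch: ship an executable to the task nodes via DistributedCache.
  public static void addTool(JobConf conf) throws Exception {
    DistributedCache.createSymlink(conf);
    // "#mytool" asks the framework to symlink the cached file as ./mytool
    // in each task's working directory.
    DistributedCache.addCacheFile(new URI("/user/taeho/bin/mytool#mytool"), conf);
  }
}

The task may still have to make the local copy executable (e.g. chmod it)
before exec'ing it, which is one common reason a distributed binary "doesn't
work" at first.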


On Thu, May 22, 2008 at 7:15 PM, Devaraj Das [EMAIL PROTECTED] wrote:



  -Original Message-
  From: Taeho Kang [mailto:[EMAIL PROTECTED]
  Sent: Thursday, May 22, 2008 3:41 PM
  To: core-user@hadoop.apache.org
  Subject: Re: Questions on how to use DistributedCache
 
  Thanks for your reply.
 
  Just one more thing to ask..
 
  From what I see from the source code,
  it looks like the files/jars registered in DistributedCache
  gets uploaded to DFS and then downloaded to slave nodes.
 
  Is there a way I can specify the path in the slave nodes
  where files/jars get downloaded to?

 No that is not possible. They get localized to specific directories (as per
 mapred.local.dir). The files are optionally symlinked in the current
 working
 directory of the task.

 
  /Taeho
 
 
  On Thu, May 22, 2008 at 4:20 PM, Arun C Murthy
  [EMAIL PROTECTED] wrote:
 
  
   On May 21, 2008, at 10:45 PM, Taeho Kang wrote:
  
Dear all,
  
   I am trying to use DistributedCache class for distributing files
   required for running my jobs.
  
   While API documentation provides good guidelines, Is there
  any tips
   or usage examples (e.g. sample codes)?
  
  
    http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
    and
    http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v2.0
  
   Arun
  
  
If you could share your experience with me, I would really
  appreciate it.
  
   Thank you in advance,
  
   /Taeho
  
  
  
 




Questions on how to use DistributedCache

2008-05-21 Thread Taeho Kang
Dear all,

I am trying to use DistributedCache class for distributing files required
for running my jobs.

While the API documentation provides good guidelines,
are there any tips or usage examples (e.g. sample code)?

If you could share your experience with me, I would really appreciate it.

Thank you in advance,

/Taeho


Re: Trash option in hadoop-site.xml configuration.

2008-03-20 Thread Taeho Kang
Thank you for the clarification.

Here is another question.
If two different clients issue a move to trash with different intervals
(e.g. client #1 with fs.trash.interval = 60; client #2 with
fs.trash.interval = 120),
what would happen?

Does the namenode keep track of all this info?

/Taeho


On 3/20/08, dhruba Borthakur [EMAIL PROTECTED] wrote:

 The trash feature is a client side option and depends on the client
 configuration file. If the client's configuration specifies that Trash
 is enabled, then the HDFS client invokes a rename to Trash instead of
 a delete. Now, if Trash is enabled on the Namenode, then the
 Namenode periodically removes contents from the Trash directory.

 This design might be confusing to some users. But it provides the
 flexibility that different clients in the cluster can have either Trash
 enabled or disabled.

 Thanks,
 dhruba

 -Original Message-
 From: Taeho Kang [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 19, 2008 3:13 AM
 To: [EMAIL PROTECTED]; core-user@hadoop.apache.org;
 [EMAIL PROTECTED]
 Subject: Trash option in hadoop-site.xml configuration.

 Hello,

 I have these two machines that act as clients to HDFS.

 Node #1 has Trash option enabled (e.g. fs.trash.interval set to 60)
 and Node #2 has Trash option off (e.g. fs.trash.interval set to 0)

 When I order a file deletion from Node #2, the file gets deleted right
 away,
 while the file gets moved to the trash when I do the same from Node #1.

 This is a bit of surprise to me,
 because I thought Trash option that I have set in the master node's
 config
 file
 applies to everyone who connects to / uses the HDFS.

 Was there any reason why Trash option was implemented in this way?

 Thank you in advance,

 /Taeho



Trash option in hadoop-site.xml configuration.

2008-03-19 Thread Taeho Kang
Hello,

I have these two machines that act as clients to HDFS.

Node #1 has Trash option enabled (e.g. fs.trash.interval set to 60)
and Node #2 has Trash option off (e.g. fs.trash.interval set to 0)

When I order a file deletion from Node #2, the file gets deleted right away,
while the file gets moved to the trash when I do the same from Node #1.

This is a bit of surprise to me,
because I thought Trash option that I have set in the master node's config
file
applies to everyone who connects to / uses the HDFS.

Was there any reason why Trash option was implemented in this way?

Thank you in advance,

/Taeho


Upgrade Hadoop from 0.12 to 0.16 - don't do it!!

2008-03-06 Thread Taeho Kang
Hello all,

I wanted to share my experience with those of you who want to upgrade Hadoop
from 0.12 or earlier versions to more recent versions like 0.16.

After installing 0.16 and trying start-dfs.sh, the Namenode gave me an
exception saying I had to use the -upgrade option.
I gave the -upgrade option and the Namenode and Datanodes came up alright.
But when I tried finalizing the upgrade, it didn't work, nor did the
-rollback option.
From there on, the only way I could have the cluster up and running was to
give the -upgrade option.

So here is my advice: move to 0.13 first and then do the upgrade from
there.

However, following the steps found in the wiki, I was able to upgrade from
0.12 to 0.13 alright.
I hope it's not going to be too painful upgrading from 0.13 to 0.14 or
upwards, using the -upgrade / -rollback / -finalize options :-)

Also, if anybody wants to share any good or painful experiences with me, I
would really appreciate it!

/Taeho