Re: large files vs many files

2009-05-09 Thread Sasha Dolgy
Yes, that is the problem.  Two, or hundreds... the data streams in very quickly.

On Fri, May 8, 2009 at 8:42 PM, jason hadoop jason.had...@gmail.com wrote:

 Is it possible that two tasks are trying to write to the same file path?


 On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote:

  Hi Tom (or anyone else),
  Will SequenceFile allow me to avoid problems with concurrent writes to the
  file?  I still continue to get the following exceptions/errors in HDFS:
 
  org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
  failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528
  on client 127.0.0.1 because current leaseholder is trying to recreate file.
 
  This only happens when two processes are trying to write at the same time.
  Ideally I don't want to buffer the data that's coming in; I want to get it
  out and into the file ASAP to avoid any data loss.  Am I missing something
  here?  Is there some sort of factory I can implement to help in writing a
  lot of simultaneous data streams?
 
  thanks in advance for any suggestions
  -sasha
 
  On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote:
 
   Hi Sasha,
  
   As you say, HDFS appends are not yet working reliably enough to be
   suitable for production use. On the other hand, having lots of little
   files is bad for the namenode, and inefficient for MapReduce (see
   http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
   it's best to avoid this too.
  
   I would recommend using SequenceFile as a storage container for lots
   of small pieces of data. Each key-value pair would represent one of
   your little files (you can have a null key, if you only need to store
   the contents of the file). You can also enable compression (use block
   compression), and SequenceFiles are designed to work well with
   MapReduce.
  
   Cheers,
  
   Tom
  
   On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com
   wrote:
Hi there,
I'm working through a concept at the moment and was attempting to write lots
of data to a few files, as opposed to writing lots of data to lots of little
files.  What are the thoughts on this?
   
When I try to implement outputStream = hdfs.append(path); there doesn't
seem to be any locking mechanism in place ... or there is, and it doesn't
work well enough for many writes per second?
   
I have read that the property dfs.support.append is not meant for
production use.  Still, if millions of little files are as good as, better
than, or no different from a few massive files, then I suppose append isn't
something I really need.
   
I do see a lot of stack traces with messages like:
   
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
client 127.0.0.1 because current leaseholder is trying to recreate file.
   
I hope this makes sense.  Still a little bit confused.
   
thanks in advance
-sd
   
--
Sasha Dolgy
sasha.do...@gmail.com
  
 



 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals




-- 
Sasha Dolgy
sasha.do...@gmail.com
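
A minimal sketch of the SequenceFile approach Tom describes in the quoted
message above: one writer packs many small payloads into a single
block-compressed file.  The path, key choice and payload are illustrative
assumptions, not taken from the thread; the createWriter overload shown is
the 0.18/0.19-era API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallRecordWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/foo/bar/events-00001.seq");  // illustrative path

        // Block compression packs many small key-value pairs into each
        // compressed block, which is what makes this efficient for tiny records.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            // Each incoming "little file" becomes one record.  A timestamp key
            // is one option; NullWritable works if only the contents matter.
            writer.append(new LongWritable(System.currentTimeMillis()),
                          new Text("payload of one small file"));
        } finally {
            writer.close();
        }
    }
}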


Re: Huge DataNode Virtual Memory Usage

2009-05-09 Thread Stefan Will
Chris,

Thanks for the tip ... However I'm already running 1.6.0_10:

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Do you know of a specific bug # in the JDK bug database that addresses this?

Cheers,
Stefan


 From: Chris Collins ch...@scoutlabs.com
 Reply-To: core-user@hadoop.apache.org
 Date: Fri, 8 May 2009 20:34:21 -0700
 To: core-user@hadoop.apache.org core-user@hadoop.apache.org
 Subject: Re: Huge DataNode Virtual Memory Usage
 
 Stefan, there was a nasty memory leak in 1.6.x before 1.6.0_10.  It
 manifested itself during major GC.  We saw this on Linux and Solaris,
 and it dramatically improved with an upgrade.
 
 C
 On May 8, 2009, at 6:12 PM, Stefan Will wrote:
 
 Hi,
 
 I just ran into something rather scary: one of my datanode processes that
 I'm running with -Xmx256M, and a maximum number of Xceiver threads of 4095,
 had a virtual memory size of over 7GB (!).  I know that the VM size on Linux
 isn't necessarily equal to the actual memory used, but I wouldn't expect it
 to be an order of magnitude higher either.  I ran pmap on the process, and it
 showed around 1000 thread stack blocks of roughly 1MB each (which is the
 default size on the 64-bit JDK).  The largest block was 3GB in size, and I
 can't figure out what it is for.
 
 Does anyone have any insights into this?  Anything that can be done to
 prevent this, other than restarting the DFS regularly?
 
 -- Stefan




Re: large files vs many files

2009-05-09 Thread jason hadoop
You must create unique file names.  I don't believe (but I do not know for
sure) that the append code will allow multiple writers.

Are you writing from within a task, or as an external application writing
into Hadoop?

You may try using UUID,
http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of your
filename.
Without knowing more about your goals, environment and constraints it is
hard to offer any more detailed suggestions.
You could also have an application aggregate the streams and write out
chunks, with one or more writers, one per output file.
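
As a rough illustration of the UUID naming above, here is a minimal external
writer; the directory layout and record format are assumptions, not part of
the original suggestion.

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueFileWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One file per writer: the random UUID keeps concurrent writers from
        // colliding on the same path/lease, which is what triggers the
        // AlreadyBeingCreatedException seen in the quoted stack traces.
        Path path = new Path("/foo/bar/stream-" + UUID.randomUUID() + ".dat");

        FSDataOutputStream out = fs.create(path);
        try {
            out.writeBytes("one incoming event\n");
        } finally {
            out.close();
        }
    }
}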


On Sat, May 9, 2009 at 12:15 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Yes, that is the problem.  Two, or hundreds... the data streams in very quickly.

 On Fri, May 8, 2009 at 8:42 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Is it possible that two tasks are trying to write to the same file path?
 
 
  On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote:
 
   Hi Tom (or anyone else),
   Will SequenceFile allow me to avoid problems with concurrent writes to the
   file?  I still continue to get the following exceptions/errors in HDFS:
  
   org.apache.hadoop.ipc.RemoteException:
   org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
   failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528
   on client 127.0.0.1 because current leaseholder is trying to recreate file.
  
   This only happens when two processes are trying to write at the same time.
   Ideally I don't want to buffer the data that's coming in; I want to get it
   out and into the file ASAP to avoid any data loss.  Am I missing something
   here?  Is there some sort of factory I can implement to help in writing a
   lot of simultaneous data streams?
  
   thanks in advance for any suggestions
   -sasha
  
   On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote:
  
Hi Sasha,
   
As you say, HDFS appends are not yet working reliably enough to be
suitable for production use. On the other hand, having lots of little
files is bad for the namenode, and inefficient for MapReduce (see
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
it's best to avoid this too.
   
I would recommend using SequenceFile as a storage container for lots
of small pieces of data. Each key-value pair would represent one of
your little files (you can have a null key, if you only need to store
the contents of the file). You can also enable compression (use block
compression), and SequenceFiles are designed to work well with
MapReduce.
   
Cheers,
   
Tom
   
On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com
wrote:
 Hi there,
 I'm working through a concept at the moment and was attempting to write lots
 of data to a few files, as opposed to writing lots of data to lots of little
 files.  What are the thoughts on this?

 When I try to implement outputStream = hdfs.append(path); there doesn't
 seem to be any locking mechanism in place ... or there is, and it doesn't
 work well enough for many writes per second?

 I have read that the property dfs.support.append is not meant for
 production use.  Still, if millions of little files are as good as, better
 than, or no different from a few massive files, then I suppose append isn't
 something I really need.

 I do see a lot of stack traces with messages like:

 org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
 create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
 client 127.0.0.1 because current leaseholder is trying to recreate file.

 I hope this makes sense.  Still a little bit confused.

 thanks in advance
 -sd

 --
 Sasha Dolgy
 sasha.do...@gmail.com
   
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 



 --
 Sasha Dolgy
 sasha.do...@gmail.com




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: ClassNotFoundException

2009-05-09 Thread jason hadoop
rel is short for the Hadoop version you are using: 0.18.x, 0.19.x, 0.20.x,
etc.
You must make all of the required jars available to all of your tasks.  You
can either install them on all the tasktracker machines and set up the
tasktracker classpath to include them, or distribute them via the
distributed cache.

Chapter 5 of my book goes into this in some detail, and is available now as
a download: http://www.apress.com/book/view/9781430219422
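
For -libjars to take effect, the job's main class has to run its arguments
through GenericOptionsParser, which the Tool/ToolRunner pattern does
automatically.  A minimal sketch, assuming the 0.18/0.19-era JobConf API;
the class name, job name and paths are illustrative.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already reflects any -libjars / -D options parsed by ToolRunner.
        JobConf job = new JobConf(getConf(), MyJob.class);
        job.setJobName("my-job");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);   // identity mapper/reducer by default
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser first, so a launch line such as
        //   hadoop jar myjob.jar MyJob -libjars path/hadoop-rel-examples.jar in out
        // ships the listed jars to the tasks via the distributed cache.
        System.exit(ToolRunner.run(new MyJob(), args));
    }
}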

On Fri, May 8, 2009 at 4:22 PM, georgep p09...@gmail.com wrote:


  Sorry, I misspelled your name, Jason.

 George

 georgep wrote:
 
  Hi Joe,
 
   Thank you for the reply, but do I need to include every supporting jar
   file in the application path? What is the -rel-?
 
  George
 
 
  jason hadoop wrote:
 
   1) When running under Windows, include the Cygwin bin directory in your
   Windows PATH environment variable.
   2) Eclipse is not so good at submitting supporting jar files; in your
   application launch path, add -libjars path/hadoop-rel-examples.jar.
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/ClassNotFoundException-tp23441528p23455206.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: large files vs many files

2009-05-09 Thread Sasha Dolgy
Would WritableFactories not allow me to open one output stream and continue
to write() and sync()?

Maybe I'm reading that wrong.  Although UUID would be nice, it would still
leave me with the problem of having lots of little files instead of a few
large files.

-sd
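
One way to keep a single stream open, along the lines of the aggregation
Jason suggests in the quoted reply below: many producers hand records to one
writer thread that owns the only open handle, so only one client ever holds
the lease.  A rough sketch; the queue, path and the per-record sync() call
are illustrative assumptions, not a tested recipe.

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamAggregator implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
    private final FSDataOutputStream out;

    public StreamAggregator(Configuration conf, Path path) throws IOException {
        // Only this object ever opens the file, so only one lease is held.
        out = FileSystem.get(conf).create(path);
    }

    // Producers (one per incoming stream) call this; they never touch HDFS.
    public void submit(String record) throws InterruptedException {
        queue.put(record);
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                out.writeBytes(queue.take() + "\n");
                out.sync();  // flushed per record here for clarity; batching is cheaper
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}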

On Sat, May 9, 2009 at 8:37 AM, jason hadoop jason.had...@gmail.com wrote:

 You must create unique file names.  I don't believe (but I do not know for
 sure) that the append code will allow multiple writers.

 Are you writing from within a task, or as an external application writing
 into Hadoop?

 You may try using UUID,
 http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of
 your
 filename.
 Without knowing more about your goals, environment and constraints it is
 hard to offer any more detailed suggestions.
 You could also have an application aggregate the streams and write out
 chunks, with one or more writers, one per output file.




Re: Huge DataNode Virtual Memory Usage

2009-05-09 Thread Chris Collins

I think it may have been 6676016:

http://java.sun.com/javase/6/webnotes/6u10.html

We were able to repro this at the time through heavy Lucene indexing plus
our internal document pre-processing logic that churned a lot of objects.
We still experience similar issues with _10, but much more rarely.  Maybe
going to _13 will shed some light; you could be tickling another similar
bug, but I didn't see anything obvious.


C


On May 9, 2009, at 12:30 AM, Stefan Will wrote:


Chris,

Thanks for the tip ... However I'm already running 1.6.0_10:

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b15, mixed mode)

Do you know of a specific bug # in the JDK bug database that addresses this?

Cheers,
Stefan



From: Chris Collins ch...@scoutlabs.com
Reply-To: core-user@hadoop.apache.org
Date: Fri, 8 May 2009 20:34:21 -0700
To: core-user@hadoop.apache.org core-user@hadoop.apache.org
Subject: Re: Huge DataNode Virtual Memory Usage

Stefan, there was a nasty memory leak in 1.6.x before 1.6.0_10.  It
manifested itself during major GC.  We saw this on Linux and Solaris,
and it dramatically improved with an upgrade.

C
On May 8, 2009, at 6:12 PM, Stefan Will wrote:


Hi,

I just ran into something rather scary: one of my datanode processes that
I'm running with -Xmx256M, and a maximum number of Xceiver threads of 4095,
had a virtual memory size of over 7GB (!).  I know that the VM size on Linux
isn't necessarily equal to the actual memory used, but I wouldn't expect it
to be an order of magnitude higher either.  I ran pmap on the process, and it
showed around 1000 thread stack blocks of roughly 1MB each (which is the
default size on the 64-bit JDK).  The largest block was 3GB in size, and I
can't figure out what it is for.

Does anyone have any insights into this?  Anything that can be done to
prevent this, other than restarting the DFS regularly?

-- Stefan







Heterogeneous cluster - quadcores/8 cores, Fairscheduler

2009-05-09 Thread Saptarshi Guha
Hello,
Our unit has 5 quad-core machines running Hadoop, with a dedicated
JobTracker/NameNode.  Each machine has 32GB of RAM.
We have the option of buying an 8-core, 128GB machine, and the question
is: would this be useful as a TaskTracker?

A) It can certainly be used as the JobTracker and NameNode.
B) I also read about the FairScheduler, and mapred.fairscheduler.loadmanager
appears to allow the administrator to let more tasks run on a given
machine.  Thus the new machine might be able to run double the number
of maps and reduces.

Otherwise, if the FairScheduler is not an option (or I misunderstood it),
then adding the new machine to the cluster would only underutilize it.

Are my inferences correct?
Regards
Saptarshi Guha
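
Separately from the FairScheduler question, per-node task-slot counts are a
TaskTracker-side setting, so the larger machine can simply advertise more
slots in its own configuration.  A sketch for that node's hadoop-site.xml;
the slot counts are illustrative values, not recommendations.

<!-- hadoop-site.xml on the 8-core / 128GB node only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>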


Re: Error when start hadoop cluster.

2009-05-09 Thread jason hadoop
It looks like you have different versions of the jars, or perhaps someone
has run ant in one of your installation directories.

On Fri, May 8, 2009 at 7:54 PM, nguyenhuynh.mr nguyenhuynh...@gmail.comwrote:

 Hi all!


 I cannot start HDFS successfully.  I checked the log file and found the
 following message:

 2009-05-09 08:17:55,026 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = haris1.asnet.local/192.168.1.180
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.18.2
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
 /
 2009-05-09 08:17:55,302 ERROR org.apache.hadoop.dfs.DataNode:
 java.io.IOException: Incompatible namespaceIDs in
 /tmp/hadoop-root/dfs/data: namenode namespaceID = 880518114; datanode
 namespaceID = 461026751
at
 org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:226)
at

 org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:141)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:306)
at org.apache.hadoop.dfs.DataNode.init(DataNode.java:223)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3031)
at
 org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2986)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2994)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3116)

 2009-05-09 08:17:55,303 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at haris1.asnet.local/192.168.1.180
 /

 Please help me!


 P/S: I'm using Hadoop 0.18.2 and HBase 0.18.1.


 Thanks,

 Best.

 Nguyen




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


[ANN] hbase-0.19.2 available for download

2009-05-09 Thread stack
HBase 0.19.2 is now available for download

 http://hadoop.apache.org/hbase/releases.html

17 issues have been fixed since HBase 0.19.1.  Release notes are available
here: http://tinyurl.com/p3x2bn http://tinyurl.com/8xmyx9

Thanks to all who contributed to this release.

At your service,
The HBase Team


Re-Addressing a cluster

2009-05-09 Thread John Kane
I have a situation that I have not been able to find in the mail archives.

I have an active cluster that was built on a private switch with private IP
address space (192.168.*.*)

I need to relocate the cluster into real address space.

Assuming that I have working DNS, is there an issue? Do I just need to be
sure that I utilize hostnames for everything and then be fat, dumb and
happy? Or are IP Addresses tracked by the namenode, etc?

Thanks


Re: Most efficient way to support shared content among all mappers

2009-05-09 Thread Jeff Hammerbacher
Hey,

For a more detailed discussion of how to use memcached for this purpose, see
the paper "Low-Latency, High-Throughput Access to Static Global Resources
within the Hadoop Framework":
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf

Regards,
Jeff
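
A rough sketch of the memcached variant Jason mentions in the quoted reply
below, looking up a shared read-only structure from inside a map task.  It
assumes the spymemcached client is on the task classpath, a purely
illustrative cache host name and key, and the 0.18/0.19-era mapred API.

import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedLookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private MemcachedClient cache;

    public void configure(JobConf job) {
        try {
            // One connection per task; every mapper sees the same global table.
            cache = new MemcachedClient(new InetSocketAddress("cache-host", 11211));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Shared structure, e.g. neural-network weights, fetched on demand.
        Object weights = cache.get("model-weights");
        out.collect(value, new Text(String.valueOf(weights)));
    }

    public void close() {
        cache.shutdown();
    }
}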

On Fri, May 8, 2009 at 2:49 PM, jason hadoop jason.had...@gmail.com wrote:

 Most of the people with this need are using some variant of memcached, or
 other distributed hash table.

 On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote:

 
  Hi,
  As a newcomer to Hadoop, I wonder about an efficient way to support shared
  content among all mappers.  For example, to implement a neural network
  algorithm, I want the NN data structure accessible by all mappers.
  Thanks for your comments!
  - Joe
 
 
 
 


 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Most efficient way to support shared content among all mappers

2009-05-09 Thread jason hadoop
Thanks Jeff!

On Sat, May 9, 2009 at 1:31 PM, Jeff Hammerbacher ham...@cloudera.comwrote:

 Hey,

 For a more detailed discussion of how to use memcached for this purpose,
 see the paper "Low-Latency, High-Throughput Access to Static Global Resources
 within the Hadoop Framework":
 http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf

 Regards,
 Jeff

 On Fri, May 8, 2009 at 2:49 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Most of the people with this need are using some variant of memcached, or
  other distributed hash table.
 
  On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote:
 
  
   Hi,
    As a newcomer to Hadoop, I wonder about an efficient way to support shared
    content among all mappers.  For example, to implement a neural network
    algorithm, I want the NN data structure accessible by all mappers.
   Thanks for your comments!
   - Joe
  
  
  
  
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Re-Addressing a cluster

2009-05-09 Thread jason hadoop
You should be able to relocate the cluster's IP space by stopping the
cluster, modifying the configuration files, resetting the DNS, and starting
the cluster again.
It would be best to verify connectivity with the new IP addresses before
starting the cluster.

To the best of my knowledge the namenode doesn't care about the IP addresses
of the datanodes, only what blocks they report as having.  The namenode does
care about losing contact with a connected datanode, and will replicate the
blocks that are then under-replicated.

I prefer IP addresses in my configuration files, but that is a personal
preference, not a requirement.

On Sat, May 9, 2009 at 11:51 AM, John Kane john.k...@kane.net wrote:

 I have a situation that I have not been able to find in the mail archives.

 I have an active cluster that was built on a private switch with private IP
 address space (192.168.*.*)

 I need to relocate the cluster into real address space.

 Assuming that I have working DNS, is there an issue? Do I just need to be
 sure that I utilize hostnames for everything and then be fat, dumb and
 happy? Or are IP Addresses tracked by the namenode, etc?

 Thanks




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals