Re: Hadoop 0.20 - Class Not Found exception

2009-06-29 Thread Steve Loughran

Amandeep Khurana wrote:

I'm getting the following error while starting a MR job:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
oracle.jdbc.driver.OracleDriver
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at
org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

Interestingly, the relevant jar is bundled into the MR job jar and it's also
there in the $HADOOP_HOME/lib directory.

Exactly the same thing worked with 0.19. Not sure what changed, or what I
broke, to cause this error...


Could be the classloader hierarchy; the JDBC driver needs to be at the right 
level. Try preheating the driver by loading it in your own code -then the 
jdbc: URLs might work- and take it out of the MR job JAR.
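
A minimal sketch of that preheating, assuming the driver class name from the 
stack trace above and that the ojdbc JAR is on the client classpath:

// Sketch only: force the Oracle JDBC driver to register with DriverManager
// before DBInputFormat/DBConfiguration asks for a connection.
try {
  Class.forName("oracle.jdbc.driver.OracleDriver");
} catch (ClassNotFoundException e) {
  throw new RuntimeException("Oracle JDBC driver not on the classpath", e);
}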


Re: FYI, Large-scale graph computing at Google

2009-06-29 Thread Steve Loughran

Edward J. Yoon wrote:

I just made a wiki page -- http://wiki.apache.org/hadoop/Hambrug --
Let's discuss the graph computing framework named Hambrug.



OK, first question: why the name Hambrug? To me that's just Hamburg typed 
wrong, which is going to cause lots of confusion.


What about something more graphy? like "descartes"


Re: FYI, Large-scale graph computing at Google

2009-06-29 Thread Steve Loughran

Patterson, Josh wrote:

Steve,
I'm a little lost here; is this a replacement for M/R, or is it some new
code that sits on top of M/R and runs an iteration over some sort of
graph's vertices? My quick scan of Google's article didn't seem to yield
a distinction. Either way, I'd say for our data that a graph processing
lib for M/R would be interesting.



I'm thinking of graph algorithms that get implemented as MR jobs; work 
with HDFS, HBase, etc.


Re: FYI, Large-scale graph computing at Google

2009-06-25 Thread Steve Loughran

mike anderson wrote:

This would be really useful for my current projects. I'd be more than happy
to help out if needed.



well the first bit of code to play with then is this

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/citerank/

The standalone.xml file is the one you want to build and run with; the 
other would require you to check out and build two levels up, but gives 
you the ability to bring up local or remote clusters to test. Call 
run-local to run it locally, which should give you some stats like this:


 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Counters: 11
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:   File Systems
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Local bytes read=209445683448
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Local bytes written=173943642259
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:   Map-Reduce Framework
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Reduce input groups=9985124
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Combine output records=34
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Map input records=24383448
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Reduce output records=16494967
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Map output bytes=1243216870
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Map input bytes=1528854187
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Combine input records=4528655
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Map output records=41958636
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:     Reduce input records=37430015


==
Exiting project "citerank"
==

BUILD SUCCESSFUL - at 25/06/09 17:09
Total time: 9 minutes 1 second

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: FYI, Large-scale graph computing at Google

2009-06-25 Thread Steve Loughran

Edward J. Yoon wrote:

What do you think about another new computation framework on HDFS?

On Mon, Jun 22, 2009 at 3:50 PM, Edward J. Yoon  wrote:

http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
-- It sounds like Pregel is a computing framework based on dynamic
programming for graph operations. I guess maybe they removed the
file communication/intermediate files during iterations.

Anyway, what do you think?


I have a colleague (paolo) who would be interested in adding a set of 
graph algorithms on top of the MR engine


Re: Hadoop 0.20.0, xml parsing related error

2009-06-25 Thread Steve Loughran

Ram Kulbak wrote:

Hi,
The exception is a result of having xerces in the classpath. To resolve,
make sure you are using Java 6 and set the following system property:

-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl


This can also be resolved by the Configuration class (line 1045) making sure
it loads the DocumentBuilderFactory bundled with the JVM and not a 'random'
classpath-dependent factory.
Hope this helps,
Ram


Lovely -I've noted this in the comments of the bugrep
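
The same fix can also be applied programmatically, before the first 
Configuration object is created; a minimal sketch, assuming the JDK 6 
bundled Xerces implementation named above:

// Sketch: pin JAXP to the JVM-bundled parser so a stray xerces jar on the
// classpath is not picked up. Must run before the first DocumentBuilderFactory
// lookup (i.e. before any Hadoop Configuration is loaded).
System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
    "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");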


Re: "Too many open files" error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Stas Oskin wrote:

Hi.

So what would be the recommended approach for the pre-0.20.x series?

To ensure each file is used by only one thread, and then it is safe to close
the handle in that thread?

Regards.


Good question -I'm not sure. For anything you get with 
FileSystem.get(), it's now dangerous to close, so try just setting the 
reference to null and hoping that GC will do the finalize() when needed.


Re: "Too many open files" error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Raghu Angadi wrote:


Is this before 0.20.0? Assuming you have closed these streams, it is 
mostly https://issues.apache.org/jira/browse/HADOOP-4346


It is the JDK internal implementation that depends on GC to free up its 
cache of selectors. HADOOP-4346 avoids this by using hadoop's own cache.


yes, and it's that change that led to my stack traces :(

http://jira.smartfrog.org/jira/browse/SFOS-1208


Re: "Too many open files" error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Scott Carey wrote:

Furthermore, if for some reason it is required to dispose of any objects after 
others are GC'd, weak references and a weak reference queue will perform 
significantly better in throughput and latency - orders of magnitude better - 
than finalizers.




Good point.

It would make sense for the FileSystem cache to be weakly referenced, so 
that on long-lived processes the client references get cleaned up 
without waiting for app termination.
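
For anyone unfamiliar with the pattern Scott describes, here is a minimal, 
generic sketch (not Hadoop's actual cache) of a weak-reference cache with a 
reference queue doing the cleanup instead of finalizers:

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch, not Hadoop code: entries become collectable once no client
// holds a strong reference; drainQueue() then discards the stale map entries.
class WeakCache<K, V> {
  private final Map<K, WeakReference<V>> cache = new ConcurrentHashMap<K, WeakReference<V>>();
  private final ReferenceQueue<V> queue = new ReferenceQueue<V>();

  void put(K key, V value) {
    drainQueue();
    cache.put(key, new WeakReference<V>(value, queue));
  }

  V get(K key) {
    WeakReference<V> ref = cache.get(key);
    return ref == null ? null : ref.get(); // null once the referent was collected
  }

  // Poll the queue instead of relying on finalize(): cheaper and far more prompt.
  private void drainQueue() {
    Reference<? extends V> dead;
    while ((dead = queue.poll()) != null) {
      cache.values().remove(dead); // drop the mapping whose value was collected
    }
  }
}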


Re: "Too many open files" error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

jason hadoop wrote:

Yes.
Otherwise the file descriptors will flow away like water.
I also strongly suggest having at least 64k file descriptors as the open
file limit.

On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin  wrote:


Hi.

Thanks for the advice. So you advise explicitly closing each and every file
handle that I receive from HDFS?

Regards.


I must disagree somewhat

If you use FileSystem.get() to get your client filesystem class, then 
that is shared by all threads/classes that use it. Call close() on that 
and any other thread or class holding a reference is in trouble. You 
have to wait for the finalizers for them to get cleaned up.


If you use FileSystem.newInstance() - which came in fairly recently 
(0.20? 0.21?) then you can call close() safely.


So: it depends on how you get your handle.

see: https://issues.apache.org/jira/browse/HADOOP-5933

Also: the too many open files problem can be caused in the NN -you need 
to set up the kernel to allow lots more file handles. Lots.
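
To make the distinction concrete, a rough sketch against roughly 0.20-era 
APIs (FileSystem.newInstance() only exists in releases that have it, so 
treat that call as an assumption):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsClientSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Shared, cached client: every FileSystem.get(conf) in this JVM returns the
    // same object, so close() here would break other threads/classes using it.
    FileSystem shared = FileSystem.get(conf);
    shared.exists(new Path("/"));            // use it, but do not close it

    // Private client: safe to close once this code is done with it.
    FileSystem mine = FileSystem.newInstance(conf);
    try {
      mine.exists(new Path("/"));
    } finally {
      mine.close();                          // closes only this instance
    }
  }
}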




Re: Name Node HA (HADOOP-4539)

2009-06-22 Thread Steve Loughran

Andrew Wharton wrote:

https://issues.apache.org/jira/browse/HADOOP-4539

I am curious about the state of this fix. It is listed as
"Incompatible", but is resolved and committed (according to the
comments). Is the backup name node going to make it into 0.21? Will it
remove the SPOF for HDFS? And if so, what is the proposed release
timeline for 0.21?




The way to deal with HA -which the BackupNode doesn't promise- is to get 
involved in developing and testing the leading edge source tree.


The 0.21 cutoff is approaching, BackupNode is in there, but it needs a 
lot more tests. If you want to aid the development, helping to get more 
automated BackupNode tests in there (indeed, tests that simulate more 
complex NN failures, like a corrupt EditLog) would go a long way.


-steve


Re: Hadoop Eclipse Plugin

2009-06-18 Thread Steve Loughran

Praveen Yarlagadda wrote:

Hi,

I have a problem configuring Hadoop Map/Reduce plugin with Eclipse.

Setup Details:

I have a namenode, a jobtracker and two data nodes, all running on ubuntu.
My set up works fine with example programs. I want to connect to this setup
from eclipse.

namenode - 10.20.104.62 - 54310(port)
jobtracker - 10.20.104.53 - 54311(port)

I run eclipse on a different windows m/c. I want to configure map/reduce
plugin
with eclipse, so that I can access HDFS from windows.

Map/Reduce master
Host - With jobtracker IP, it did not work
Port - With jobtracker port, it did not work

DFS master
Host - With namenode IP, It did not work
Port - With namenode port, it did not work

I tried the other combination too, giving namenode details for the Map/Reduce
master and jobtracker details for the DFS master. It did not work either.


1. Check the ports really are open by running netstat -a -p on the 
namenode and job tracker:

netstat -a -p | grep 54310 on the NN
netstat -a -p | grep 54311 on the JT

2. Then, from the Windows machine, see if you can connect to them 
outside Eclipse:


telnet 10.20.104.62 54310
telnet 10.20.104.53 54311

If you can't connect, then firewalls are interfering

If everything works, the problem is in the eclipse plugin (which I don't 
use, and cannot assist with)


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Running Hadoop/Hbase in a OSGi container

2009-06-12 Thread Steve Loughran

Ninad Raut wrote:

OSGi provides navigability to your components and creates a life cycle for
each of those components, viz. install, start, stop, un-deploy etc.
This is the reason why we are thinking of creating components using OSGi.
The problem we are facing is that our components use MapReduce and HDFS, and
the OSGi container cannot detect the Hadoop mapred engine or HDFS.

I have searched the net and it looks like people are working on, or have
achieved success in, running Hadoop in an OSGi container.

Ninad



1. I am doing work on a simple lifecycle for the services, 
start/stop/ping, which is not OSGi (OSGi worries a lot about 
classloading and versioning); check out HADOOP-3628 for this.


2. You can run it under OSGi systems, such as the OSGi branch of 
SmartFrog : 
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/branches/core-branch-osgi/, 
or under non-OSGi tools. Either way, these tools are left dealing with 
classloading and the like.


3. Any container is going to have to deal with the fact that bits of all 
the services call System.exit(); you handle that by running under a 
security manager, trapping the call and raising an exception instead 
(see the sketch after this list).


4. Any container is then going to have to deal with the fact that from 
0.20 onwards, Hadoop does things with security policy that are 
incompatible with normal Java security managers: whatever security 
manager you have for trapping system exits can't extend the default one.


5. Any container also has to deal with the fact that every service 
(namenode, job tracker, etc.) makes a lot of assumptions about 
singletons -that it has exclusive use of filesystem objects retrieved 
through FileSystem.get(), and the like. While OSGi can handle that with 
its classloading work, it's still fairly complex.


6. There are also lots of JVM memory/thread management issues, see the 
various Hadoop bugs
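
As an illustration of point 3, a bare-bones security manager that turns 
System.exit() into an exception; note point 4, which means this plain 
version may not survive contact with 0.20+ security code, so treat it as 
a sketch only:

// Bare-bones sketch of trapping System.exit() inside a container.
public class NoExitSecurityManager extends SecurityManager {

  public static class ExitTrappedException extends SecurityException {
    public ExitTrappedException(int status) {
      super("System.exit(" + status + ") trapped");
    }
  }

  @Override
  public void checkExit(int status) {
    throw new ExitTrappedException(status);  // surface the exit as an exception
  }

  @Override
  public void checkPermission(java.security.Permission perm) {
    // deliberately permissive: allow everything except exit;
    // a real container would delegate to a proper policy
  }

  @Override
  public void checkPermission(java.security.Permission perm, Object context) {
    // permissive as above
  }

  public static void install() {
    System.setSecurityManager(new NoExitSecurityManager());
  }
}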


If you look at the slides of what I've been up to, you can see that it 
can be done

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/doc/dynamic_hadoop_clusters.ppt

However,
 * you really need to run every service in its own process, for memory 
and reliability alone

 * It's pretty leading edge
 * You will have to invest the time and effort to get it working

If you want to do the work, start with what I've been doing and bring it up 
under the OSGi container of your choice. You can come and play with our 
tooling; I'm cutting a release today of this week's Hadoop trunk merged 
with my branch. It is of course experimental, as even the trunk is a bit 
up-and-down on feature stability.


-steve



Re: Multiple NIC Cards

2009-06-10 Thread Steve Loughran

John Martyniak wrote:
Does hadoop "cache" the server names anywhere?  Because I changed to 
using DNS for name resolution, but when I go to the nodes view, it is 
trying to view with the old name.  And I changed the hadoop-site.xml 
file so that it no longer has any of those values.




in SVN head, we try and get Java to tell us what is going on
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get the 
value, which is cached for the life of the process. I don't know of anything 
else, but wouldn't be surprised -the Namenode has to remember the 
machines where stuff was stored.
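
A quick way to see what Java will report for a given box (and therefore what 
gets cached for the life of the process) is a tiny diagnostic along these lines:

import java.net.InetAddress;

// Tiny diagnostic: print what Java resolves for this host.
public class WhoAmI {
  public static void main(String[] args) throws Exception {
    InetAddress local = InetAddress.getLocalHost();
    System.out.println("hostname  = " + local.getHostName());
    System.out.println("canonical = " + local.getCanonicalHostName());
    System.out.println("address   = " + local.getHostAddress());
  }
}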





Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:
When I run either of those on either of the two machines, it is trying 
to resolve against the DNS servers configured for the external addresses 
for the box.


Here is the result
Server:xxx.xxx.xxx.69
Address:xxx.xxx.xxx.69#53


OK. In an ideal world, each NIC has a different hostname. Now, that 
confuses code that assumes a host has exactly one hostname, not zero or 
two, and I'm not sure how well Hadoop handles the 2+ situation (I know 
it doesn't like 0, but hey, it's a distributed application). With 
separate hostnames, you set Hadoop up to work on the inner addresses, 
and give out the inner hostnames of the jobtracker and namenode. As a 
result, all traffic to the master nodes should be routed on the internal 
network.


Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:

I am running Mac OS X.

So en0 points to the external address and en1 points to the internal 
address on both machines.


Here is the internal results from duey:
en1: flags=8963 
mtu 1500

inet6 fe80::21e:52ff:fef4:65%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.102 netmask 0xff00 broadcast 192.168.1.255
ether 00:1e:52:f4:00:65
media: autoselect (1000baseT ) status: active



lladdr 00:23:32:ff:fe:1a:20:66
media: autoselect  status: inactive
supported media: autoselect 

Here are the internal results from huey:
en1: flags=8863 mtu 1500
inet6 fe80::21e:52ff:fef3:f489%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.103 netmask 0xff00 broadcast 192.168.1.255


what does
  nslookup 192.168.1.103
and
  nslookup 192.168.1.102
say?

There really ought to be different names for them.


> I have some other applications running on these machines, that
> communicate across the internal network and they work perfectly.

I admire their strength. Multihost systems cause us trouble. That and 
machines that don't quite know who they are

http://jira.smartfrog.org/jira/browse/SFOS-5
https://issues.apache.org/jira/browse/HADOOP-3612
https://issues.apache.org/jira/browse/HADOOP-3426
https://issues.apache.org/jira/browse/HADOOP-3613
https://issues.apache.org/jira/browse/HADOOP-5339

One thing to consider is that some of the various services of Hadoop are 
bound to 0.0.0.0, which means every IPv4 address. You really want to 
bring up everything, including the Jetty services, on the internal 
network adapter, by binding them to 192.168.1.102; this will cause anyone 
trying to talk to them over the other network to fail, which at least 
finds the problem sooner rather than later.




Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:
My original names were huey-direct and duey-direct, both names in the 
/etc/hosts file on both machines.


Are nn.internal and jt.internal special names?


No, just examples of internal names on a multihomed network, where your 
external names could be something completely different.


What does /sbin/ifconfig say on each of the hosts?



Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:

David,

For the Option #1.

I just changed the names to the IP Addresses, and it still comes up as 
the external name and ip address in the log files, and on the job 
tracker screen.


So option 1 is a no go.

When I change the "dfs.datanode.dns.interface" values it doesn't seem to 
do anything.  When I was search archived mail, this seemed to be a the 
approach to change the NIC card being used for resolution.  But when I 
change it nothing happens, I even put in bogus values and still no issues.


-John



I've been having similar but different fun with Hadoop-on-VMs; there's a 
lot of assumption in the code that DNS and rDNS all work consistently. 
Do you have separate internal and external hostnames? In which case, can 
you bring up the job tracker as jt.internal and the namenode as nn.internal 
(so the full HDFS URL is something like hdfs://nn.internal/ ), etc., etc.?
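
For example, something along these lines in hadoop-site.xml on every node and 
client (property names as used in the 0.18-0.20 era; the jobtracker port is 
just a placeholder):

<property>
  <name>fs.default.name</name>
  <value>hdfs://nn.internal/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jt.internal:54311</value>
</property>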


Re: Every time the mapping phase finishes I see this

2009-06-08 Thread Steve Loughran

Mayuran Yogarajah wrote:
There are always a few 'Failed/Killed Task Attempts' and when I view the 
logs for

these I see:

- some that are empty, ie stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child

java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)

at java.io.FilterInputStream.close(FilterInputStream.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103) 

at 
org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173) 


at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)




Any idea why this happens? I don't understand why I'd be seeing these 
only as the mappers get to 100%.


Seen this when something in the same process got a FileSystem reference 
by FileSystem.get() and then called close() on it -it closes the client 
for every thread/class that has a reference to the same object.



We're planning on adding more diagnostics, by tracking who closed the 
filesystem

https://issues.apache.org/jira/browse/HADOOP-5933


Re: Hadoop scheduling question

2009-06-08 Thread Steve Loughran

Aaron Kimball wrote:


Finally, there's a third scheduler called the Capacity scheduler. It's
similar to the fair scheduler, in that it allows guarantees of minimum
availability for different pools. I don't know how it apportions additional
extra resources though -- this is the one I'm least familiar with. Someone
else will have to chime in here.





There's a dynamic priority scheduler in the patch queue that I've 
promised to commit this week. It's the one with a notion of currency: you 
pay for your work/priority. At peak times, work costs more.

https://issues.apache.org/jira/browse/HADOOP-4768


Re: Monitoring hadoop?

2009-06-07 Thread Steve Loughran

Matt Massie wrote:

Anthony-

The ganglia web site is at http://ganglia.info/ with documentation in a 
wiki at http://ganglia.wiki.sourceforge.net/.  There is also a good wiki 
page at IBM as well 
http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia .  
Ganglia packages are available for most distributions to help with 
installation so make sure to grep for ganglia with your favorite package 
manager (e.g. aptitude, yum, etc).  Ganglia will give you more 
information about your cluster than just Hadoop metrics.  You'll get 
CPU, load, memory, disk and network  monitoring as well for free.


You can see live demos of ganglia at http://ganglia.info/?page_id=69.

Good luck.

-Matt


Out of modesty, Matt neglects to mention that Ganglia is one of his 
projects, so not only does it work well with Hadoop today, I would 
expect the integration to only get better over time.


Anthony -don't forget to feed those stats back into your DFS for later 
analysis...


Re: Hadoop ReInitialization.

2009-06-03 Thread Steve Loughran

b wrote:

But after formatting and starting DFS I need to wait some time (sleep 
60) before putting data into HDFS. Otherwise I will receive 
"NotReplicatedYetException".


That means the namenode is up but there aren't enough datanodes reporting in yet.


Re: question about when shuffle/sort start working

2009-06-01 Thread Steve Loughran

Todd Lipcon wrote:

Hi Jianmin,

This is not (currently) supported by Hadoop (or Google's MapReduce either
afaik). What you're looking for sounds like something more like Microsoft's
Dryad.

One thing that is supported in versions of Hadoop after 0.19 is JVM reuse.
If you enable this feature, task trackers will persist JVMs between jobs.
You can then persist some state in static variables.

I'd caution you, however, from making too much use of this fact as anything
but an optimization. The reason that Hadoop is limited to MR (or M+RM* as
you said) is that simplicity and reliability often go hand in hand. If you
start maintaining important state in RAM on the tasktracker JVMs, and one of
them goes down, you may need to restart your entire job sequence from the
top. In typical MapReduce, you may need to rerun a mapper or a reducer, but
the state is all on disk ready to go.

-Todd




I'd have thought the question is not necessarily one of maintaining state, 
but of chaining the output from one job into another, where the number of 
iterations depends on the outcome of the previous set. Funnily enough, 
this is what you (apparently) end up having to do when implementing 
PageRank-like ranking as MR jobs:

http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-29 Thread Steve Loughran

ashish pareek wrote:

Yes, I am able to ping and ssh between the two virtual machines, and I
have even set the IP addresses of both virtual machines in their respective
/etc/hosts files ...

Thanks for the reply ... please suggest anything else I could
have missed, or any remedy ...

Regards,
Ashish Pareek.


VMs? VMWare? Xen? Something else?

I've encountered problems on virtual networks where the machines aren't 
locatable via DNS, and can't be sure who they say they are.


1. Start the machines individually, instead of using the start-all script, 
which needs SSH working too.

2. Check with netstat -a to see what ports/interfaces they are listening on.

-steve



Re: hadoop hardware configuration

2009-05-28 Thread Steve Loughran

Patrick Angeles wrote:

Sorry for cross-posting, I realized I sent the following to the hbase list
when it's really more a Hadoop question.


This is an interesting question. Obviously as an HP employee you must 
assume that I'm biased when I say HP DL160 servers are good  value for 
the workers, though our blade systems are very good for a high physical 
density -provided you have the power to fill up the rack.




2 x Hadoop Master (and Secondary NameNode)

   - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
   - 16GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - Hardware RAID controller
   - Redundant Power Supply
   - Approx. 390W power draw (1.9amps 208V)
   - Approx. $4000 per unit


I do not know what the advantages of that many cores are on a NN. 
Someone needs to do some experiments. I do know you need enough RAM to 
hold the index in memory, and you may want to go for a bigger block size 
to keep the index size down.




6 x Hadoop Task Nodes

   - 1 x 2.3Ghz Quad Core (Opteron 1356)
   - 8GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - No RAID (JBOD)
   - Non-Redundant Power Supply
   - Approx. 210W power draw (1.0amps 208V)
   - Approx. $2000 per unit

I had some specific questions regarding this configuration...




   1. Is hardware RAID necessary for the master node?



You need a good story to ensure that loss of a disk on the master 
doesn't lose the filesystem. I like RAID there, but the alternative is 
to push the stuff out over the network to other storage you trust. That 
could be NFS-mounted RAID storage, it could be NFS mounted JBOD. 
Whatever your chosen design, test that it works before you go live: run 
the cluster, then simulate different failures and see how well the 
hardware/ops team handles them.


Keep an eye on where that data goes, because when the NN runs out of 
file storage, the consequences can be pretty dramatic (i.e. the cluster 
doesn't come up unless you edit the edit log by hand).



   2. What is a good processor-to-storage ratio for a task node with 4TB of
   raw storage? (The config above has 1 core per 1TB of raw storage.)


That really depends on the work you are doing...the bytes in/out to CPU 
work, and the size of any memory structures that are built up over the run.


With 1 core per physical disk, you get the bandwidth of a single disk 
per CPU; for some IO-intensive work you can make the case for two 
disks/CPU -one in, one out, but then you are using more power, and 
if/when you want to add more storage, you have to pull out the disks to 
stick in new ones. If you go for more CPUs, you will probably need more 
RAM to go with it.



   3. Am I better off using dual quads for a task node, with a higher power
   draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200,
   but draws almost 2x as much power. The tradeoffs are:
  1. I will get more CPU per dollar and per watt.
  2. I will only be able to fit 1/2 as much dual quad machines into a
  rack.
  3. I will get 1/2 the storage capacity per watt.
  4. I will get less I/O throughput overall (less spindles per core)


First there is the algorithm itself, and whether you are IO or CPU 
bound. Most MR jobs that I've encountered are fairly IO bound -without 
indexes, every lookup has to stream through all the data, so it's power 
inefficient and IO limited. But if you are trying to do higher level 
stuff than just lookups, then you will be doing more CPU work.


Then there is the question of where your electricity comes from, what 
the limits for the room are, whether you are billed on power drawn or 
quoted PSU draw, what the HVAC limits are, what the maximum allowed 
weight per rack is, etc, etc.


I'm a fan of low-Joule work, though we don't have any benchmarks yet of 
the power efficiency of different clusters; the number of MJ used to do 
a terasort. I'm debating doing some single-CPU tests for this on my 
laptop, as the battery knows how much gets used up by some work.



   4. In planning storage capacity, how much spare disk space should I take
   into account for 'scratch'? For now, I'm assuming 1x the input data size.


That you should probably be able to determine from experimental work on 
smaller datasets. Some maps can throw out a lot of data; most reduces do 
actually reduce the final amount.



-Steve

(Disclaimer: I'm not making any official recommendations for hardware 
here, just making my opinions known. If you do want an official 
recommendation from HP, talk to your reseller or account manager, 
someone will look at your problem in more detail and make some 
suggestions. If you have any code/data that could be shared for 
benchmarking, that would help validate those suggestions)




Re: ssh issues

2009-05-26 Thread Steve Loughran

hmar...@umbc.edu wrote:

Steve,

Security through obscurity is always a good practice from a development
standpoint and one of the reasons why tricking you out is an easy task.


:)

My most recent presentation on HDFS clusters is now online, notice how it
doesn't gloss over the security: 
http://www.slideshare.net/steve_l/hdfs-issues



Please, keep hiding relevant details from people in order to keep everyone
smiling.



HDFS is as secure as NFS: you are trusted to be who you say you are. 
Which means that you have to run it on a secured subnet -access 
restricted to trusted hosts and/or one or two front-end servers- or accept 
that your dataset is readable and writeable by anyone on the network.


There is user identification going in; it is currently at the level 
where it will stop someone accidentally deleting the entire filesystem 
if they lack the rights. Which has been known to happen.


If the team looking after the cluster demands separate SSH keys/logins for 
every machine, then not only are they making their operations costs high; 
once you have got the HDFS cluster and MR engine live, it's moot. You 
can push out work to the JobTracker, which then runs it on the machines 
under whatever userid the TaskTrackers are running on. Now, 0.20+ will 
run it under the identity of the user who claimed to be submitting the 
job, but without that, your MR jobs get the filesystem access rights of 
the user that is running the TT. And it's fairly straightforward to 
create a modified Hadoop client JAR that doesn't call whoami to get the 
userid, and instead spoofs being anyone. Which means that even if you 
lock down the filesystem -no out-of-datacentre access- if I can run my 
Java code as MR jobs in your cluster, I can have unrestricted access to 
the filesystem by way of the task tracker server.


But Hal, if you are running Ant for your build I'm running my code on 
your machines anyway, so you had better be glad that I'm not malicious.


-Steve


Re: ssh issues

2009-05-22 Thread Steve Loughran

Pankil Doshi wrote:

Well, I made the ssh keys with passphrases, as the systems I need to log in
to require ssh with passphrases, and those systems have to be part of my
cluster. So I need a way to specify -i path/to/key and the passphrase to
Hadoop beforehand.

Pankil



Well, you are trying to manage a system whose security policy is 
incompatible with Hadoop's current shell scripts. If you push out the 
configs and manage the lifecycle using other tools, this becomes a 
non-issue. Don't raise the topic of HDFS security with your ops team 
though, as they will probably be unhappy about what is currently on offer.


-steve


Re: Username in Hadoop cluster

2009-05-21 Thread Steve Loughran

Pankil Doshi wrote:

Hello everyone,

Till now I was using the same username on all my Hadoop cluster machines.

But now I am building my new cluster and face a situation in which I have
different usernames for different machines. So what changes will I have to
make in configuring Hadoop? Using the same username, ssh was easy; will it
be a problem now that I have different usernames?


Are you building these machines up by hand? How many? Why the different 
usernames?


Can't you just create a new user and group "hadoop" on all the boxes?


Re: Optimal Filesystem (and Settings) for HDFS

2009-05-20 Thread Steve Loughran

Bryan Duxbury wrote:
We use XFS for our data drives, and we've had somewhat mixed results. 



Thanks for that. I've just created a wiki page to put some of these 
notes up -extensions and some hard data would be welcome


http://wiki.apache.org/hadoop/DiskSetup

One problem we have for hard data is that we need some different 
benchmarks for MR jobs. Terasort is good for measuring IO and MR 
framework performance, but for more CPU intensive algorithms, or things 
that need to seek round a bit more, you can't be sure that terasort 
benchmarks are a good predictor of what's right for you in terms of 
hardware, filesystem, etc.


Contributions in this area would be welcome.

I'd like to measure the power consumed on a run too, which is actually 
possible as far as my laptop is concerned, because you can ask its 
battery what happened.


-steve


Re: Suspend or scale back hadoop instance

2009-05-19 Thread Steve Loughran

John Clarke wrote:

Hi,

I am working on a project that is suited to Hadoop and so want to create a
small cluster (only 5 machines!) on our servers. The servers are however
used during the day and (mostly) idle at night.

So, I want Hadoop to run at full throttle at night and either scale back or
suspend itself during certain times.


You could add/remove new task trackers on idle systems, but
* you don't want to take away datanodes, as there's a risk that data 
will become unavailable.
* there's nothing in the scheduler to warn that machines will go away at 
a certain time
If you only want to run the cluster at night, I'd just configure the 
entire cluster to go up and down


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Steve Loughran

Tom White wrote:

On Mon, May 18, 2009 at 11:44 AM, Steve Loughran  wrote:

Grace wrote:

To follow up this question, I have also asked for help on the JRockit forum. They
kindly offered some useful and detailed suggestions according to the JRA
results. After updating the option list, the performance did become better
to some extent. But it is still not comparable with the Sun JVM. Maybe it
is due to the use case with short duration and different implementations in
the JVM layer between Sun and JRockit. I would like to go back to using the
Sun JVM for now. Thanks all for your time and help.


what about flipping the switch that says "run tasks in the TT's own JVM?".
That should handle startup costs, and reduce the memory footprint



The property mapred.job.reuse.jvm.num.tasks allows you to set how many
tasks the JVM may be reused for (within a job), but it always runs in
a separate JVM to the tasktracker. (BTW
https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion
about running tasks in the tasktracker's JVM.)

Tom


Tom,
that's why you are writing a book on Hadoop and I'm not ...you know the 
answers and I have some vague misunderstandings.


-steve
(returning to the svn book)


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Steve Loughran

Grace wrote:

To follow up this question, I have also asked for help on the JRockit forum. They
kindly offered some useful and detailed suggestions according to the JRA
results. After updating the option list, the performance did become better
to some extent. But it is still not comparable with the Sun JVM. Maybe it
is due to the use case with short duration and different implementations in
the JVM layer between Sun and JRockit. I would like to go back to using the
Sun JVM for now. Thanks all for your time and help.



what about flipping the switch that says "run tasks in the TT's own 
JVM?". That should handle startup costs, and reduce the memory footprint


Re: Beware sun's jvm version 1.6.0_05-b13 on linux

2009-05-18 Thread Steve Loughran

Allen Wittenauer wrote:



On 5/15/09 11:38 AM, "Owen O'Malley"  wrote:


We have observed that the default jvm on RedHat 5


I'm sure some people are scratching their heads at this.

The default JVM on at least RHEL5u0/1 is a GCJ-based 1.4, clearly
incapable of running Hadoop.  We [and, really, this is my doing... ^.^ ]
replace it with the JVM from the JPackage folks.  So while this isn't the
default JVM that comes from RHEL, the warning should still be heeded. 



Presumably it's one of those hard-to-reproduce race conditions that only 
surfaces under load on a big cluster, so is hard to replicate in a unit 
test, right?




Re: public IP for datanode on EC2

2009-05-15 Thread Steve Loughran

Tom White wrote:

Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.



I thought that AWS had stopped you being able to talk to things within 
the cluster using the public IP addresses -stopped you using DynDNS as 
your way of bootstrapping discovery


Here's what may work
-bring up the EC2 cluster using the local names
-open up the ports
-have the clients talk using the public IP addresses

The problem will arise when the namenode checks the fs name used and it 
doesn't match its expectations -there were some recent patches in the 
code to handle this when someone talks to the namenode using the 
IP address instead of the hostname; they may work for this situation too.


personally, I wouldn't trust the NN in the EC2 datacentres to be secure 
to external callers, but that problem already exists within their 
datacentres anyway


Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Yes, I think the JVM uses way more memory than just its heap. Now some of it
might be just reserved memory, but not actually used (not sure how to tell
the difference). There are also things like thread stacks, jit compiler
cache, direct nio byte buffers etc. that take up process space outside of
the Java heap. But none of that should imho add up to Gigabytes...


good article on this
http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/



Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

zsongbo wrote:

Hi Stefan,
Yes, the 'nice' cannot resolve this problem.

Now, in my cluster, there are 8GB of RAM. My java heap configuration is:

HDFS DataNode : 1GB
HBase-RegionServer: 1.5GB
MR-TaskTracker: 1GB
MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)

But the memory usage is still tight.


does TT need to be so big if you are running all your work in external VMs?


Re: Huge DataNode Virtual Memory Usage

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Raghu,

I don't actually have exact numbers from jmap, although I do remember that
jmap -histo reported something less than 256MB for this process (before I
restarted it).

I just looked at another DFS process that is currently running and has a VM
size of 1.5GB (~600 resident). Here jmap reports a total object heap usage
of 120MB. The memory block list reported by jmap  doesn't actually seem
to contain the heap at all since the largest block in that list is 10MB in
size (/usr/java/jdk1.6.0_10/jre/lib/amd64/server/libjvm.so). However, pmap
reports a total usage of 1.56GB.

-- Stefan


You know, if you could get the TaskTracker to include stats on real and 
virtual memory use, I'm sure that others would welcome those reports 
-knowing that the job was slower and its VM was 2x physical would give you 
a good hint as to the root cause.


Re: Winning a sixty second dash with a yellow elephant

2009-05-12 Thread Steve Loughran

Arun C Murthy wrote:

... oh, and getting it to run a marathon too!

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html 



Owen & Arun


Lovely. I will now stick up the pic of you getting the first results in 
on your laptop at apachecon


Re: Re-Addressing a cluster

2009-05-11 Thread Steve Loughran

jason hadoop wrote:

Now that I think about it, the reverse lookups in my clusters work.


and you have made sure that IPv6 is turned off, right?




Re: datanode replication

2009-05-11 Thread Steve Loughran

Jeff Hammerbacher wrote:

Hey Vishal,

Check out the chooseTarget() method(s) of ReplicationTargetChooser.java in
the org.apache.hadoop.hdfs.server.namenode package:
http://svn.apache.org/viewvc/hadoop/core/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java?view=markup
.

In words: assuming you're using the default replication level (3), the
default strategy will put one block on the local node, one on a node in a
remote rack, and another on that same remote rack.

Note that HADOOP-3799 (http://issues.apache.org/jira/browse/HADOOP-3799)
proposes making this strategy pluggable.



Yes, there's some good reasons for having different placement algorithms 
for different datacentres, and I could even imagine different MR 
sequences providing hints about where they want data, depending on what 
they want to do afterwards


Re: Re-Addressing a cluster

2009-05-11 Thread Steve Loughran

jason hadoop wrote:

You should be able to relocate the cluster's IP space by stopping the
cluster, modifying the configuration files, resetting the dns and starting
the cluster.
Be best to verify connectivity with the new IP addresses before starting the
cluster.

To the best of my knowledge the namenode doesn't care about the IP addresses
of the datanodes, only what blocks they report as having. The namenode does
care about losing contact with a connected datanode, and will replicate the
blocks that are then under-replicated.

I prefer IP addresses in my configuration files but that is a personal
preference not a requirement.



I do deployments onto virtual clusters without fully functional reverse 
DNS; things do work badly in that situation. Hadoop assumes that if a 
machine looks up its hostname, it can pass that to peers and they can 
resolve it -the "well managed network infrastructure" assumption.


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-08 Thread Steve Loughran

Grace wrote:

Thanks all for your replying.

I have run several times with different Java options for Map/Reduce
tasks. However there is no much difference.

Following is the example of my test setting:
Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages
-XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch
Test B: -Xmx1024m
Test C: -Xmx1024m -XXaggressive

Is there any tricky or special setting for Jrockit vm on Hadoop?

In the Hadoop Quick Start guides, it says that "JavaTM 1.6.x, preferably
from Sun". Is there any concern about the Jrockit performance issue?



The main thing is that all the big clusters are running (as far as I know)
Linux (probably RedHat) and Sun Java. This is where the performance and 
scale testing is done. If you are willing to spend time doing the 
experiments and tuning, then I'm sure we can update those guides to say 
"JRockit works, here are some options...".


-steve





Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Steve Loughran

Chris Collins wrote:
A couple of years back we did a lot of experimentation between Sun's VM 
and JRockit. We had initially assumed that JRockit was going to scream, 
since that's what the press were saying. In short, what we discovered 
was that certain JDK library usage was a little bit faster with JRockit, 
but for core VM performance such as synchronization and primitive 
operations the Sun VM outperformed it. We were not taking account of 
startup time, just raw code execution. As I said, this was a couple of 
years back so things may have changed.


C


I run JRockit as it's what some of our key customers use, and we need to 
test things. One lovely feature is that tests time out before the stack runs 
out on a recursive operation; clearly different stack management at 
work. Another: no PermGen heap space to fiddle with.


* I have to turn debug logging off in Hadoop test runs, or there are 
problems.


* It uses short pointers (32 bits long) for near memory on a 64-bit JVM, 
so your memory footprint on sub-4GB VM images is better. Java 7 promises 
this, and with the merger, who knows what we will see. This is 
unimportant on 32-bit boxes.


* Debug single stepping doesn't work. That's OK, I use functional tests 
instead :)


I haven't looked at outright performance.


Re: move tasks to another machine on the fly

2009-05-06 Thread Steve Loughran

Tom White wrote:

Hi David,

The MapReduce framework will attempt to rerun failed tasks
automatically. However, if a task is running out of memory on one
machine, it's likely to run out of memory on another, isn't it? Have a
look at the mapred.child.java.opts configuration property for the
amount of memory that each task VM is given (200MB by default). You
can also control the memory that each daemon gets using the
HADOOP_HEAPSIZE variable in hadoop-env.sh. Or you can specify it on a
per-daemon basis using the HADOOP_<daemon>_OPTS variables in the
same file.

Tom


This looks not so much like a VM out-of-memory problem as an OS thread 
provisioning one. ulimit may be useful, as is the java -Xss option.


http://candrews.integralblue.com/2009/01/preventing-outofmemoryerror-native-thread/
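
For example, shrinking the per-thread stack of the task JVMs via the 
mapred.child.java.opts property Tom mentions (the values here are only 
illustrative, not a recommendation):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -Xss512k</value>
</property>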



On Wed, May 6, 2009 at 1:28 AM, David Batista  wrote:

I get this error when running Reduce tasks on a machine:

java.lang.OutOfMemoryError: unable to create new native thread
   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:597)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2591)
   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:387)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
   at 
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
   at 
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)

is it possible to move a reduce task to other machine in the cluster on the fly?

--
./david





Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-06 Thread Steve Loughran

Edward Capriolo wrote:

'cloud computing' is a hot term. According to the definition provided
by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well.

Hadoop is scalable, with HOD it is dynamically scalable.

I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
'utility computing'. as managing the stack and getting started is
quite a complex process.


Exactly. Which is why the Apache Clouds proposal emphasises

-Lightweight front end: low Wattage, stateless nodes for web GUI, bonded 
to the back end


-instrumentation for liveness and load monitoring. Hadoop has a lot of 
this, I'm trying to add more, but we want it everywhere.


-Resource Management: bringing up and tearing down nodes by asking the 
infrastructure. Some Apache projects have done this but only for EC2 and 
only for their layer of the stack. You need something that keeps track 
of everything and acts in your interests, not those of the datacentre 
provider


-Packaging for fully automated install/deploy on Linux systems (=rpm and 
deb)


-A development process in which the tools push the code out to a 
targeted infrastracture even for test runs


Hadoop and friends are part of this -they are a very interesting 
foundation- but they are only part of the story.


Also this stack is best running on LAN network with high speed
interlinks. Historically the "Cloud" is composed of WAN links. An
implication of Cloud Computing is that different services would be
running in different geographical locations which is not how hadoop is
normally deployed.

I believe 'Apache Grid Stack' would be more fitting.

http://en.wikipedia.org/wiki/Grid_computing

Grid computing (or the use of computational grids) is the application
of several computers to a single problem at the same time — usually to
a scientific or technical problem that requires a great number of
computer processing cycles or access to large amounts of data.


Classic Grid computing -OGSI/OGSA- is something I want to steer clear 
of. Historically, you end up in WS-* and computer management politics. 
Furthermore, OGSA never had a good use case except "rewrite your apps 
for the cloud and they will be better". They (let's be fair, we) also 
focused too much on CPU scheduling, not on storage.



Grid computing via the Wikipedia definition describes exactly what
hadoop does. Without amazon S3 and EC2 hadoop does not fit well into a
'cloud computing' IMHO


To be precise: without a dynamic infrastructure provider that is more 
than just AWS: it could be Sun/Oracle, IBM/google, HP/Intel/Yahoo!, it 
could be your ops team and Eucalyptus.


The other hardware/service vendors are working on this infrastructure. 
Apache doesn't work at that level, but if we provide the code to run on 
all of them, we give users independence from any particular 
infrastructure provider.


Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Steve Loughran

Bradford Stephens wrote:

Hey all,

I'm going to be speaking at OSCON about my company's experiences with
Hadoop and Friends, but I'm having a hard time coming up with a name
for the entire software ecosystem. I'm thinking of calling it the
"Apache CloudStack". Does this sound legit to you all? :) Is there
something more 'official'?


We've been using "Apache Cloud Computing Edition" for this, to emphasise 
this is the successor to Java Enterprise Edition, and that it is cross 
language and being built at apache. If you use the same term, even if 
you put a different stack outline than us, it gives the idea more 
legitimacy.


The slides that Andrew linked to are all in SVN under
http://svn.apache.org/repos/asf/labs/clouds/

we have a space in the apache labs for "apache clouds", where we want to 
do more work integrating things, and bringing the idea of deploy and 
test on someone else's infrastructure mainstream across all the apache 
products. We would welcome your involvement -and if you send a draft of 
your slides out, will happily review them


-steve


Re: I need help

2009-04-30 Thread Steve Loughran

Razen Alharbi wrote:

Thanks everybody,

The issue was that Hadoop writes all the output to stderr instead of stdout
and I don't know why. I would really love to know why the usual Hadoop job
progress is written to stderr.


because there is a line in log4j.properties telling it to do just that?

log4j.appender.console.target=System.err
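
If you want the progress lines on stdout instead, changing that one line in 
conf/log4j.properties should do it (untested here, but it is the obvious knob):

log4j.appender.console.target=System.out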


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Can i make a node just an HDFS client to put/get data into hadoop

2009-04-29 Thread Steve Loughran

Usman Waheed wrote:

Hi All,

Is it possible to make a node just a hadoop client so that it can 
put/get files into HDFS but not act as a namenode or datanode?
I already have a master node and 3 datanodes but need to execute 
puts/gets into hadoop in parallel using more than just one machine other 
than the master.




Anything on the LAN can be a client of the filesystem, you just need 
appropriate hadoop configuration files to talk to the namenode and job 
tracker. I don't know how well the (custom) IPC works over long 
distances, and you have to keep the versions in sync for everything to 
work reliably.


Re: programming java ee and hadoop at the same time

2009-04-29 Thread Steve Loughran

Bill Habermaas wrote:
George, 


I haven't used the Hadoop perspective in Eclipse so I can't help with that
specifically but map/reduce is a batch process (and can be long running). In
my experience, I've written servlets that write to HDFS and then have a
background process perform the map/reduce. They can both run in background
under Eclipse but are not tightly coupled. 



I've discussed this recently, having a good binding from webapps to the 
back end, where the back end consists of HDFS, MapReduce queues, and 
things round them


https://svn.apache.org/repos/asf/labs/clouds/src/doc/apache_cloud_computing_edition_oxford.odp

If people are willing to help with this, we have an apache "lab" 
project, Apache Clouds, ready for you code, tests and ideas


Re: I need help

2009-04-28 Thread Steve Loughran

Razen Al Harbi wrote:

Hi all,

I am writing an application in which I create a forked process to execute a 
specific Map/Reduce job. The problem is that when I try to read the output 
stream of the forked process I get nothing and when I execute the same job 
manually it starts printing the output I am expecting. For clarification I will 
go through the simple code snippet:


Process p = rt.exec("hadoop jar GraphClean args");
BufferedReader reader = new BufferedReader(new 
InputStreamReader(p.getInputStream()));
String line = null;
check = true;
while(check){
line = reader.readLine();
if(line != null){// I know this will not finish it's only for testing.
System.out.println(line);
} 
}


If I run this code nothing shows up. But if execute the command (hadoop jar 
GraphClean args) from the command line it works fine. I am using hadoop 0.19.0.



Why not just invoke the Hadoop job submission calls yourself, no need to 
exec anything?


Look at org.apache.hadoop.util.RunJar to see what you need to do.

Avoid calling RunJar.main() directly as
 - it calls System.exit() when it wants to exit with an error
 - it adds shutdown hooks
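
A rough sketch of doing the submission in-process with the classic 
(0.19/0.20) mapred API instead; GraphCleanDriver and the input/output 
arguments are placeholders for whatever the real job jar contains:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Sketch: submit the job from your own JVM instead of exec-ing "hadoop jar".
public class GraphCleanDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(GraphCleanDriver.class);
    conf.setJobName("graph-clean");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // mapper/reducer/format setup elided; the same calls the job jar makes

    RunningJob job = JobClient.runJob(conf);  // blocks until done, logging progress
    System.out.println("job successful: " + job.isSuccessful());
  }
}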

-steve


Re: Storing data-node content to other machine

2009-04-28 Thread Steve Loughran

Vishal Ghawate wrote:

Hi,
I want to store the contents of all the client machines (datanodes) of the 
Hadoop cluster on a centralized machine with high storage capacity, so that 
the tasktracker runs on the client machine but the contents are stored on the 
centralized machine.
Can anybody help me with this please?



set the datanode to point to the (mounted) filesystem with the 
dfs.data.dir parameter.
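
For example, in hadoop-site.xml on each datanode (the mount point below is 
just a placeholder for wherever the central storage is mounted):

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/central-storage/hdfs/data</value>
</property>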




Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question

2009-04-28 Thread Steve Loughran

Aaron Kimball wrote:

I'm not aware of any documentation about this particular use case for
Hadoop. I think your best bet is to look into the JNI documentation about
loading native libraries, and go from there.
- Aaron


You could also try

1. Starting the main processing app as a process on the machines -and 
leave it running-


2. have your mapper (somehow) talk to that running process, passing in 
parameters (including local filesystem filenames) to read and write.


You can use RMI or other IPC mechanisms to talk to the long-lived process.



Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Steve Loughran

Stas Oskin wrote:

Hi.

2009/4/23 Matt Massie 


Just for clarity: are you using any type of virtualization (e.g. vmware,
xen) or just running the DataNode java process on the same machine?

What is "fs.default.name" set to in your hadoop-site.xml?




 This machine has OpenVZ installed indeed, but all the applications run
within the host node, meaning all Java processes are running within the same
machine.


Maybe, but there will still be at least one virtual network adapter on 
the host. Try turning them off.




The fs.default.name is:
hdfs://192.168.253.20:8020


what happens if you switch to hostnames over IP addresses?


Re: No route to host prevents from storing files to HDFS

2009-04-22 Thread Steve Loughran

Stas Oskin wrote:

Hi again.

Other tools, like balancer, or the web browsing from namenode, don't work as
well.
This because other nodes complain about not reaching the offending node as
well.

I even tried netcat'ing the IP/port from another node - and it successfully
connected.

Any advice on this "No route to host" error?


"No route to host" generally means machines have routing problems. 
Machine A doesnt know how to route packets to Machine B. Reboot 
everything, router first, see if it goes away. Otherwise, now is the 
time to learn to debug routing problems. traceroute is the best starting 
place


Re: Error reading task output

2009-04-21 Thread Steve Loughran

Aaron Kimball wrote:

Cam,

This isn't Hadoop-specific, it's how Linux treats its network configuration.
If you look at /etc/host.conf, you'll probably see a line that says "order
hosts, bind" -- this is telling Linux's DNS resolution library to first read
your /etc/hosts file, then check an external DNS server.

You could probably disable local hostfile checking, but that means that
every time a program on your system queries the authoritative hostname for
"localhost", it'll go out to the network. You'll probably see a big
performance hit. The better solution, I think, is to get your nodes'
/etc/hosts files squared away.


I agree


You only need to do so once :)


No, you need to detect whenever the Linux networking stack has decided 
to add new entries to resolv.conf  or /etc/hosts and detect when they 
are inappropriate. Which is a tricky thing to do as there are some cases 
where you may actually be grateful that someone in the debian codebase 
decided that adding the local hostname as 127.0.0.1 is actually a 
feature. I ended up writing a new SmartFrog component that can be 
configured to fail to start if the network is a mess, which is something 
worth pushing out.


as part of hadoop diagnostics, this test would be one of the things to 
deal with and at least warn on. "your hostname is local, you will not be 
visible over the network".


-steve



Re: getting DiskErrorException during map

2009-04-21 Thread Steve Loughran

Jim Twensky wrote:

Yes, here is how it looks:


<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}</value>
</property>


so I don't know why it still writes to /tmp. As a temporary workaround, I
created a symbolic link from /tmp/hadoop-jim to /scratch/...
and it works fine now but if you think this might be a considered as a bug,
I can report it.


I've encountered this somewhere too; could be something is using the 
java temp file API, which is not what you want. Try setting 
java.io.tmpdir to /scratch/local/tmp just to see if that makes it go away
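
If it's the task JVMs doing it, one way to push that property down is via the child 
JVM options -a sketch for hadoop-site.xml, the heap size being whatever you normally 
use:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Djava.io.tmpdir=/scratch/local/tmp</value>
</property>

The directory probably needs to exist on every node already; nothing is going to 
create it for you.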





Re: Error reading task output

2009-04-21 Thread Steve Loughran

Cam Macdonell wrote:



Well, for future googlers, I'll answer my own post.  Watch out for the 
hostname at the end of "localhost" lines on slaves.  One of my slaves 
was registering itself as "localhost.localdomain" with the jobtracker.


Is there a way that Hadoop could be made to not be so dependent on 
/etc/hosts, but on more dynamic hostname resolution?




DNS is trouble in Java; there are some (outstanding) bugreps/hadoop 
patches on the topic, mostly showing up on a machine of mine with a bad 
hosts entry. I also encountered some fun last month with ubuntu linux 
adding the local hostname to /etc/hosts along the 127.0.0.1 entry, which 
is precisely what you dont want for a cluster of vms with no DNS at all. 
This sounds like your problem too, in which case I have shared your pain


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52



Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-21 Thread Steve Loughran

Andrew Newman wrote:

They are comparing an indexed system with one that isn't.  Why is
Hadoop faster at loading than the others?  Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does.  Who want to write a paper for next year, "grep vs reverse
index"?

2009/4/15 Guilherme Germoglio :

(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/



I think it is interesting, though it misses the point that the reason 
that few datasets are >1PB today is nobody could afford to store or 
process the data. With Hadoop cost is somewhat high (learn to patch the 
source to fix your cluster's problems) but scales well with the #of 
nodes.  Commodity storage costs (my own home now has >2TB of storage) 
and commodity software costs are compatible.


Some other things to look at

-power efficiency. I actually think the DBs could come out better
-ease of writing applications by skilled developers. Pig vs SQL
-performance under different workloads (take a set of log files growing 
continually, mine it in near-real time. I think the last.fm use case 
would be a good one)



One of the great ironies of SQL is that most developers don't go near it, as 
it is a detail handled by the O/R mapping engine, except when building 
SQL selects for web pages. If Pig makes M/R easy, would it be used -and 
if so, does that show that we developers prefer procedural thinking?


-steve





Re: How many people is using Hadoop Streaming ?

2009-04-21 Thread Steve Loughran

Tim Wintle wrote:

On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:

  1) I can pick the language that offers a different programming
paradigm (e.g. I may choose functional language, or logic programming
if they suit the problem better).  In fact, I can even chosen Erlang
at the map() and Prolog at the reduce().  Mix and match can optimize
me more.


Agreed (as someone who has written mappers/reducers in Python, perl,
shell script and Scheme before).



sounds like a good argument for adding scripting support for in-JVM MR 
jobs; use the java6 scripting APIs and use any of the supported 
languages -java script out the box, other languages (jython, scala) with 
the right JARs.
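
The core of it is pretty small -a toy sketch with no Hadoop wiring at all, just 
the JSE 6 scripting API evaluating a "map" function written in JavaScript:

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptedMapSketch {
  public static void main(String[] args) throws Exception {
    ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
    // the "mapper" arrives as a script rather than a compiled class
    engine.eval("function map(line) { return line.toUpperCase(); }");
    Invocable invocable = (Invocable) engine;
    System.out.println(invocable.invokeFunction("map", "hello streaming"));
  }
}

The real work would be plumbing that into a Mapper implementation and getting any 
extra engine JARs onto the task classpath.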


Re: RPM spec file for 0.19.1

2009-04-21 Thread Steve Loughran

Ian Soboroff wrote:

Steve Loughran  writes:


I think from your perspective it makes sense as it stops anyone getting
itchy fingers and doing their own RPMs. 


Um, what's wrong with that?



It's reallly hard to do good RPM spec files. If cloudera are willing to 
pay Matt to do it, not only do I welcome it, but will see if I can help 
him with some of the automated test setup they'll need.


One thing which would be useful would be to package up all of the hadoop 
functional tests that need a live cluster up as its own JAR, so the test 
suite could be run against an RPM installation on different Virtual 
OS/JVM combos. I've just hit this problem with my own RPMs on Java6 
(java security related), so know that having the ability to use the 
entire existing test suite against an RPM installation would be be 
beneficial (both in my case and for hadoop RPMS)


Re: Amazon Elastic MapReduce

2009-04-03 Thread Steve Loughran

Brian Bockelman wrote:


On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:

seems like I should pay for additional money, so why not configure a 
hadoop

cluster in EC2 by myself. This already have been automatic using script.




Not everyone has a support team or an operations team or enough time to 
learn how to do it themselves.  You're basically paying for the fact 
that the only thing you need to know to use Hadoop is:

1) Be able to write the Java classes.
2) Press the "go" button on a webpage somewhere.

You could use Hadoop with little-to-zero systems knowledge (and without 
institutional support), which would always make some researchers happy.


Brian


True, but this way nobody gets the opportunity to learn how to do it 
themselves, which can be a tactical error one comes to regret further 
down the line. By learning the pain of cluster management today, you get 
to keep it under control as your data grows.


I am curious what bug patches AWS will supply, for they have been very 
silent on their hadoop work to date.


Re: Using HDFS to serve www requests

2009-04-03 Thread Steve Loughran

Snehal Nagmote wrote:

can you please explain exactly what adding an NIO bridge means and how it can be 
done, and what the advantages could be in this case?


NIO: java non-blocking IO. It's a standard API to talk to different 
filesystems; support has been discussed in jira. If the DFS APIs were 
accessible under an NIO front end, then applications written for the NIO 
APIs would work with the supported filesystems, with no need to code 
specifically for hadoop's not-yet-stable APIs














Re: RPM spec file for 0.19.1

2009-04-03 Thread Steve Loughran

Christophe Bisciglia wrote:

Hey Ian, we are totally fine with this - the only reason we didn't
contribute the SPEC file is that it is the output of our internal
build system, and we don't have the bandwidth to properly maintain
multiple RPMs.

That said, we chatted about this a bit today, and were wondering if
the community would like us to host RPMs for all releases in our
"devel" repository. We can't stand behind these from a reliability
angle the same way we can with our "blessed" RPMs, but it's a
manageable amount of additional work to have our build system spit
those out as well.



I think from your perspective it makes sense as it stops anyone getting 
itchy fingers and doing their own RPMs. At the same time, I think we do 
need to make it possible/easy to do RPMs *and have them consistent*. If 
hadoop-core makes RPMs that don't work with your settings rpms, you get 
to field to the support calls.


-steve


Re: RPM spec file for 0.19.1

2009-04-03 Thread Steve Loughran

Ian Soboroff wrote:

I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
with a spec file for building a 0.19.1 RPM.

I like the idea of Cloudera's RPM file very much.  In particular, it has
nifty /etc/init.d scripts and RPM is nice for managing updates.
However, it's for an older, patched version of Hadoop.

This spec file is actually just Cloudera's, with suitable edits.  The
spec file does not contain an explicit license... if Cloudera have
strong feelings about it, let me know and I'll pull the JIRA attachment.

The JIRA includes instructions on how to roll the RPMs yourself.  I
would have attached the SRPM but they're too big for JIRA.  I can offer
noarch RPMs build with this spec file if someone wants to host them.

Ian



-RPM and deb packaging would be nice

-the .spec file should be driven by ant properties to get dependencies 
from the ivy files
-the jdk requirements are too harsh as it should run on openjdk's JRE or 
jrockit; no need for sun only. Too bad the only way to say that is leave 
off all jdk dependencies.
-I worry about how they patch the rc.d files. I can see why, but wonder 
what that does with the RPM ownership


As someone whose software does get released as RPMs (and tar files 
containing everything needed to create your own), I can state with 
experience that RPMs are very hard to get right, and very hard to test. 
The hardest thing to get right (and to test) is live update of the RPMs 
while the app is running. I am happy for the cloudera team to have taken 
on this problem.


Re: Typical hardware configurations

2009-03-31 Thread Steve Loughran

Scott Carey wrote:

On 3/30/09 4:41 AM, "Steve Loughran"  wrote:


Ryan Rawson wrote:


You should also be getting 64-bit systems and running a 64 bit distro on it
and a jvm that has -d64 available.

For the namenode yes. For the others, you will take a fairly big memory
hit (1.5X object size) due to the longer pointers. JRockit has special
compressed pointers, so will JDK 7, apparently.



Sun Java 6 update 14 has "Ordinary Object Pointer" compression as well,
-XX:+UseCompressedOops.  I've been testing out the pre-release of that with
great success.


Nice. Have you tried Hadoop with it yet?



JRockit has virtually no 64 bit overhead up to 4GB, Sun Java 6u14 has small
overhead up to 32GB with the new compression scheme.  IBM's VM also has some
sort of pointer compression but I don't have experience with it myself.


I use the JRockit JVM as it is what our customers use and we need to 
test on the same JVM. It is interesting in that recursive calls don't 
ever seem to run out; the way it does stack doesn't have separate memory 
spaces for stack, permanent generation heap space and the like.



That doesn't mean apps are light: a freshly started IDE consumes more 
physical memory than a VMWare image running XP and outlook. But it is 
fairly responsive, which is good for a UI:

2295m 650m  22m S2 10.9   0:43.80 java
855m 543m 530m S   11  9.1   4:40.40 vmware-vmx




http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/tag/compressed-oops/
 
With pointer compression, there may be gains to be had with running 64 bit

JVMs smaller than 4GB on x86 since then the runtime has access to native 64
bit integer operations and registers (as well as 2x the register count).  It
will be highly use-case dependent.


that would certainly benefit atomic operations on longs; for floating 
point math it would be less useful as JVMs have long made use of the SSE 
register set for FP work. 64 bit registers would make it easier to move 
stuff in and out of those registers.


I will try and set up a hudson server with this update and see how well 
it behaves.


Re: Typical hardware configurations

2009-03-30 Thread Steve Loughran

Ryan Rawson wrote:


You should also be getting 64-bit systems and running a 64 bit distro on it
and a jvm that has -d64 available.


For the namenode yes. For the others, you will take a fairly big memory 
hit (1.5X object size) due to the longer pointers. JRockit has special 
compressed pointers, so will JDK 7, apparently.





Re: Using HDFS to serve www requests

2009-03-30 Thread Steve Loughran

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.

I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.



If someone adds an NIO bridge to hadoop filesystems then it would be 
easier; leaving you only with the performance issues.
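
For reference, the guts of serving a file with the plain FileSystem API from a 
servlet look something like this -a sketch, not the wiki example itself; the 
namenode address and request parameter are made up:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServlet extends HttpServlet {

  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:8020");  // assumed cluster address
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(new Path(request.getParameter("path")));
    OutputStream out = response.getOutputStream();
    try {
      IOUtils.copyBytes(in, out, conf, false);  // stream the blocks straight out
    } finally {
      in.close();
    }
  }
}

Every request pays the cost of a trip to the namenode and then to a datanode, 
which is where the performance issues come from.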


Re: virtualization with hadoop

2009-03-30 Thread Steve Loughran

Oliver Fischer wrote:

Hello Vishal,

I did the same some weeks ago. The most important fact is, that it
works. But it is horrible slow if you not have enough ram and multiple
disks since all I/o-Operations go to the same disk.


they may go to separate disks underneath, but performance is bad as what 
the virtual OS thinks is a raw hard disk could be a badly fragmented bit 
of storage on the container OS.


Memory is another point of conflict; your VMs will swap out or block 
other vms.


0. Keep different VM virtual disks on different physical disks. Fast 
disks at that.

1. pre-allocate your virtual disks
2. defragment at both the VM and host OS levels.
3. Crank back the schedulers so that the VMs aren't competing too much 
for CPU time. One core for the host OS, one for each VM.
4. You can keep an eye on performance by looking at the clocks of the 
various machines: if they pause and get jittery then they are being 
swapped out.


Using multiple VMs on a single host is OK for testing, but not for hard 
work. You can use VM images to do work, but you need to have enough 
physical cores and RAM to match that of the VMs.


-steve







Re: JNI and calling Hadoop jar files

2009-03-30 Thread Steve Loughran

jason hadoop wrote:

The exception reference to *org.apache.hadoop.hdfs.DistributedFileSystem*,
implies strongly that a hadoop-default.xml file, or at least a  job.xml file
is present.
Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the
assumption is that the core jar is available.
Given the class not found exception, the implication is that the
hadoop-0.X.Y-core.jar is not available to the JNI side.

Given the above constraints, the two likely possibilities are that the -core
jar is unavailable or damaged, or that somehow the classloader being used
does not have access to the -core  jar.

A possible reason for the jar not being available is that the application is
running on a different machine, or as a different user and the jar is not
actually present or perhaps readable in the expected location.





Which way is your JNI, java application calling into a native shared
library, or a native application calling into a jvm that it instantiates via
libjvm calls?

Could you dump the classpath that is in effect before your failing jni call?
System.getProperty( "java.class.path"), and for that matter,
"java.library.path", or getenv("CLASSPATH)
and provide an ls -l of the core.jar from the class path, run as the user
that owns the process, on the machine that the process is running on.



Or something bad is happening with a dependent library of the filesystem 
that is causing the reflection-based load to fail and die with the root 
cause being lost in the process. Sometimes putting an explicit reference 
to the class you are trying to load is a good way to force the problem 
to surface earlier, and fail with better error messages.
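
Something as dumb as this, run in the same JVM/classloader as the JNI layer, tells 
you quickly whether the -core JAR is really visible (the class name is the one from 
the stack trace):

public class HdfsClasspathCheck {
  public static void main(String[] args) {
    try {
      // explicitly load the class the reflection-based lookup is failing on
      Class.forName("org.apache.hadoop.hdfs.DistributedFileSystem");
      System.out.println("hadoop core is visible on: "
          + System.getProperty("java.class.path"));
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(
          "the hadoop core JAR is not visible to this classloader", e);
    } catch (NoClassDefFoundError e) {
      throw new IllegalStateException(
          "a dependency of DistributedFileSystem failed to load", e);
    }
  }
}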


Re: Coordination between Mapper tasks

2009-03-20 Thread Steve Loughran

Stuart White wrote:

The nodes in my cluster have 4 cores & 4 GB RAM.  So, I've set
mapred.tasktracker.map.tasks.maximum to 3 (leaving 1 core for
"breathing room").

My process requires a large dictionary of terms (~ 2GB when loaded
into RAM).  The terms are looked-up very frequently, so I want the
terms memory-resident.

So, the problem is, I want 3 processes (to utilize CPU), but each
process requires ~2GB, but my nodes don't have enough memory to each
have their own copy of the 2GB of data.  So, I need to somehow share
the 2GB between the processes.

What I have currently implemented is a standalone RMI service that,
during startup, loads the 2GB dictionaries.  My mappers are simply RMI
clients that call this RMI service.

This works just fine.  The only problem is that my standalone RMI
service is totally "outside" Hadoop.  I have to ssh onto each of the
nodes, start/stop/reconfigure the services manually, etc...


There's nothing wrong with doing this outside Hadoop, the only problem 
is that manual deployment is not the way forward.


1. some kind of javaspace system where you put facts into the t-space 
and let them all share it


2. (CofI warning), use something like SmartFrog's anubis tuplespace to 
bring up one -and one only- node with the dictionary application. This 
may be hard to get started, but it keeps availability high -the anubis 
nodes keep track of all other members of the cluster by some 
heartbeat/election protocol, and can handle failures of the dictionary 
node by automatically bringing up a new one


3. Roll your own multicast/voting protocol, so avoiding RMI. Something 
scatter/gather style is needed as part of the Apache Cloud computing 
product portfolio, so you could try implementing it -Doug Cutting will 
probably provide constructive feedback.


I haven't played with zookeeper enough to say whether it would work here

-steve


Re: using virtual slave machines

2009-03-12 Thread Steve Loughran

Karthikeyan V wrote:

There is no specific procedure for configuring virtual machine slaves.



make sure the following thing are done.



I've used these as the beginning of a page on this

http://wiki.apache.org/hadoop/VirtualCluster




Re: Extending ClusterMapReduceTestCase

2009-03-12 Thread Steve Loughran

jason hadoop wrote:

I am having trouble reproducing this one. It happened in a very specific
environment that pulled in an alternate sax parser.

The bottom line is that jetty expects a parser with particular capabilities
and if it doesn't get one, odd things happen.

In a day or so I will have hopefully worked out the details, but it has been
have a year since I dealt with this last.

Unless you are forking, to run your junit tests, ant won't let you change
the class path for your unit tests - much chaos will ensue.


Even if you fork, unless you set includeantruntime=false then you get 
Ant's classpath, as the junit test listeners are in the 
ant-optional-junit.jar and you'd better pull them in somehow.


I can see why AElfred would cause problems for jetty; they need to 
handle web.xml and suchlike, and probably validate them against the 
schema to reduce support calls.
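
For reference, the relevant knobs in a build.xml look roughly like this -a sketch; 
once includeantruntime is off, test.classpath has to pull in junit and Ant's 
junit-task JAR itself:

<junit fork="true" forkmode="once" includeantruntime="false">
  <classpath refid="test.classpath"/>
  <sysproperty key="javax.xml.parsers.SAXParserFactory"
               value="org.apache.xerces.jaxp.SAXParserFactoryImpl"/>
  <formatter type="plain"/>
  <batchtest todir="${test.build.dir}">
    <fileset dir="${test.src.dir}" includes="**/Test*.java"/>
  </batchtest>
</junit>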




Re: Persistent HDFS On EC2

2009-03-12 Thread Steve Loughran

Kris Jirapinyo wrote:

Why would you lose the locality of storage-per-machine if one EBS volume is
mounted to each machine instance?  When that machine goes down, you can just
restart the instance and re-mount the exact same volume.  I've tried this
idea before successfully on a 10 node cluster on EC2, and didn't see any
adverse performance effects--



I was thinking more of S3 FS, which is remote-ish and write times measurable


and actually amazon claims that EBS I/O should
be even better than the instance stores.  


Assuming the transient filesystems are virtual disks (and not physical 
disks that get scrubbed, formatted and mounted on every VM 
instantiation), and also assuming that EBS disks are on a SAN in the 
same datacentre, this is probably true. Disk IO performance in virtual 
disks is currently pretty slow as you are navigating through >1 
filesystem, and potentially seeking at lot, even something that appears 
unfragmented at the VM level





The only concerns I see are that

you need to pay for EBS storage regardless of whether you use that storage
or not.  So, if you have 10 EBS volumes of 1 TB each, and you're just
starting out with your cluster so you're using only 50GB on each EBS volume
so far for the month, you'd still have to pay for 10TB worth of EBS volumes,
and that could be a hefty price for each month.  Also, currently EBS needs
to be created in the same availability zone as your instances, so you need
to make sure that they are created correctly, as there is no direct
migration of EBS to different availability zones.


View EBS as renting space in SAN and it starts to make sense.


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Persistent HDFS On EC2

2009-03-11 Thread Steve Loughran

Malcolm Matalka wrote:

If this is not the correct place to ask Hadoop + EC2 questions please
let me know.

 


I am trying to get a handle on how to use Hadoop on EC2 before
committing any money to it.  My question is, how do I maintain a
persistent HDFS between restarts of instances.  Most of the tutorials I
have found involve the cluster being wiped once all the instances are
shut down but in my particular case I will be feeding output of a
previous days run as the input of the current days run and this data
will get large over time.  I see I can use s3 as the file system, would
I just create an EBS  volume for each instance?  What are my options?


 EBS would cost you more; you'd lose the locality of storage-per-machine.

If you stick the output of some runs back into S3 then the next jobs 
have no locality and higher startup overhead to pull the data down, but 
you dont pay for that download (just the time it takes).


Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread Steve Loughran

jason hadoop wrote:

The other goofy thing is that the  xml parser that is commonly first in the
class path, validates xml in a way that is opposite to what jetty wants.


What does ant -diagnostics say? It will list the XML parser at work



This line in the preamble before theClusterMapReduceTestCase setup takes
care of the xml errors.

System.setProperty("javax.xml.parsers.SAXParserFactory","org.apache.xerces.jaxp.SAXParserFactoryImpl");



possibly, though when Ant starts it with the classpath set up for junit 
runners, I'd expect the xml parser from the ant distro to get in there 
first, system properties notwithstandng


Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread Steve Loughran

Philip Zeyliger wrote:

Very naively looking at the stack traces, a common theme is that there's a
call out to "df" to find the system capacity.  If you see two data node
processes, perhaps the fork/exec to call out to "df" is failing in some
strange way.


that's deep into Java code. OpenJDK gives you more of that source. One 
option here is to consider some kind of timeouts in the exec, but it's 
pretty tricky to tack that on round the Java runtime APIs, because the 
process APIs weren't really designed to be interrupted by other threads.
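
Roughly what a bolted-on watchdog would look like -just a sketch of the idea, not 
anything from the Hadoop codebase:

import java.util.Timer;
import java.util.TimerTask;

public class TimedExec {

  /** Run a command, killing it if it has not exited within timeoutMillis. */
  public static int run(String[] command, long timeoutMillis) throws Exception {
    final Process process = new ProcessBuilder(command).start();
    Timer watchdog = new Timer(true);  // daemon thread
    watchdog.schedule(new TimerTask() {
      public void run() {
        process.destroy();  // kills the child, so waitFor() below returns
      }
    }, timeoutMillis);
    try {
      return process.waitFor();
    } finally {
      watchdog.cancel();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("df exit code: "
        + run(new String[] {"df", "-k", "/"}, 10000L));
  }
}

The child gets killed rather than interrupted, and the caller still has to decide 
what a timed-out df actually means for the datanode.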


-steve



"DataNode: [/hadoop-data/dfs/data]" daemon prio=10
tid=0x002ae2c0d400 nid=0x21cf in Object.wait()
[0x42c54000..0x42c54b30]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
- locked <0x002a9fd84f98> (a java.lang.UNIXProcess$Gate)
at java.lang.UNIXProcess.(UNIXProcess.java:145)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
- locked <0x002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
at 
org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
at java.lang.Thread.run(Thread.java:619)



On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury wrote:


On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem
fine with no network/memory problems, but in every instance I see two
DataNode processes running, and the NameNode logs indicate the datanode in
question simply stopped responding. This state persists until I come along
and kill the DataNode processes and restart the DataNode on that particular
machine.

I'm at a loss as to why this is happening, so here's all the relevant
information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom patches
running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
respectively

I managed to grab a stack dump via "kill -3" from two of these problem
instances and threw up the logs at
http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and having very
little Java developing experience the .out files mean nothing to me. It's
also worth mentioning that the NameNode logs at the time when these
DataNodes got stuck show nothing out of the ordinary either -- just the
expected "lost heartbeat from node " message. The DataNode daemon (the
original process, not the second mysterious one) continues to respond to web
requests like browsing the log directory during this time.

Whenever this happens I've just manually done a "kill -9" to remove the two
stuck DataNode processes (I'm not even sure why there's two of them, as
under normal operation there's only one). After killing the stuck ones, I
simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've
not seen any dataloss or corruption as a result of this problem.

Has anyone seen anything like this happen before? Out of our ~100 node
cluster I see this problem around once a day, and it seems to just strike
random nodes at random times. It happens often enough that I would be happy
to do additional debugging if anyone can tell me how. I'm not a developer at
all, so I'm at the end of my knowledge on how to solve this problem. Thanks
for any help!


===
Garhan Attebury
Systems Administrator
UNL Research Computing Facility
402-472-7761
===







--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: master trying fetch data from slave using "localhost" hostname :)

2009-03-09 Thread Steve Loughran

pavelkolo...@gmail.com wrote:
On Fri, 06 Mar 2009 14:41:57 -, jason hadoop 
 wrote:


I see that when the host name of the node is also on the localhost 
line in

/etc/hosts



I erased all records with "localhost" from all "/etc/hosts" files and 
all fine now :)

Thank you :)



what does /etc/host look like now?

I hit some problems with ubuntu and localhost last week; the hostname 
was set up in /etc/hosts not just to point to the loopback address, but 
to a different loopback address (127.0.1.1) from the normal value 
(127.0.0.1), so breaking everything.


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52


Re: Running 0.19.2 branch in production before release

2009-03-05 Thread Steve Loughran

Aaron Kimball wrote:

I recommend 0.18.3 for production use and avoid the 19 branch entirely. If
your priority is stability, then stay a full minor version behind, not just
a revision.


Of course, if everyone stays that far behind, they don't get to find the 
bugs for other people.


* If you play with the latest releases early, while they are in the beta 
phase -you will encounter the problems specific to your 
applications/datacentres, and get them fixed fast.


* If you work with stuff further back you get stability, but not only 
are you behind on features, you can't be sure that all "fixes" that 
matter to you get pushed back.


* If you plan on making changes, of adding features, get onto SVN_HEAD

* If you want to catch changes being made that break your site, 
SVN_HEAD. Better yet, have a private Hudson server checking out SVN_HEAD 
hadoop *then* building and testing your app against it.


Normally I work with stable releases of things I dont depend on, and 
SVN_HEAD of OSS stuff whose code I have any intent to change; there is a 
price -merge time, the odd change breaking your code- but you get to 
make changes that help you long term.


Where Hadoop is different is that it is a filesystem, and you don't want 
to hit bugs that delete files that matter. I'm only bringing up 
transient clusters on VMs, pulling in data from elsewhere, so this isn't 
an issue. All that remains is changing APIs.


-Steve


Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Steve Loughran
(Unknown Source)

at org.eclipse.core.internal.events.BuildManager$1.run(Unknown Source)

at org.eclipse.core.runtime.SafeRunner.run(Unknown Source)

at org.eclipse.core.internal.events.BuildManager.basicBuild(Unknown Source)

at org.eclipse.core.internal.events.BuildManager.basicBuild(Unknown Source)

at org.eclipse.core.internal.events.BuildManager.build(Unknown Source)

at org.eclipse.core.internal.resources.Project.internalBuild(Unknown Source)

at org.eclipse.core.internal.resources.Project.build(Unknown Source)

at org.eclipse.ui.actions.BuildAction.invokeOperation(Unknown Source)

at org.eclipse.ui.actions.WorkspaceAction.execute(Unknown Source)

at org.eclipse.ui.actions.WorkspaceAction$2.runInWorkspace(Unknown Source)

at org.eclipse.core.internal.resources.InternalWorkspaceJob.run(Unknown
Source)

at org.eclipse.core.internal.jobs.Worker.run(Unknown Source)

Caused by: *org.apache.commons.logging.LogConfigurationException*:
java.lang.NoClassDefFoundError: org.apache.log4j.Category (Caused by
java.lang.NoClassDefFoundError: org.apache.log4j.Category)

at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(Unknown
Source)

at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(Unknown
Source)

at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(Unknown
Source)

at org.apache.commons.logging.LogFactory.getLog(Unknown Source)

at org.apache.jasper.compiler.Compiler.(Unknown Source)

at java.lang.J9VMInternals.initializeImpl(*Native Method*)

... 54 more

Caused by: java.lang.NoClassDefFoundError: org.apache.log4j.Category

at java.lang.J9VMInternals.verifyImpl(*Native Method*)

at java.lang.J9VMInternals.verify(Unknown Source)

at java.lang.J9VMInternals.initialize(Unknown Source)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(*Native Method*)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)

at java.lang.reflect.Constructor.newInstance(Unknown Source)

... 60 more

Caused by: *java.lang.ClassNotFoundException*: org.apache.log4j.Category

at java.lang.Throwable.(Unknown Source)

at java.lang.ClassNotFoundException.(Unknown Source)

at
org.eclipse.osgi.framework.internal.core.BundleLoader.findClassInternal(Unknown
Source)

at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(Unknown
Source)

at org.eclipse.osgi.framework.internal.core.BundleLoader.findClass(Unknown
Source)

at
org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(Unknown
Source)

at java.lang.ClassLoader.loadClass(Unknown Source)

... 67 more


On Tue, Mar 3, 2009 at 1:16 PM, Steve Loughran  wrote:


Aviad sela wrote:


Nigel Thanks,
I have extracted the new project.

However, I am having problems building the project
I am using Eclipse 3.4
and ant 1.7

I recieve error compiling core classes
*

compile-core-classes*:

BUILD FAILED




*


D:\Work\AviadWork\workspace\cur\WSAD\Hadoop_Core_19_1\Hadoop\build.xml:302:
java.lang.ExceptionInInitializerError

*
it points to the the webxml tag



Try an ant -verbose and post the full log, we may be able to look at the
problem more. Also, run an




Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Steve Loughran

Aviad sela wrote:

Nigel Thanks,
I have extracted the new project.

However, I am having problems building the project
I am using Eclipse 3.4
and ant 1.7

I recieve error compiling core classes
*

compile-core-classes*:

BUILD FAILED




*

D:\Work\AviadWork\workspace\cur\WSAD\Hadoop_Core_19_1\Hadoop\build.xml:302:
java.lang.ExceptionInInitializerError

*
it points to the the webxml tag



Try an ant -verbose and post the full log, we may be able to look at the 
problem more. Also, run an


ant -diagnostics

and include what it prints

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: How does NVidia GPU compare to Hadoop/MapReduce

2009-03-02 Thread Steve Loughran

Dan Zinngrabe wrote:

On Fri, Feb 27, 2009 at 11:21 AM, Doug Cutting  wrote:

I think they're complementary.

Hadoop's MapReduce lets you run computations on up to thousands of computers
potentially processing petabytes of data.  It gets data from the grid to
your computation, reliably stores output back to the grid, and supports
grid-global computations (e.g., sorting).

CUDA can make computations on a single computer run faster by using its GPU.
 It does not handle co-ordination of multiple computers, e.g., the flow of
data in and out of a distributed filesystem, distributed reliability, global
computations, etc.

So you might use CUDA within mapreduce to more efficiently run
compute-intensive tasks over petabytes of data.

Doug


I actually did some work with this several months ago, using a
consumer-level NVIDIA card. I found a couple of interesting things:
- I used JOGL and OpenGL shaders rather than CUDA, as at least at the
time there was no reasonable way to talk to CUDA through java. That
made a number of things more complicated, CUDA certainly makes things
simpler. For the particular problem I was working with, GLSL was fine,
though CUDA would have simplified things.
- The problem set I was working with involved creating and searching
large amounts of hashes - 3-4 TB of them at a time.
- Only 2 of my nodes in an 8 node cluster had accelerators, but they
had a dramatic effect on performance. I do not have any of my test
results handy, but for this particular problem the accelerators cut
the job time in half or more.


that's interesting, as it means the power budget of the overall workload 
ought to be less




I would agree with Doug that the two are complimentary, though there
are some similarities. Working with the GPU means you are limited by
how much texture memory is available for storage (compared to HDFS,
not much!), and the cost of getting data on and off the card can be
high. Like many hadoop jobs, the overhead of getting data in and
starting a task can easily be greater than the length of the task
itself. For what I was doing, it was a good fit - but for many, many
problems it would not be the right solution.


Yes, and you will need more disk IO capacity per node if each node is 
capable of more computation, unless you have very CPU-intensive workloads




Re: HDFS architecture based on GFS?

2009-02-27 Thread Steve Loughran

kang_min82 wrote:
Hello Matei, 

Which Tasktracker did you mean here ? 


I don't understand that. In general we have many Tasktrackers and each of
them runs on one separate Datanode. Why doesn't the JobTracker talk directly
to the Namenode for a list of Datanodes and then performs the MapReduce
tasks there.



1. There's no requirement for a 1:1 mapping of task-trackers to 
datanodes. You could bring up TT's on any machine with spare CPU cycles 
on your network, talking to a long lived filesystem built from a few 
datanodes


2. There's no requirement for HDFS. You could have a cluster of 
MapReduce nodes talking to other filesystems. Locality of data helps, 
but is not needed.


3. Layering makes for cleaner code.


Re: the question about the common pc?

2009-02-23 Thread Steve Loughran

Tim Wintle wrote:

On Fri, 2009-02-20 at 13:07 +, Steve Loughran wrote:
I've been doing MapReduce work over small in-memory datasets 
using Erlang,  which works very well in such a context.


I've got some (mainly python) scripts (that will probably be run with
hadoop streaming eventually) that I run over multiple cpus/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload

something like


mkfifo mypipe1
mkfifo mypipe2
awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1&

  awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2&

./get_lots_of_data | tee mypipe1 > mypipe2


(wait until it's done... or send a signal from the "get_lots_of_data"
process on completion if it's a cronjob)


sort -m map_out* | ./reducer > reduce_out


works around the global interpreter lock in python quite nicely and
doesn't need people that write the scripts (who may not be programmers)
to understand multiple processes etc, just stdin and stdout.



Dumbo provides py support under Hadoop:
 http://wiki.github.com/klbostee/dumbo
 https://issues.apache.org/jira/browse/HADOOP-4304

as well as that, given Hadoop is java1.6+, there's no reason why it 
couldn't support the javax.script engine, with JavaScript working 
without extra JAR files, groovy and jython once their JARs were stuck on 
the classpath. Some work would probably be needed to make it easier to 
use these languages, and then there are the tests...


Re: How to use Hadoop API to submit job?

2009-02-20 Thread Steve Loughran

Wu Wei wrote:

Hi,

I used to submit Hadoop job with the utility RunJar.main() on hadoop 
0.18. On hadoop 0.19, because the commandLineConfig of JobClient was 
null, I got a NullPointerException error when RunJar.main() calls 
GenericOptionsParser to get libJars (0.18 didn't do this call). I also 
tried the class JobShell to submit job, but it catches all exceptions 
and sends them to stderr so that I can't handle the exceptions myself.


I noticed that if I can call JobClient's setCommandLineConfig method, 
everything goes easy. But this method has default package accessibility, 
so I can't see the method from outside the org.apache.hadoop.mapred package.


Any advices on using Java APIs to submit job?

Wei


Looking at my code, the line that does the work is

JobClient jc = new JobClient(jobConf);
runningJob = jc.submitJob(jobConf);

My full (LGPL) code is here : http://tinyurl.com/djk6vj

there's more work with validating input and output directories, pulling 
back the results, handling timeouts if the job doesnt complete, etc,etc, 
but that's feature creep


Re: the question about the common pc?

2009-02-20 Thread Steve Loughran

?? wrote:

Actually, there's a widespread misunderstanding of this "Common PC". Common PC 
doesn't mean PCs which are used daily; it means the performance of each node can 
be measured by a common PC's computing power.

As a matter of fact, we don't use Gb ethernet for daily PCs' communication, we 
don't use linux for our document processing, and most importantly, Hadoop cannot 
run effectively on those "daily PC"s.

Hadoop is designed for high performance computing equipment, but is "claimed" to 
be fit for "daily PC"s.

Hadoop for PCs? What a joke.


Hadoop is designed to build a high throughput data processing 
infrastructure from commodity PC parts: SATA not RAID or SAN, x86+linux 
not supercomputer hardware and OS. You can bring it up on lighter weight 
systems, but it has a minimum overhead that is quite steep for small 
datasets. I've been doing MapReduce work over small in-memory datasets 
using Erlang, which works very well in such a context.


-you need a good network, with DNS working (fast), good backbone and 
switches

-the faster your disks, the better your throughput
-ECC memory makes a lot of sense
-you need a good cluster management setup unless you like SSH-ing to 20 
boxes to find out which one is playing up


Re: GenericOptionsParser warning

2009-02-20 Thread Steve Loughran

Rasit OZDAS wrote:

Hi,
There is a JIRA issue about this problem, if I understand it correctly:
https://issues.apache.org/jira/browse/HADOOP-3743

Strange, that I searched all source code, but there exists only this control
in 2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
   "Applications should implement Tool for the same.");
}

Just an if block for logging, no extra controls.
Am I missing something?

If your class implements Tool, then there shouldn't be a warning.


OK, for my automated submission code I'll just set that switch and I 
won't get told off.
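
For anyone who wants the warning gone properly rather than suppressed, the 
Tool/ToolRunner boilerplate is small -a sketch with the mapper/reducer setup 
left out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyJob.class);
    conf.setJobName("my-job");
    // set mapper, reducer, input and output paths here
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner runs GenericOptionsParser over the arguments for you,
    // which is what sets mapred.used.genericoptionsparser
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}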


Re: GenericOptionsParser warning

2009-02-18 Thread Steve Loughran

Sandhya E wrote:

Hi All

I prepare my JobConf object in a java class, by calling various set
apis in JobConf object. When I submit the jobconf object using
JobClient.runJob(conf), I'm seeing the warning:
"Use GenericOptionsParser for parsing the arguments. Applications
should implement Tool for the same". From hadoop sources it looks like
setting "mapred.used.genericoptionsparser" will prevent this warning.
But if I set this flag to true, will it have some other side effects.

Thanks
Sandhya


Seen this message too -and it annoys me; not tracked it down


Re: HADOOP-2536 supports Oracle too?

2009-02-17 Thread Steve Loughran

sandhiya wrote:

Hi,
I'm using postgresql and the driver is not getting detected. How do you run
it in the first place? I just typed 


bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar

at the terminal without the quotes. I didnt copy any files from my local
drives into the Hadoop file system. I get an error like this :

java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.postgresql.Driver

and then the complete stack trace

Am i doing something wrong?
I downloaded a jar file for postgresql jdbc support and included it in my
Libraries folder (I'm using NetBeans).
please help




JDBC drivers need to be (somehow) loaded before you can resolve the 
relevant jdbc urls; somehow your code needs to call 
Class.forName("jdbcdrivername"), where  that string is set to the 
relevant jdbc driver classname
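
i.e. something along these lines before the job is submitted -a sketch; the 
connection URL and credentials are made up:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;

public class PostgresJobSetup {

  public static void setup(JobConf conf) throws ClassNotFoundException {
    // force the driver class to load in this JVM before any jdbc: URL is resolved
    Class.forName("org.postgresql.Driver");
    DBConfiguration.configureDB(conf,
        "org.postgresql.Driver",
        "jdbc:postgresql://dbhost:5432/mydb",  // made-up connection URL
        "dbuser", "dbpassword");               // made-up credentials
  }
}

The driver JAR still has to be on the classpath of whichever JVM actually opens 
the connection, which for DBInputFormat means the task JVMs as well as the client.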


Re: HDFS architecture based on GFS?

2009-02-17 Thread Steve Loughran

Amr Awadallah wrote:

I didn't understand usage of "malicuous" here,
but any process using HDFS api should first ask NameNode where the


Rasit,

Matei is referring to the fact that a malicious piece of code can bypass the
Name Node and connect to any data node directly, or probe all data nodes for
that matter. There is no strong authentication for RPC at this layer of
HDFS, which is one of the current shortcomings that will be addressed in
hadoop 1.0.

-- amr


This shouldn't be a problem in a locked down datacentre. Where it is a 
risk is when you host on EC2 or other Vm-hosting service that doesn't 
set up private VPNs


Re: datanode not being started

2009-02-17 Thread Steve Loughran

Sandy wrote:



Since I last used this machine, Parallels Desktop was installed by the
admin. I am currently suspecting that somehow this is interfering with the
function of Hadoop  (though Java_HOME still seems to be ok). Has anyone had
any experience with this being a cause of interference?



It could have added >1 virtual network adapter, and hadoop is starting 
on the wrong adapter.


I dont think Hadoop handles this situation that well (yet), as you
 -need to be able to specify the adapter for every node
 -get rid of the notion of "I have a hostname" and move to "every 
network adapter has its own hostname"
I haven't done enough experiments to be sure. I do know that if you 
start a datanode with IP addresses for the filesystem, it works out the 
hostname and then complains if anyone tries to talk to it using the same 
ip address URL it booted with.


-steve


Re: stable version

2009-02-16 Thread Steve Loughran

Anum Ali wrote:

The parser problem is related to jar files , can be resolved not a bug.

Forwarding link for its solution


http://www.jroller.com/navanee/entry/unsupportedoperationexception_this_parser_does_not



this site is down; can't see it

It is a bug, because I view all operations problems as defects to be 
opened in the bug tracker, stack traces stuck in, the problem resolved. 
That's software or hardware -because that issue DB is your searchable 
history of what went wrong. Given that on my system I was seeing a 
ClassNotFoundException for loading FSConstants, there was no easy way to 
work out what went wrong, and it's cost me a couple of days' work.


furthermore, in the OSS world, every person who can't get your app to 
work is either going to walk away unhappy (=lost customer, lost 
developer and risk they compete with you), or they are going to get on 
the email list and ask questions, questions which may get answered, but 
it will cost them time.


Hence
* happyaxis.jsp: axis' diagnostics page, prints out useful stuff and 
warns if it knows it is unwell (and returns 500 error code so your 
monitoring tools can recognise this)
* ant -diagnostics: detailed look at your ant system including xml 
parser experiments.


Good open source tools have to be easy for people to get started with, 
and that means helpful error messages. If we left the code alone, 
knowing that the cause of a ClassNotFoundException was the fault of the 
user sticking the wrong XML parser on the classpath -and yet refusing to 
add the four lines of code needed to handle this- then we are letting 
down the users





On 2/13/09, Steve Loughran  wrote:

Anum Ali wrote:

 This only occurs in linux , in windows its  fine.

do a java -version for me, and an ant -diagnostics, stick both on the bugrep

https://issues.apache.org/jira/browse/HADOOP-5254

It may be that XInclude only went live in java1.6u5; I'm running a
JRockit JVM which predates that and I'm seeing it (linux again);

I will also try sticking xerces on the classpath to see what happens next





Re: stable version

2009-02-13 Thread Steve Loughran

Anum Ali wrote:

 This only occurs in linux , in windows its  fine.


do a java -version for me, and an ant -diagnostics, stick both on the bugrep

https://issues.apache.org/jira/browse/HADOOP-5254

It may be that XInclude only went live in java1.6u5; I'm running a 
JRockit JVM which predates that and I'm seeing it (linux again);


I will also try sticking xerces on the classpath to see what happens next


Re: stable version

2009-02-13 Thread Steve Loughran

Anum Ali wrote:

yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran  wrote:


Anum Ali wrote:


Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.



You are using Java6, correct?






I'm seeing this too, filed the bug

https://issues.apache.org/jira/browse/HADOOP-5254

Any stack traces you can add on will help. Probable cause is 
https://issues.apache.org/jira/browse/HADOOP-4944


Re: Namenode not listening for remote connections to port 9000

2009-02-13 Thread Steve Loughran

Michael Lynch wrote:

Hi,

As far as I can tell I've followed the setup instructions for a hadoop 
cluster to the letter,
but I find that the datanodes can't connect to the namenode on port 9000 
because it is only

listening for connections from localhost.

In my case, the namenode is called centos1, and the datanode is called 
centos2. They are

centos 5.1 servers with an unmodified sun java 6 runtime.


fs.default.name takes a URL to the filesystem, such as hdfs://centos1:9000/

If the machine is only binding to localhost, that may mean DNS fun. Try 
a fully qualified name instead
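
i.e. something like this in hadoop-site.xml on every node -a sketch; use whatever 
name actually resolves to centos1's external address rather than to 127.0.0.1:

<property>
  <name>fs.default.name</name>
  <value>hdfs://centos1.yourdomain.com:9000/</value>
</property>

and make sure "centos1" in /etc/hosts is not sitting on the 127.0.0.1 line.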


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran  wrote:


Anum Ali wrote:


Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.



You are using Java6, correct?





well, in that case something being passed down to setXIncludeAware may 
be picked up as invalid. More of a stack trace may help. Otherwise, now 
is your chance to learn your way around the hadoop codebase, and ensure 
that when the next version ships, your most pressing bugs have been fixed


Re: Best practices on spliltting an input line?

2009-02-12 Thread Steve Loughran

Stefan Podkowinski wrote:

I'm currently using OpenCSV which can be found at
http://opencsv.sourceforge.net/  but haven't done any performance
tests on it yet. In my case simply splitting strings would not work
anyways, since I need to handle quotes and separators within quoted
values, e.g. "a","a,b","c".


I've used it in the past; found it pretty reliable. Again, no perf 
tests, just reading in CSV files exported from spreadsheets
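
A trivial sketch of the library in use -parsing a single line; in a mapper you'd 
feed it the Text value the same way:

import java.io.StringReader;
import au.com.bytecode.opencsv.CSVReader;

public class CsvSplitSketch {
  public static void main(String[] args) throws Exception {
    // handles separators inside quoted values, e.g. "a","a,b","c"
    CSVReader reader = new CSVReader(new StringReader("\"a\",\"a,b\",\"c\""));
    String[] fields = reader.readNext();
    for (String field : fields) {
      System.out.println(field);  // prints a, then a,b, then c
    }
    reader.close();
  }
}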


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at
javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.




You are using Java6, correct?


Re: File Transfer Rates

2009-02-11 Thread Steve Loughran

Brian Bockelman wrote:
Just to toss out some numbers (and because our users are making 
interesting numbers right now)


Here's our external network router: 
http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets 



Here's the application-level transfer graph: 
http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska 



In a squeeze, we can move 20-50TB / day to/from other heterogenous 
sites.  Usually, we run out of free space before we can find the upper 
limit for a 24-hour period.


We use a protocol called GridFTP to move data back and forth between 
external (non-HDFS) clusters.  The other sites we transfer with use 
niche software you probably haven't heard of (Castor, DPM, and dCache) 
because, well, it's niche software.  I have no available data on 
HDFS<->S3 systems, but I'd again claim it's mostly a function of the 
amount of hardware you throw at it and the size of your network pipes.


There are currently 182 datanodes; 180 are "traditional" ones of <3TB 
and 2 are big honking RAID arrays of 40TB.  Transfers are load-balanced 
amongst ~ 7 GridFTP servers which each have 1Gbps connection.




GridFTP is optimised for high bandwidth network connections with 
negotiated packet size and multiple TCP connections, so when nagel's 
algorithm triggers backoff from a dropped packet, only a fraction of the 
transmission gets dropped. It is probably best-in-class for long haul 
transfers over the big university backbones where someone else pays for 
your traffic. You would be very hard pressed to get even close to that 
on any other protocol.


I have no data on S3 xfers other than hearsay
 * write time to S3 can be slow as it doesn't return until the data is 
persisted "somewhere". That's a better guarantee than a posix write 
operation.
 * you have to rely on other people on your rack not wanting all the 
traffic for themselves. That's an EC2 API issue: you don't get to 
request/buy bandwidth to/from S3


One thing to remember is that if you bring up a Hadoop cluster on any 
virtual server farm, disk IO is going to be way below physical IO rates. 
Even when the data is in HDFS, it will be slower to get at than 
dedicated high-RPM SCSI or SATA storage.


Re: anybody knows an apache-license-compatible impl of Integer.parseInt?

2009-02-11 Thread Steve Loughran

Zheng Shao wrote:

We need to implement a version of Integer.parseInt/atoi from byte[] instead of 
String to avoid the high cost of creating a String object.

I wanted to take the open jdk code but the license is GPL:
http://www.docjar.com/html/api/java/lang/Integer.java.html

Does anybody know an implementation that I can use for hive (apache license)?


I also need to do it for Byte, Short, Long, and Double. Just don't want to go 
over all the corner cases.


Use the Apache Harmony code
http://svn.apache.org/viewvc/harmony/enhanced/classlib/branches/java6/modules/
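
If you just want the shape of it, a hand-rolled version is short -a sketch which, 
as you say, skips the corner cases (no overflow check, no radix handling, no 
checking for a bare sign):

public final class ByteArrayNumbers {

  /** Parse a signed decimal int from buf[offset..offset+length). */
  public static int parseInt(byte[] buf, int offset, int length) {
    if (length <= 0) {
      throw new NumberFormatException("empty input");
    }
    int i = offset;
    int end = offset + length;
    boolean negative = buf[i] == '-';
    if (negative || buf[i] == '+') {
      i++;
    }
    int result = 0;
    for (; i < end; i++) {
      int digit = buf[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException("bad digit at position " + i);
      }
      result = result * 10 + digit;  // NB: silently overflows on huge values
    }
    return negative ? -result : result;
  }

  public static void main(String[] args) {
    byte[] row = "-12345".getBytes();
    System.out.println(parseInt(row, 0, row.length));  // prints -12345
  }
}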


Re: Backing up HDFS?

2009-02-11 Thread Steve Loughran

Allen Wittenauer wrote:

On 2/9/09 4:41 PM, "Amandeep Khurana"  wrote:

Why would you want to have another backup beyond HDFS? HDFS itself
replicates your data so if the reliability of the system shouldnt be a
concern (if at all it is)...


I'm reminded of a previous job where a site administrator refused to make
tape backups (despite our continual harassment and pointing out that he was
in violation of the contract) because he said RAID was "good enough".

Then the RAID controller failed. When we couldn't recover data "from the
other mirror" he was fired.  Not sure how they ever recovered, esp.
considering what the data was they lost.  Hopefully they had a paper trail.


hope that wasn't at SUNW, not given they do their own controllers

1. controller failure is lethal, especially if you don't notice for a while
2. some products -say, databases- didn't like live updates, so a trick 
evolved of taking off some of the RAID array and putting that to tape. 
Of course, then there's the problem of what happens there
3. Tape is still very power efficient; good for a bulk off-site store 
(or a local fire-safe)
4. Over at last.fm, they had an accidental rm / on their primary dataset. 
Fortunately they did apparently have another copy somewhere else, and 
now that hdfs has user ids, you can prevent anyone but the admin team 
from accidentally deleting everyone's data.


  1   2   3   >