Re: FYI, Large-scale graph computing at Google

2009-06-29 Thread Steve Loughran

Patterson, Josh wrote:

Steve,
I'm a little lost here; is this a replacement for M/R, or is it some new
code that sits on top of M/R and runs an iteration over some sort of
graph's vertices? My quick scan of Google's article didn't seem to yield
a distinction. Either way, I'd say for our data that a graph processing
lib for M/R would be interesting.



I'm thinking of graph algorithms that get implemented as MR jobs; work 
with HDFS, HBase, etc.


Re: FYI, Large-scale graph computing at Google

2009-06-29 Thread Steve Loughran

Edward J. Yoon wrote:

I just made a wiki page -- http://wiki.apache.org/hadoop/Hambrug --
Let's discuss about the graph computing framework named Hambrug.



OK, first question: why the name Hambrug? To me that's just Hamburg typed 
wrong, which is going to cause lots of confusion.


What about something more graphy, like Descartes?


Re: Hadoop0.20 - Class Not Found exception

2009-06-29 Thread Steve Loughran

Amandeep Khurana wrote:

I'm getting the following error while starting a MR job:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
oracle.jdbc.driver.OracleDriver
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at
org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

Interestingly, the relevant jar is bundled into the MR job jar and it's also
there in the $HADOOP_HOME/lib directory.

Exactly the same thing worked with 0.19. Not sure what changed, or what I
broke, to cause this error...


Could be a classloader hierarchy problem; the JDBC driver needs to be at the 
right level. Try preheating the driver by loading it in your own code first 
-then the jdbc: URLs might work- and take it out of the MR job JAR.
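
A rough sketch of the "preheat" idea -the driver class name comes from the
stack trace above; the class and everything around it are illustrative, not
a fixed recipe:

// Hypothetical driver-side preload; assumes the Oracle JAR is on the
// classpath at a level the rest of the job code can also see.
public class PreloadJdbcDriver {
  public static void main(String[] args) throws Exception {
    // Force the driver class to load and register with java.sql.DriverManager
    // before DBInputFormat.configure() asks for it.
    Class.forName("oracle.jdbc.driver.OracleDriver");
    // ... then set up and submit the MR job as usual.
  }
}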


Re: Hadoop 0.20.0, xml parsing related error

2009-06-25 Thread Steve Loughran

Ram Kulbak wrote:

Hi,
The exception is a result of having xerces in the classpath. To resolve,
make sure you are using Java 6 and set the following system property:

-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl


This can also be resolved by the Configuration class (line 1045) making sure
it loads the DocumentBuilderFactory bundled with the JVM and not a 'random'
classpath-dependent factory.
Hope this helps,
Ram


Lovely -I've noted this in the comments of the bugrep
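
For reference, a minimal sketch of doing the same thing programmatically
(equivalent to the -D flag Ram quotes; only useful if you control the JVM
startup code, which is an assumption):

public class ForceJvmXmlParser {
  public static void main(String[] args) {
    // Must run before anything creates a DocumentBuilderFactory.
    System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
        "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
    // ... start the Hadoop client or daemon from here.
  }
}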


Re: FYI, Large-scale graph computing at Google

2009-06-25 Thread Steve Loughran

Edward J. Yoon wrote:

What do you think about another new computation framework on HDFS?

On Mon, Jun 22, 2009 at 3:50 PM, Edward J. Yoon edwardy...@apache.org wrote:

http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
-- It sounds like Pregel is a computing framework based on dynamic
programming for graph operations. I guess maybe they removed the
file communications/intermediate files during iterations.

Anyway, What do you think?


I have a colleague (Paolo) who would be interested in adding a set of 
graph algorithms on top of the MR engine.


Re: FYI, Large-scale graph computing at Google

2009-06-25 Thread Steve Loughran

mike anderson wrote:

This would be really useful for my current projects. I'd be more than happy
to help out if needed.



well the first bit of code to play with then is this

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/citerank/

The standalone.xml file is the one you want to build and run with; the 
other would require you to check out and build two levels up, but gives 
you the ability to bring up local or remote clusters to test. Call 
run-local to run it locally, which should give you some stats like this:


 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Counters: 11
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:   File Systems
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Local bytes read=209445683448
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Local bytes written=173943642259
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool:   Map-Reduce Framework
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce input groups=9985124
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Combine output records=34
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map input records=24383448
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce output records=16494967
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map output bytes=1243216870
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map input bytes=1528854187
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Combine input records=4528655
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Map output records=41958636
 [java] 09/06/25 17:09:22 INFO citerank.CiteRankTool: Reduce input records=37430015


==
Exiting project citerank
==

BUILD SUCCESSFUL - at 25/06/09 17:09
Total time: 9 minutes 1 second

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Name Node HA (HADOOP-4539)

2009-06-22 Thread Steve Loughran

Andrew Wharton wrote:

https://issues.apache.org/jira/browse/HADOOP-4539

I am curious about the state of this fix. It is listed as
Incompatible, but is resolved and committed (according to the
comments). Is the backup name node going to make it into 0.21? Will it
remove the SPOF for HDFS? And if so, what is the proposed release
timeline for 0.21?




The way to deal with HA (which the BackupNode doesn't promise) is to get 
involved in developing and testing the leading-edge source tree.


The 0.21 cutoff is approaching, BackupNode is in there, but it needs a 
lot more tests. If you want to aid the development, helping to get more 
automated BackupNode tests in there (indeed, tests that simulate more 
complex NN failures, like a corrupt EditLog) would go a long way.


-steve


Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

jason hadoop wrote:

Yes.
Otherwise the file descriptors will flow away like water.
I also strongly suggest having at least 64k file descriptors as the open
file limit.

On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin stas.os...@gmail.com wrote:


Hi.

Thanks for the advice. So you advise explicitly closing each and every file
handle that I receive from HDFS?

Regards.


I must disagree somewhat

If you use FileSystem.get() to get your client filesystem class, then 
that is shared by all threads/classes that use it. Call close() on that 
and any other thread or class holding a reference is in trouble. You 
have to wait for the finalizers for them to get cleaned up.


If you use FileSystem.newInstance() -which came in fairly recently 
(0.20? 0.21?)- then you can call close() safely.
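
A rough sketch of the difference, assuming you are on a version that has
FileSystem.newInstance() at all:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsHandleOwnership {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    FileSystem shared = FileSystem.get(conf);       // cached, shared process-wide
    // shared.close();  // dangerous: every other holder of the cached instance breaks

    FileSystem mine = FileSystem.newInstance(conf); // private instance
    try {
      // do your own reads/writes with 'mine'
    } finally {
      mine.close();                                 // safe: nobody else holds it
    }
  }
}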


So: it depends on how you get your handle.

see: https://issues.apache.org/jira/browse/HADOOP-5933

Also: the too many open files problem can be caused in the NN -you need 
to set up the kernel to allow lots more file handles. Lots.




Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Scott Carey wrote:

Furthermore, if for some reason it is required to dispose of any objects after 
others are GC'd, weak references and a weak reference queue will perform 
significantly better in throughput and latency - orders of magnitude better - 
than finalizers.




Good point.

It would make sense for the FileSystem cache to be weakly referenced, so 
that on long-lived processes the client references get cleaned up 
without waiting for app termination.
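
Not Hadoop code, just a generic illustration of the weak-reference-plus-queue
pattern Scott describes, as an alternative to relying on finalizers:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class WeakCleanupSketch {
  public static void main(String[] args) throws InterruptedException {
    ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
    Object client = new Object();                     // stands in for a cached FS client
    WeakReference<Object> ref = new WeakReference<Object>(client, queue);

    client = null;   // drop the last strong reference
    System.gc();     // a hint only; the JVM decides when to collect

    // A housekeeping thread would normally poll the queue and release the
    // underlying resources once the reference has been enqueued.
    if (queue.remove(2000) == ref) {
      System.out.println("reference cleared; release sockets/file handles here");
    }
  }
}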


Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Raghu Angadi wrote:


Is this before 0.20.0? Assuming you have closed these streams, it is 
mostly https://issues.apache.org/jira/browse/HADOOP-4346


It is the JDK internal implementation that depends on GC to free up its 
cache of selectors. HADOOP-4346 avoids this by using hadoop's own cache.


yes, and it's that change that led to my stack traces :(

http://jira.smartfrog.org/jira/browse/SFOS-1208


Re: Too many open files error, which gets resolved after some time

2009-06-22 Thread Steve Loughran

Stas Oskin wrote:

Hi.

So what would be the recommended approach for the pre-0.20.x series?

To ensure each file is used only by one thread, and then it is safe to close
the handle in that thread?

Regards.


Good question -I'm not sure. For anything you get with 
FileSystem.get(), it's now dangerous to close, so try just setting the 
reference to null and hoping that GC will do the finalize() when needed.


Re: Running Hadoop/Hbase in a OSGi container

2009-06-12 Thread Steve Loughran

Ninad Raut wrote:

OSGi provides navigability to your components and creates a life cycle for
each of those components, viz. install, start, stop, un-deploy etc.
This is the reason why we are thinking of creating components using OSGi.
The problem we are facing is that our components use mapreduce and HDFS, and
the OSGi container cannot detect the hadoop mapred engine or HDFS.

I have searched through the net and it looks like people are working on, or
have achieved success in, running hadoop in an OSGi container.

Ninad



1. I am doing work on a simple lifecycle for the services 
(start/stop/ping), which is not OSGi (OSGi worries a lot about 
classloading and versioning); check out HADOOP-3628 for this.


2. You can run it under OSGi systems, such as the OSGi branch of 
SmartFrog : 
http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/branches/core-branch-osgi/, 
or under non-OSGi tools. Either way, these tools are left dealing with 
classloading and the like.


3. Any container is going to have to deal with the fact that bits of all 
the services call System.exit(); the usual fix is to run under a 
security manager that traps the call and raises an exception instead 
(see the sketch after this list).


4. Any container then has to deal with the fact that from 0.20 onwards, 
Hadoop does things with security policy that are incompatible with 
normal Java security managers: whatever security manager you have for 
trapping system exits can't extend the default one.


5. Any container also has to deal with the fact that every service 
(namenode, job tracker, etc.) makes a lot of assumptions about 
singletons -that it has exclusive use of filesystem objects retrieved 
through FileSystem.get(), and the like. While OSGi can handle that with 
its classloading work, it's still fairly complex.


6. There are also lots of JVM memory/thread management issues, see the 
various Hadoop bugs
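
As promised in point 3, a hedged sketch of a System.exit()-trapping security
manager; note that point 4 means this clashes with Hadoop 0.20+'s own security
policy work, so treat it purely as an illustration:

public class NoExitSecurityManager extends SecurityManager {
  public void checkExit(int status) {
    // Turn the exit attempt into something the container can catch.
    throw new SecurityException("System.exit(" + status + ") trapped");
  }
  public void checkPermission(java.security.Permission perm) {
    // Deliberately permissive in this sketch: allow everything else.
  }
  public static void install() {
    System.setSecurityManager(new NoExitSecurityManager());
  }
}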


If you look at the slides of what I've been up to, you can see that it 
can be done

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/hadoop/doc/dynamic_hadoop_clusters.ppt

However,
 * you really need to run every service in its own process, for memory 
and reliability alone

 * It's pretty leading edge
 * You will have to invest the time and effort to get it working

If you want to do the work, start with what I've been doing and bring it up 
under the OSGi container of your choice. You can come and play with our 
tooling: I'm cutting a release today of this week's Hadoop trunk merged 
with my branch. It is of course experimental, as even the trunk is a bit 
up-and-down on feature stability.


-steve



Re: Multiple NIC Cards

2009-06-10 Thread Steve Loughran

John Martyniak wrote:
Does hadoop cache the server names anywhere?  Because I changed to 
using DNS for name resolution, but when I go to the nodes view, it is 
trying to view with the old name.  And I changed the hadoop-site.xml 
file so that it no longer has any of those values.




In SVN head, we try to get Java to tell us what is going on:
http://svn.apache.org/viewvc/hadoop/core/trunk/src/core/org/apache/hadoop/net/DNS.java

This uses InetAddress.getLocalHost().getCanonicalHostName() to get the 
value, which is cached for the life of the process. I don't know of anything 
else, but wouldn't be surprised -the Namenode has to remember the 
machines where stuff was stored.
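
A trivial way to see what that call returns on a given box (the same value
Hadoop will then cache):

import java.net.InetAddress;

public class WhoAmI {
  public static void main(String[] args) throws Exception {
    InetAddress local = InetAddress.getLocalHost();
    System.out.println(local.getCanonicalHostName() + " -> " + local.getHostAddress());
  }
}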





Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:

David,

For the Option #1.

I just changed the names to the IP Addresses, and it still comes up as 
the external name and ip address in the log files, and on the job 
tracker screen.


So option 1 is a no go.

When I change the dfs.datanode.dns.interface values it doesn't seem to 
do anything. When I was searching archived mail, this seemed to be the 
approach to change the NIC card being used for resolution. But when I 
change it nothing happens; I even put in bogus values and still no issues.


-John



I've been having similar but different fun with Hadoop-on-VMs; there's a 
lot of assumption in the code that DNS and rDNS all work consistently. 
Do you have separate internal and external hostnames? In which case, can 
you bring up the job tracker as jt.internal, the namenode as nn.internal 
(so the full HDFS URL is something like hdfs://nn.internal/), etc., etc.?


Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:
My original names were huey-direct and duey-direct, both names in the 
/etc/hosts file on both machines.


Are nn.internal and jt.internal special names?


No, just examples on a multihost network where your external names 
could be something completely different.


What does /sbin/ifconfig say on each of the hosts?



Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:

I am running Mac OS X.

So en0 points to the external address and en1 points to the internal 
address on both machines.


Here is the internal results from duey:
en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> 
mtu 1500

inet6 fe80::21e:52ff:fef4:65%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.102 netmask 0xff00 broadcast 192.168.1.255
ether 00:1e:52:f4:00:65
media: autoselect (1000baseT full-duplex) status: active



lladdr 00:23:32:ff:fe:1a:20:66
media: autoselect full-duplex status: inactive
supported media: autoselect full-duplex

Here are the internal results from huey:
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet6 fe80::21e:52ff:fef3:f489%en1 prefixlen 64 scopeid 0x5
inet 192.168.1.103 netmask 0xff00 broadcast 192.168.1.255


what does
  nslookup 192.168.1.103
and
  nslookup 192.168.1.102
say?

There really ought to be different names for them.


 I have some other applications running on these machines, that
 communicate across the internal network and they work perfectly.

I admire their strength. Multihost systems cause us trouble; that, and 
machines that don't quite know who they are:

http://jira.smartfrog.org/jira/browse/SFOS-5
https://issues.apache.org/jira/browse/HADOOP-3612
https://issues.apache.org/jira/browse/HADOOP-3426
https://issues.apache.org/jira/browse/HADOOP-3613
https://issues.apache.org/jira/browse/HADOOP-5339

One thing to consider is that some of the various services of Hadoop are 
bound to 0.0.0.0, which means every IPv4 address. You really want to 
bring up everything, including the Jetty services, on the internal (en1) 
network adapter, by binding them to 192.168.1.102; this will cause anyone 
trying to talk to them over the other network to fail, which at least 
finds the problem sooner rather than later.




Re: Multiple NIC Cards

2009-06-09 Thread Steve Loughran

John Martyniak wrote:
When I run either of those on either of the two machines, it is trying 
to resolve against the DNS servers configured for the external addresses 
for the box.


Here is the result
Server:xxx.xxx.xxx.69
Address:xxx.xxx.xxx.69#53


OK. In an ideal world, each NIC has a different hostname. Now, that 
confuses code that assumes a host has exactly one hostname, not zero or 
two, and I'm not sure how well Hadoop handles the 2+ situation (I know 
it doesn't like 0, but hey, it's a distributed application). With 
separate hostnames, you set hadoop up to work on the inner addresses, 
and give out the inner hostnames of the jobtracker and namenode. As a 
result, all traffic to the master nodes should be routed over the internal 
network.
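
A minimal sketch of what that looks like on the client side, assuming the
internal names resolve on every node; the hostnames and ports here are made
up for illustration, and you'd normally put the equivalent properties in
hadoop-site.xml instead:

import org.apache.hadoop.conf.Configuration;

public class InternalAddresses {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://nn.internal:8020");  // namenode on the inner name
    conf.set("mapred.job.tracker", "jt.internal:9001");      // jobtracker on the inner name
    // hand this Configuration to FileSystem.get()/JobClient and all traffic
    // to the masters goes over the internal network
  }
}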


Re: Hadoop scheduling question

2009-06-08 Thread Steve Loughran

Aaron Kimball wrote:


Finally, there's a third scheduler called the Capacity scheduler. It's
similar to the fair scheduler, in that it allows guarantees of minimum
availability for different pools. I don't know how it apportions additional
extra resources though -- this is the one I'm least familiar with. Someone
else will have to chime in here.





There's a dynamic priority scheduler in the patch queue that I've 
promised to commit this week. It's the one with a notion of currency: you 
pay for your work/priority. At peak times, work costs more.

https://issues.apache.org/jira/browse/HADOOP-4768


Re: Every time the mapping phase finishes I see this

2009-06-08 Thread Steve Loughran

Mayuran Yogarajah wrote:
There are always a few 'Failed/Killed Task Attempts' and when I view the 
logs for

these I see:

- some that are empty, ie stdout/stderr/syslog logs are all blank
- several that say:

2009-06-06 20:47:15,309 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child

java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:195)
at org.apache.hadoop.dfs.DFSClient.access$600(DFSClient.java:59)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1359)

at java.io.FilterInputStream.close(FilterInputStream.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:103) 

at 
org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:301)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:173) 


at org.apache.hadoop.mapred.MapTask.run(MapTask.java:231)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)




Any idea why this happens? I don't understand why I'd be seeing these 
only as the mappers get to 100%.


I've seen this when something in the same process got a FileSystem reference 
via FileSystem.get() and then called close() on it -that closes the client 
for every thread/class holding a reference to the same object.



We're planning on adding more diagnostics, by tracking who closed the 
filesystem

https://issues.apache.org/jira/browse/HADOOP-5933


Re: Monitoring hadoop?

2009-06-07 Thread Steve Loughran

Matt Massie wrote:

Anthony-

The ganglia web site is at http://ganglia.info/ with documentation in a 
wiki at http://ganglia.wiki.sourceforge.net/.  There is also a good wiki 
page at IBM as well 
http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia .  
Ganglia packages are available for most distributions to help with 
installation so make sure to grep for ganglia with your favorite package 
manager (e.g. aptitude, yum, etc).  Ganglia will give you more 
information about your cluster than just Hadoop metrics.  You'll get 
CPU, load, memory, disk and network  monitoring as well for free.


You can see live demos of ganglia at http://ganglia.info/?page_id=69.

Good luck.

-Matt


Out of modesty, Matt neglects to mention that Ganglia is one of his 
projects, so not only does it work well with Hadoop today, I would 
expect the integration to only get better over time.


Anthony -don't forget to feed those stats back into your DFS for later 
analysis...


Re: Hadoop ReInitialization.

2009-06-03 Thread Steve Loughran

b wrote:

But after formatting and starting DFS I need to wait some time (sleep 
60) before putting data into HDFS. Otherwise I will receive a 
NotReplicatedYetException.


That means the namenode is up but there aren't enough datanodes reporting in yet.


Re: question about when shuffle/sort start working

2009-06-01 Thread Steve Loughran

Todd Lipcon wrote:

Hi Jianmin,

This is not (currently) supported by Hadoop (or Google's MapReduce either
afaik). What you're looking for sounds like something more like Microsoft's
Dryad.

One thing that is supported in versions of Hadoop after 0.19 is JVM reuse.
If you enable this feature, task trackers will persist JVMs between jobs.
You can then persist some state in static variables.

I'd caution you, however, from making too much use of this fact as anything
but an optimization. The reason that Hadoop is limited to MR (or M+RM* as
you said) is that simplicity and reliability often go hand in hand. If you
start maintaining important state in RAM on the tasktracker JVMs, and one of
them goes down, you may need to restart your entire job sequence from the
top. In typical MapReduce, you may need to rerun a mapper or a reducer, but
the state is all on disk ready to go.

-Todd




I thought the question was not necessarily one of maintaining state, 
but of chaining the output from one job into another, where the number of 
iterations depends on the outcome of the previous set. Funnily enough, 
this is what you (apparently) end up having to do when implementing 
PageRank-like ranking as MR jobs:

http://skillsmatter.com/podcast/cloud-grid/having-fun-with-pagerank-and-mapreduce
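
A rough sketch of the kind of driver-side loop this implies -rerun the job
until some convergence counter stops moving. Class names and the counter are
made up; wiring iteration N's output into iteration N+1's input is elided:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class IterativeRankDriver {
  enum RankCounters { NODES_CHANGED }   // hypothetical convergence counter

  public static void main(String[] args) throws Exception {
    long changed = Long.MAX_VALUE;
    int iteration = 0;
    while (changed > 0 && iteration < 50) {      // cap the number of passes
      JobConf conf = new JobConf(IterativeRankDriver.class);
      conf.setJobName("rank-iteration-" + iteration);
      // ... set mapper/reducer; input path = previous iteration's output ...
      RunningJob job = JobClient.runJob(conf);   // blocks until the job finishes
      Counters counters = job.getCounters();
      changed = counters.getCounter(RankCounters.NODES_CHANGED);
      iteration++;
    }
  }
}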


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-29 Thread Steve Loughran

ashish pareek wrote:

Yes, I am able to ping and ssh between the two virtual machines, and I have
even set the IP addresses of both virtual machines in their respective
/etc/hosts files ...

Thanks for the reply ... if you could suggest some other thing which I could
have missed, or any remedy ...

Regards,
Ashish Pareek.


VMs? VMWare? Xen? Something else?

I've encountered problems on virtual networks where the machines aren't 
locatable via DNS and can't be sure who they say they are.


1. start the machines individually, instead of the start-all script that 
needs to have SSH working too.

2. check with netstat -a to see what ports/interfaces they are listening on

-steve



Re: hadoop hardware configuration

2009-05-28 Thread Steve Loughran

Patrick Angeles wrote:

Sorry for cross-posting, I realized I sent the following to the hbase list
when it's really more a Hadoop question.


This is an interesting question. Obviously, as an HP employee, you must 
assume that I'm biased when I say HP DL160 servers are good value for 
the workers, though our blade systems are very good for a high physical 
density -provided you have the power to fill up the rack.




2 x Hadoop Master (and Secondary NameNode)

   - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
   - 16GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - Hardware RAID controller
   - Redundant Power Supply
   - Approx. 390W power draw (1.9amps 208V)
   - Approx. $4000 per unit


I do not know what the advantages of that many cores are on a NN. 
Someone needs to do some experiments. I do know you need enough RAM to 
hold the index in memory, and you may want to go for a bigger block size 
to keep the index size down.




6 x Hadoop Task Nodes

   - 1 x 2.3Ghz Quad Core (Opteron 1356)
   - 8GB DDR2-800 Registered ECC Memory
   - 4 x 1TB 7200rpm SATA II Drives
   - No RAID (JBOD)
   - Non-Redundant Power Supply
   - Approx. 210W power draw (1.0amps 208V)
   - Approx. $2000 per unit

I had some specific questions regarding this configuration...




   1. Is hardware RAID necessary for the master node?



You need a good story to ensure that loss of a disk on the master 
doesn't lose the filesystem. I like RAID there, but the alternative is 
to push the data out over the network to other storage you trust. That 
could be NFS-mounted RAID storage, or it could be NFS-mounted JBOD. 
Whatever your chosen design, test that it works before you go live: run 
the cluster, then simulate different failures and see how well the 
hardware/ops team handles them.


Keep an eye on where that data goes, because when the NN runs out of 
file storage the consequences can be pretty dramatic (i.e. the cluster 
doesn't come up unless you edit the editlog by hand).



   2. What is a good processor-to-storage ratio for a task node with 4TB of
   raw storage? (The config above has 1 core per 1TB of raw storage.)


That really depends on the work you are doing...the bytes in/out to CPU 
work, and the size of any memory structures that are built up over the run.


With 1 core per physical disk, you get the bandwidth of a single disk 
per CPU; for some IO-intensive work you can make the case for two 
disks/CPU -one in, one out- but then you are using more power, and 
if/when you want to add more storage, you have to pull out the disks to 
stick in new ones. If you go for more CPUs, you will probably need more 
RAM to go with them.



   3. Am I better off using dual quads for a task node, with a higher power
   draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly $3200,
   but draws almost 2x as much power. The tradeoffs are:
  1. I will get more CPU per dollar and per watt.
  2. I will only be able to fit 1/2 as much dual quad machines into a
  rack.
  3. I will get 1/2 the storage capacity per watt.
  4. I will get less I/O throughput overall (less spindles per core)


First there is the algorithm itself, and whether you are IO or CPU 
bound. Most MR jobs that I've encountered are fairly IO bound -without 
indexes, every lookup has to stream through all the data, so it's power 
inefficient and IO limited. But if you are trying to do higher-level 
stuff than just lookup, then you will be doing more CPU work.


Then there is the question of where your electricity comes from, what 
the limits for the room are, whether you are billed on power drawn or 
quoted PSU draw, what the HVAC limits are, what the maximum allowed 
weight per rack is, etc, etc.


I'm a fan of low-Joule work, though we don't have any benchmarks yet of 
the power efficiency of different clusters: the number of MJ used to do 
a terasort. I'm debating doing some single-CPU tests for this on my 
laptop, as the battery knows how much gets used up by some work.



   4. In planning storage capacity, how much spare disk space should I take
   into account for 'scratch'? For now, I'm assuming 1x the input data size.


That you should probably be able to determine from experimental work on 
smaller datasets. Some maps can throw out a lot of data; most reduces do 
actually reduce the final amount.



-Steve

(Disclaimer: I'm not making any official recommendations for hardware 
here, just making my opinions known. If you do want an official 
recommendation from HP, talk to your reseller or account manager, 
someone will look at your problem in more detail and make some 
suggestions. If you have any code/data that could be shared for 
benchmarking, that would help validate those suggestions)




Re: ssh issues

2009-05-26 Thread Steve Loughran

hmar...@umbc.edu wrote:

Steve,

Security through obscurity is always a good practice from a development
standpoint and one of the reasons why tricking you out is an easy task.


:)

My most recent presentation on HDFS clusters is now online, notice how it
doesn't gloss over the security: 
http://www.slideshare.net/steve_l/hdfs-issues



Please, keep hiding relevant details from people in order to keep everyone
smiling.



HDFS is as secure as NFS: you are trusted to be who you say you are. 
Which means that you have to run it on a secured subnet -access 
restricted to trusted hosts and/or one or two front-end servers- or accept 
that your dataset is readable and writeable by anyone on the network.


There is user identification going in; it is currently at the level 
where it will stop someone accidentally deleting the entire filesystem 
if they lack the rights. Which has been known to happen.


If the team looking after the cluster demands separate SSH keys/logins for 
every machine, then not only are they making their operations costs high; 
once you have the HDFS cluster and MR engine live, it's moot. You 
can push out work to the JobTracker, which then runs it on the machines 
under whatever userid the TaskTrackers are running as. Now, 0.20+ will 
run it under the identity of the user who claimed to be submitting the 
job, but without that, your MR jobs get the filesystem access rights of 
the user that is running the TT. It's fairly straightforward to create a 
modified hadoop client JAR that doesn't call whoami to get the userid, 
and instead spoofs being anyone. Which means that even if you lock down 
the filesystem -no out-of-datacentre access- if I can run my Java code as 
MR jobs in your cluster, I can have unrestricted access to the filesystem 
by way of the task tracker server.


But Hal, if you are running Ant for your build, I'm running my code on 
your machines anyway, so you had better be glad that I'm not malicious.


-Steve


Re: ssh issues

2009-05-22 Thread Steve Loughran

Pankil Doshi wrote:

Well, I made ssh with passphrases, as the systems into which I need to log
in require ssh with passphrases and those systems have to be part of my
cluster. So I need a way to specify -i /path/to/key/ and the
passphrase to hadoop beforehand.

Pankil



Well, you are trying to manage a system whose security policy is 
incompatible with hadoop's current shell scripts. If you push out the 
configs and manage the lifecycle using other tools, this becomes a 
non-issue. Don't raise the topic of HDFS security with your ops team 
though, as they will probably be unhappy about what is currently on offer.


-steve


Re: Username in Hadoop cluster

2009-05-21 Thread Steve Loughran

Pankil Doshi wrote:

Hello everyone,

Till now I was using the same username on all my hadoop cluster machines.

But now I am building my new cluster and face a situation in which I have
different usernames for different machines. So what changes will I have to
make in configuring hadoop? Using the same username, ssh was easy; will it
be a problem now that I have different usernames?


Are you building these machines up by hand? How many? Why the different 
usernames?


Can't you just create a new user and group hadoop on all the boxes?


Re: Optimal Filesystem (and Settings) for HDFS

2009-05-20 Thread Steve Loughran

Bryan Duxbury wrote:
We use XFS for our data drives, and we've had somewhat mixed results. 



Thanks for that. I've just created a wiki page to put some of these 
notes up -extensions and some hard data would be welcome:


http://wiki.apache.org/hadoop/DiskSetup

One problem we have for hard data is that we need some different 
benchmarks for MR jobs. Terasort is good for measuring IO and MR 
framework performance, but for more CPU intensive algorithms, or things 
that need to seek round a bit more, you can't be sure that terasort 
benchmarks are a good predictor of what's right for you in terms of 
hardware, filesystem, etc.


Contributions in this area would be welcome.

I'd like to measure the power consumed on a run too, which is actually 
possible as far as my laptop is concerned, because you can ask its 
battery what happened.


-steve


Re: Suspend or scale back hadoop instance

2009-05-19 Thread Steve Loughran

John Clarke wrote:

Hi,

I am working on a project that is suited to Hadoop and so want to create a
small cluster (only 5 machines!) on our servers. The servers are however
used during the day and (mostly) idle at night.

So, I want Hadoop to run at full throttle at night and either scale back or
suspend itself during certain times.


You could add/remove task trackers on idle systems, but:
* you don't want to take away datanodes, as there's a risk that data 
will become unavailable.
* there's nothing in the scheduler to warn that machines will go away at 
a certain time.
If you only want to run the cluster at night, I'd just configure the 
entire cluster to go up and down.


Re: Beware sun's jvm version 1.6.0_05-b13 on linux

2009-05-18 Thread Steve Loughran

Allen Wittenauer wrote:



On 5/15/09 11:38 AM, Owen O'Malley o...@yahoo-inc.com wrote:


We have observed that the default jvm on RedHat 5


I'm sure some people are scratching their heads at this.

The default JVM on at least RHEL5u0/1 is a GCJ-based 1.4, clearly
incapable of running Hadoop.  We [and, really, this is my doing... ^.^ ]
replace it with the JVM from the JPackage folks.  So while this isn't the
default JVM that comes from RHEL, the warning should still be heeded. 



Presumably it's one of those hard-to-reproduce race conditions that only 
surfaces under load on a big cluster, so it is hard to replicate in a unit 
test, right?




Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Steve Loughran

Grace wrote:

To follow up on this question, I have also asked for help on the JRockit
forum. They kindly offered some useful and detailed suggestions according
to the JRA results. After updating the option list, the performance did
become better to some extent. But it is still not comparable with the Sun
JVM. Maybe it is due to the use case with short durations and different
implementations in the JVM layer between Sun and JRockit. I would like to
go back to using the Sun JVM for now. Thanks all for your time and help.



What about flipping the switch that says run tasks in the TT's own 
JVM? That should handle startup costs, and reduce the memory footprint.


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Steve Loughran

Tom White wrote:

On Mon, May 18, 2009 at 11:44 AM, Steve Loughran ste...@apache.org wrote:

Grace wrote:

To follow up on this question, I have also asked for help on the JRockit
forum. They kindly offered some useful and detailed suggestions according
to the JRA results. After updating the option list, the performance did
become better to some extent. But it is still not comparable with the Sun
JVM. Maybe it is due to the use case with short durations and different
implementations in the JVM layer between Sun and JRockit. I would like to
go back to using the Sun JVM for now. Thanks all for your time and help.


What about flipping the switch that says run tasks in the TT's own JVM?
That should handle startup costs, and reduce the memory footprint.



The property mapred.job.reuse.jvm.num.tasks allows you to set how many
tasks the JVM may be reused for (within a job), but it always runs in
a separate JVM from the tasktracker. (BTW,
https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion
about running tasks in the tasktracker's JVM.)

Tom


Tom,
that's why you are writing a book on Hadoop and I'm not ... you know the 
answers and I have some vague misunderstandings.


-steve
(returning to the svn book)
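
For completeness, a hedged sketch of turning on the reuse Tom describes; the
property name is from his mail, and -1 (meaning no limit within a job) is how
it behaves in the versions I've looked at:

import org.apache.hadoop.mapred.JobConf;

public class ReuseTaskJvms {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ReuseTaskJvms.class);
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");  // reuse JVMs for any number of tasks
    // ... the rest of the job setup and submission ...
  }
}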


Re: public IP for datanode on EC2

2009-05-15 Thread Steve Loughran

Tom White wrote:

Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
Public and Elastic IP Data Transfer). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.



I thought that AWS had stopped you being able to talk to things within 
the cluster using the public IP addresses -stopped you using DynDNS as 
your way of bootstrapping discovery


Here's what may work
-bring up the EC2 cluster using the local names
-open up the ports
-have the clients talk using the public IP addresses

The problem will arise when the namenode checks the fs name used and it 
doesn't match its expectations -there were some recent patches in the 
code to handle the case where someone talks to the namenode using the 
IP address instead of the hostname; they may work for this situation too.


Personally, I wouldn't trust the NN in the EC2 datacentres to be secure 
against external callers, but that problem already exists within their 
datacentres anyway.


Re: Winning a sixty second dash with a yellow elephant

2009-05-12 Thread Steve Loughran

Arun C Murthy wrote:

... oh, and getting it to run a marathon too!

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html 



Owen  Arun


Lovely. I will now stick up the pic of you getting the first results in 
on your laptop at ApacheCon.


Re: Huge DataNode Virtual Memory Usage

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Raghu,

I don't actually have exact numbers from jmap, although I do remember that
jmap -histo reported something less than 256MB for this process (before I
restarted it).

I just looked at another DFS process that is currently running and has a VM
size of 1.5GB (~600 resident). Here jmap reports a total object heap usage
of 120MB. The memory block list reported by jmap pid doesn't actually seem
to contain the heap at all since the largest block in that list is 10MB in
size (/usr/java/jdk1.6.0_10/jre/lib/amd64/server/libjvm.so). However, pmap
reports a total usage of 1.56GB.

-- Stefan


You know, if you could get the TaskTracker to include stats on real and 
virtual memory use, I'm sure that others would welcome those reports 
-knowing that the job was slower and its VM size was 2x physical would give 
you a good hint as to the root cause.


Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

zsongbo wrote:

Hi Stefan,
Yes, the 'nice' cannot resolve this problem.

Now, in my cluster, there are 8GB of RAM. My java heap configuration is:

HDFS DataNode : 1GB
HBase-RegionServer: 1.5GB
MR-TaskTracker: 1GB
MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)

But the memory usage is still tight.


Does the TT need to be so big if you are running all your work in external VMs?


Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Yes, I think the JVM uses way more memory than just its heap. Now some of it
might be just reserved memory, but not actually used (not sure how to tell
the difference). There are also things like thread stacks, jit compiler
cache, direct nio byte buffers etc. that take up process space outside of
the Java heap. But none of that should imho add up to Gigabytes...


Good article on this:
http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/



Re: Re-Addressing a cluster

2009-05-11 Thread Steve Loughran

jason hadoop wrote:

You should be able to relocate the cluster's IP space by stopping the
cluster, modifying the configuration files, resetting the dns and starting
the cluster.
Be best to verify connectivity with the new IP addresses before starting the
cluster.

To the best of my knowledge the namenode doesn't care about the IP addresses
of the datanodes, only what blocks they report as having. The namenode does
care about losing contact with a connected datanode, and will replicate the
blocks that are now under-replicated.

I prefer IP addresses in my configuration files but that is a personal
preference not a requirement.



I do deployments onto virtual clusters without fully functional reverse 
DNS, and things do work badly in that situation. Hadoop assumes that if a 
machine looks up its hostname, it can pass that to peers and they can 
resolve it -the well-managed network infrastructure assumption.


Re: datanode replication

2009-05-11 Thread Steve Loughran

Jeff Hammerbacher wrote:

Hey Vishal,

Check out the chooseTarget() method(s) of ReplicationTargetChooser.java in
the org.apache.hadoop.hdfs.server.namenode package:
http://svn.apache.org/viewvc/hadoop/core/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java?view=markup
.

In words: assuming you're using the default replication level (3), the
default strategy will put one block on the local node, one on a node in a
remote rack, and another on that same remote rack.

Note that HADOOP-3799 (http://issues.apache.org/jira/browse/HADOOP-3799)
proposes making this strategy pluggable.



Yes, there are some good reasons for having different placement algorithms 
for different datacentres, and I could even imagine different MR 
sequences providing hints about where they want data, depending on what 
they want to do afterwards.


Re: Re-Addressing a cluster

2009-05-11 Thread Steve Loughran

jason hadoop wrote:

Now that I think about it, the reverse lookups in my clusters work.


and you have made sure that IPv6 is turned off, right?




Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-08 Thread Steve Loughran

Grace wrote:

Thanks all for replying.

I have run several times with different Java options for Map/Reduce
tasks. However there is no much difference.

Following is the example of my test setting:
Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages
-XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch
Test B: -Xmx1024m
Test C: -Xmx1024m -XXaggressive

Is there any tricky or special setting for Jrockit vm on Hadoop?

In the Hadoop Quick Start guide, it says JavaTM 1.6.x, preferably
from Sun, is required. Is there any concern about JRockit performance?



The main thing is that all the big clusters are running (as far as I know) 
Linux (probably RedHat) and Sun Java. This is where the performance and 
scale testing is done. If you are willing to spend time doing the 
experiments and tuning, then I'm sure we can update those guides to say 
"JRockit works, here are some options".


-steve





Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Steve Loughran

Chris Collins wrote:
A couple of years back we did a lot of experimentation between Sun's VM 
and JRockit. We had initially assumed that JRockit was going to scream, 
since that's what the press were saying. In short, what we discovered 
was that certain JDK library usage was a little bit faster with JRockit, 
but for core VM performance such as synchronization and primitive 
operations the Sun VM outperformed it. We were not taking account of 
startup time, just raw code execution. As I said, this was a couple of 
years back so things may have changed.


C


I run JRockit as it's what some of our key customers use, and we need to 
test things. One lovely feature: tests time out before the stack runs 
out on a recursive operation; clearly different stack management at 
work. Another: no PermGen heap space to fiddle with.


* I have to turn debug logging off in hadoop test runs, or there are 
problems.


* It uses short pointers (32 bits long) for near memory on a 64-bit JVM, 
so your memory footprint on sub-4GB VM images is better. Java 7 promises 
this, and with the merger, who knows what we will see. This is 
unimportant on 32-bit boxes.


* Debug single-stepping doesn't work. That's OK, I use functional tests 
instead :)


I haven't looked at outright performance.


Re: move tasks to another machine on the fly

2009-05-06 Thread Steve Loughran

Tom White wrote:

Hi David,

The MapReduce framework will attempt to rerun failed tasks
automatically. However, if a task is running out of memory on one
machine, it's likely to run out of memory on another, isn't it? Have a
look at the mapred.child.java.opts configuration property for the
amount of memory that each task VM is given (200MB by default). You
can also control the memory that each daemon gets using the
HADOOP_HEAPSIZE variable in hadoop-env.sh. Or you can specify it on a
per-daemon basis using the HADOOP_DAEMON_NAME_OPTS variables in the
same file.

Tom


This looks not so much like a VM out-of-memory problem as an OS thread 
provisioning one. ulimit may be useful, as is the Java -Xss option:


http://candrews.integralblue.com/2009/01/preventing-outofmemoryerror-native-thread/



On Wed, May 6, 2009 at 1:28 AM, David Batista dsbati...@gmail.com wrote:

I get this error when running Reduce tasks on a machine:

java.lang.OutOfMemoryError: unable to create new native thread
   at java.lang.Thread.start0(Native Method)
   at java.lang.Thread.start(Thread.java:597)
   at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2591)
   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:387)
   at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
   at 
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
   at 
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)

is it possible to move a reduce task to other machine in the cluster on the fly?

--
./david





Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Steve Loughran

Bradford Stephens wrote:

Hey all,

I'm going to be speaking at OSCON about my company's experiences with
Hadoop and Friends, but I'm having a hard time coming up with a name
for the entire software ecosystem. I'm thinking of calling it the
Apache CloudStack. Does this sound legit to you all? :) Is there
something more 'official'?


We've been using Apache Cloud Computing Edition for this, to emphasise 
that this is the successor to Java Enterprise Edition, and that it is 
cross-language and being built at Apache. If you use the same term, even if 
you put up a different stack outline than ours, it gives the idea more 
legitimacy.


The slides that Andrew linked to are all in SVN under
http://svn.apache.org/repos/asf/labs/clouds/

We have a space in the Apache Labs for Apache Clouds, where we want to 
do more work integrating things, and bring the idea of deploying and 
testing on someone else's infrastructure mainstream across all the Apache 
products. We would welcome your involvement -and if you send a draft of 
your slides out, we will happily review them.


-steve


Re: I need help

2009-04-30 Thread Steve Loughran

Razen Alharbi wrote:

Thanks everybody,

The issue was that hadoop writes all the output to stderr instead of stdout,
and I don't know why. I would really love to know why the usual hadoop job
progress is written to stderr.


Because there is a line in log4j.properties telling it to do just that:

log4j.appender.console.target=System.err


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: programming java ee and hadoop at the same time

2009-04-29 Thread Steve Loughran

Bill Habermaas wrote:
George, 


I haven't used the Hadoop perspective in Eclipse so I can't help with that
specifically but map/reduce is a batch process (and can be long running). In
my experience, I've written servlets that write to HDFS and then have a
background process perform the map/reduce. They can both run in background
under Eclipse but are not tightly coupled. 



I've discussed this recently: having a good binding from webapps to the 
back end, where the back end consists of HDFS, MapReduce queues, and 
things around them.


https://svn.apache.org/repos/asf/labs/clouds/src/doc/apache_cloud_computing_edition_oxford.odp

If people are willing to help with this, we have an Apache Labs 
project, Apache Clouds, ready for your code, tests and ideas.


Re: Can i make a node just an HDFS client to put/get data into hadoop

2009-04-29 Thread Steve Loughran

Usman Waheed wrote:

Hi All,

Is it possible to make a node just a hadoop client so that it can 
put/get files into HDFS but not act as a namenode or datanode?
I already have a master node and 3 datanodes but need to execute 
puts/gets into hadoop in parallel using more than just one machine other 
than the master.




Anything on the LAN can be a client of the filesystem, you just need 
appropriate hadoop configuration files to talk to the namenode and job 
tracker. I don't know how well the (custom) IPC works over long 
distances, and you have to keep the versions in sync for everything to 
work reliably.
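
A minimal sketch of such a client-only node in action, assuming its copy of
hadoop-site.xml points fs.default.name at the cluster's namenode; the paths
are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientOnlyNode {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // talks to the remote namenode
    fs.copyFromLocalFile(new Path("/local/in.txt"), new Path("/user/usman/in.txt"));
    fs.copyToLocalFile(new Path("/user/usman/out.txt"), new Path("/local/out.txt"));
    fs.close();
  }
}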


Re: I need help

2009-04-28 Thread Steve Loughran

Razen Al Harbi wrote:

Hi all,

I am writing an application in which I create a forked process to execute a 
specific Map/Reduce job. The problem is that when I try to read the output 
stream of the forked process I get nothing and when I execute the same job 
manually it starts printing the output I am expecting. For clarification I will 
go through the simple code snippet:


Process p = rt.exec("hadoop jar GraphClean args");
BufferedReader reader = new BufferedReader(new 
InputStreamReader(p.getInputStream()));
String line = null;
boolean check = true;
while(check){
    line = reader.readLine();
    if(line != null){ // I know this will not finish; it's only for testing.
        System.out.println(line);
    } 
}


If I run this code nothing shows up. But if I execute the command (hadoop jar 
GraphClean args) from the command line it works fine. I am using hadoop 0.19.0.



Why not just invoke the Hadoop job submission calls yourself, no need to 
exec anything?


Look at org.apache.hadoop.util.RunJar to see what you need to do.

Avoid calling RunJar.main() directly as
 - it calls System.exit() when it wants to exit with an error
 - it adds shutdown hooks
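
If you do stick with forking, remember (per the earlier thread) that hadoop
writes its progress to stderr, so read that stream too. A rough sketch, with
the command taken from your mail:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RunForkedJob {
  public static void main(String[] args) throws Exception {
    ProcessBuilder pb = new ProcessBuilder("hadoop", "jar", "GraphClean", "args");
    pb.redirectErrorStream(true);                 // fold stderr into stdout
    Process p = pb.start();
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = reader.readLine()) != null) {  // readLine() returns null at EOF
      System.out.println(line);
    }
    System.out.println("exit code: " + p.waitFor());
  }
}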

-steve


Re: Storing data-node content to other machine

2009-04-28 Thread Steve Loughran

Vishal Ghawate wrote:

Hi,
I want to store the contents of all the client machines (datanodes) of the 
hadoop cluster on a centralized machine with high storage capacity, so that 
the tasktracker will be on the client machine but the contents are stored on 
the centralized machine.
Can anybody help me with this please?



Set the datanode to point to the (mounted) filesystem with the 
dfs.data.dir parameter.




Re: Processing High CPU Memory intensive tasks on Hadoop - Architecture question

2009-04-28 Thread Steve Loughran

Aaron Kimball wrote:

I'm not aware of any documentation about this particular use case for
Hadoop. I think your best bet is to look into the JNI documentation about
loading native libraries, and go from there.
- Aaron


You could also try:

1. Starting the main processing app as a process on the machines -and 
leaving it running.


2. Having your mapper (somehow) talk to that running process, passing in 
parameters (including local filesystem filenames) to read and write.


You can use RMI or other IPC mechanisms to talk to the long-lived process.



Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread Steve Loughran

Stas Oskin wrote:

Hi.

2009/4/23 Matt Massie m...@cloudera.com


Just for clarity: are you using any type of virtualization (e.g. vmware,
xen) or just running the DataNode java process on the same machine?

What is fs.default.name set to in your hadoop-site.xml?




This machine has OpenVZ installed indeed, but all the applications run
within the host node, meaning all Java processes are running within the same
machine.


Maybe, but there will still be at least one virtual network adapter on 
the host. Try turning them off.




The fs.default.name is:
hdfs://192.168.253.20:8020


What happens if you switch from IP addresses to hostnames?


Re: RPM spec file for 0.19.1

2009-04-21 Thread Steve Loughran

Ian Soboroff wrote:

Steve Loughran ste...@apache.org writes:


I think from your perspective it makes sense as it stops anyone getting
itchy fingers and doing their own RPMs. 


Um, what's wrong with that?



It's really hard to do good RPM spec files. If Cloudera are willing to 
pay Matt to do it, not only do I welcome it, but I will see if I can help 
him with some of the automated test setup they'll need.


One thing which would be useful would be to package up all of the hadoop 
functional tests that need a live cluster as their own JAR, so the test 
suite could be run against an RPM installation on different virtual 
OS/JVM combos. I've just hit this problem with my own RPMs on Java 6 
(Java security related), so I know that having the ability to run the 
entire existing test suite against an RPM installation would be 
beneficial (both in my case and for hadoop RPMs).


Re: How many people is using Hadoop Streaming ?

2009-04-21 Thread Steve Loughran

Tim Wintle wrote:

On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:

  1) I can pick the language that offers a different programming
paradigm (e.g. I may choose functional language, or logic programming
if they suit the problem better).  In fact, I can even chosen Erlang
at the map() and Prolog at the reduce().  Mix and match can optimize
me more.


Agreed (as someone who has written mappers/reducers in Python, perl,
shell script and Scheme before).



Sounds like a good argument for adding scripting support for in-JVM MR 
jobs: use the Java 6 scripting APIs and use any of the supported 
languages -JavaScript out of the box, other languages (Jython, Scala) with 
the right JARs.
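
A hedged sketch of the idea -the Java 6 javax.script API evaluating a line of
JavaScript inside the JVM; wiring this into a Mapper is left as the exercise:

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class InJvmScripting {
  public static void main(String[] args) throws Exception {
    ScriptEngine js = new ScriptEngineManager().getEngineByName("JavaScript");
    js.put("record", "some input line");          // bind a Java object into the script
    System.out.println(js.eval("record.toUpperCase()"));
  }
}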


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-21 Thread Steve Loughran

Andrew Newman wrote:

They are comparing an indexed system with one that isn't.  Why is
Hadoop faster at loading than the others?  Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does.  Who want to write a paper for next year, grep vs reverse
index?

2009/4/15 Guilherme Germoglio germog...@gmail.com:

(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/



I think it is interesting, though it misses the point that the reason 
few datasets are 1PB today is that nobody could afford to store or 
process the data. With Hadoop, cost is somewhat high (learn to patch the 
source to fix your cluster's problems) but scales well with the number of 
nodes. Commodity storage costs (my own home now has 2TB of storage) 
and commodity software costs are compatible.


Some other things to look at

-power efficiency. I actually think the DBs could come out better
-ease of writing applications by skilled developers. Pig vs SQL
-performance under different workloads (take a set of log files growing 
continually, mine it in near-real time. I think the last.fm use case 
would be a good one)



One of the great ironies of SQL is that most developers don't go near it, as 
it is a detail handled by the O/R mapping engine, except when building 
SQL selects for web pages. If Pig makes M/R easy, would it be used -and 
if so, does that show that we developers prefer procedural thinking?


-steve





Re: Error reading task output

2009-04-21 Thread Steve Loughran

Cam Macdonell wrote:



Well, for future googlers, I'll answer my own post. Watch out for the 
hostname at the end of localhost lines on slaves. One of my slaves 
was registering itself as localhost.localdomain with the jobtracker.


Is there a way that Hadoop could be made to not be so dependent on 
/etc/hosts, but on more dynamic hostname resolution?




DNS is trouble in Java; there are some (outstanding) bug reports/hadoop 
patches on the topic, mostly showing up on a machine of mine with a bad 
hosts entry. I also encountered some fun last month with Ubuntu Linux 
adding the local hostname to /etc/hosts alongside the 127.0.0.1 entry, which 
is precisely what you don't want for a cluster of VMs with no DNS at all. 
This sounds like your problem too, in which case I have shared your pain:


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52



Re: getting DiskErrorException during map

2009-04-21 Thread Steve Loughran

Jim Twensky wrote:

Yes, here is how it looks:

property
namehadoop.tmp.dir/name
value/scratch/local/jim/hadoop-${user.name}/value
/property

so I don't know why it still writes to /tmp. As a temporary workaround, I
created a symbolic link from /tmp/hadoop-jim to /scratch/...
and it works fine now, but if you think this might be considered a bug,
I can report it.


I've encountered this somewhere too; it could be that something is using the 
Java temp file API, which is not what you want. Try setting 
java.io.tmpdir to /scratch/local/tmp just to see if that makes it go away.
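
A hedged sketch of the workaround -push java.io.tmpdir down to the child JVMs
via mapred.child.java.opts (the default there is just -Xmx200m); the path is
illustrative:

import org.apache.hadoop.mapred.JobConf;

public class ChildTmpDir {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ChildTmpDir.class);
    conf.set("mapred.child.java.opts",
             "-Xmx200m -Djava.io.tmpdir=/scratch/local/jim/tmp");
    // ... rest of the job setup ...
  }
}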





Re: Error reading task output

2009-04-21 Thread Steve Loughran

Aaron Kimball wrote:

Cam,

This isn't Hadoop-specific, it's how Linux treats its network configuration.
If you look at /etc/host.conf, you'll probably see a line that says order
hosts, bind -- this is telling Linux's DNS resolution library to first read
your /etc/hosts file, then check an external DNS server.

You could probably disable local hostfile checking, but that means that
every time a program on your system queries the authoritative hostname for
localhost, it'll go out to the network. You'll probably see a big
performance hit. The better solution, I think, is to get your nodes'
/etc/hosts files squared away.


I agree


You only need to do so once :)


No, you need to detect whenever the Linux networking stack has decided 
to add new entries to resolv.conf or /etc/hosts, and detect when they 
are inappropriate. That is a tricky thing to do, as there are some cases 
where you may actually be grateful that someone in the Debian codebase 
decided that adding the local hostname as 127.0.0.1 is a 
feature. I ended up writing a new SmartFrog component that can be 
configured to fail to start if the network is a mess, which is something 
worth pushing out.


As part of Hadoop diagnostics, this test would be one of the things to 
check and at least warn on: your hostname resolves to a loopback address, 
so you will not be visible over the network.
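
The check itself is only a few lines; something like this (not part of 
Hadoop, just a sketch of the warning I mean):

import java.net.InetAddress;

public class HostnameCheck {
  public static void main(String[] args) throws Exception {
    // getLocalHost() resolves the machine's hostname, typically via /etc/hosts first
    InetAddress addr = InetAddress.getLocalHost();
    System.out.println(addr.getHostName() + " -> " + addr.getHostAddress());
    if (addr.isLoopbackAddress()) {
      System.err.println("WARNING: your hostname resolves to a loopback address; "
          + "other nodes will not be able to reach this machine by name.");
    }
  }
}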


-steve



Re: RPM spec file for 0.19.1

2009-04-03 Thread Steve Loughran

Ian Soboroff wrote:

I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
with a spec file for building a 0.19.1 RPM.

I like the idea of Cloudera's RPM file very much.  In particular, it has
nifty /etc/init.d scripts and RPM is nice for managing updates.
However, it's for an older, patched version of Hadoop.

This spec file is actually just Cloudera's, with suitable edits.  The
spec file does not contain an explicit license... if Cloudera have
strong feelings about it, let me know and I'll pull the JIRA attachment.

The JIRA includes instructions on how to roll the RPMs yourself.  I
would have attached the SRPM but they're too big for JIRA.  I can offer
noarch RPMs build with this spec file if someone wants to host them.

Ian



-RPM and deb packaging would be nice

-the .spec file should be driven by ant properties to get dependencies 
from the ivy files
-the jdk requirements are too harsh as it should run on openjdk's JRE or 
jrockit; no need for sun only. Too bad the only way to say that is leave 
off all jdk dependencies.
-I worry about how they patch the rc.d files. I can see why, but wonder 
what that does with the RPM ownership


As someone whose software does get released as RPMs (and tar files 
containing everything needed to create your own), I can state with 
experience that RPMs are very hard to get right, and very hard to test. 
The hardest thing to get right (and to test) is live update of the RPMs 
while the app is running. I am happy for the cloudera team to have taken 
on this problem.


Re: Using HDFS to serve www requests

2009-04-03 Thread Steve Loughran

Snehal Nagmote wrote:

Can you please explain exactly what adding an NIO bridge means and how it can be
done? What would the advantages be in this case?


NIO: Java non-blocking IO. It's a standard API to talk to different 
filesystems; support has been discussed in JIRA. If the DFS APIs were 
accessible under an NIO front end, then applications written for the NIO 
APIs would work with the supported filesystems, with no need to code 
specifically for Hadoop's not-yet-stable APIs.
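
In the meantime, going direct against the Hadoop API from a servlet looks 
roughly like this - a sketch only, modelled on the wiki read/write example; 
the filesystem URL and paths are placeholders for your own setup:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServlet extends HttpServlet {
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000/");   // placeholder namenode URL
    FileSystem fs = FileSystem.get(conf);
    // map the request path onto a directory tree in HDFS
    FSDataInputStream in = fs.open(new Path("/www" + req.getPathInfo()));
    try {
      IOUtils.copyBytes(in, resp.getOutputStream(), conf, false);
    } finally {
      in.close();
    }
  }
}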







Steve Loughran wrote:

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.

I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

If someone adds an NIO bridge to hadoop filesystems then it would be 
easier; leaving you only with the performance issues.







Re: Amazon Elastic MapReduce

2009-04-03 Thread Steve Loughran

Brian Bockelman wrote:


On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:

seems like I should pay additional money, so why not configure a Hadoop
cluster on EC2 by myself? This has already been automated using scripts.




Not everyone has a support team or an operations team or enough time to 
learn how to do it themselves.  You're basically paying for the fact 
that the only thing you need to know to use Hadoop is:

1) Be able to write the Java classes.
2) Press the go button on a webpage somewhere.

You could use Hadoop with little-to-zero systems knowledge (and without 
institutional support), which would always make some researchers happy.


Brian


True, but this way nobody gets the opportunity to learn how to do it 
themselves, which can be a tactical error one comes to regret further 
down the line. By learning the pain of cluster management today, you get 
to keep it under control as your data grows.


I am curious what bug patches AWS will supply, for they have been very 
silent on their hadoop work to date.


Re: Typical hardware configurations

2009-03-31 Thread Steve Loughran

Scott Carey wrote:

On 3/30/09 4:41 AM, Steve Loughran ste...@apache.org wrote:


Ryan Rawson wrote:


You should also be getting 64-bit systems and running a 64 bit distro on it
and a jvm that has -d64 available.

For the namenode yes. For the others, you will take a fairly big memory
hit (1.5X object size) due to the longer pointers. JRockit has special
compressed pointers, so will JDK 7, apparently.



Sun Java 6 update 14 has "Ordinary Object Pointer" compression as well:
-XX:+UseCompressedOops.  I've been testing out the pre-release of that with
great success.


Nice. Have you tried Hadoop with it yet?



JRockit has virtually no 64-bit overhead up to 4GB, Sun Java 6u14 has small
overhead up to 32GB with the new compression scheme.  IBM's VM also has some
sort of pointer compression but I don't have experience with it myself.


I use the JRockit JVM as it is what our customers use and we need to 
test on the same JVM. It is interesting in that recursive calls don't 
ever seem to run out of stack; the way it manages memory doesn't have 
separate spaces for stack, permanent-generation heap and the like.



That doesn't mean apps are light: a freshly started IDE consumes more 
physical memory than a VMWare image running XP and outlook. But it is 
fairly responsive, which is good for a UI:

2295m 650m  22m S2 10.9   0:43.80 java
855m 543m 530m S   11  9.1   4:40.40 vmware-vmx




http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/tag/compressed-oops/
 
With pointer compression, there may be gains to be had with running 64 bit

JVMs smaller than 4GB on x86 since then the runtime has access to native 64
bit integer operations and registers (as well as 2x the register count).  It
will be highly use-case dependent.


that would certainly benefit atomic operations on longs; for floating 
point math it would be less useful as JVMs have long made use of the SSE 
register set for FP work. 64 bit registers would make it easier to move 
stuff in and out of those registers.


I will try and set up a hudson server with this update and see how well 
it behaves.


Re: JNI and calling Hadoop jar files

2009-03-30 Thread Steve Loughran

jason hadoop wrote:

The exception reference to *org.apache.hadoop.hdfs.DistributedFileSystem*,
implies strongly that a hadoop-default.xml file, or at least a  job.xml file
is present.
Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the
assumption is that the core jar is available.
The class not found exception, the implication is that the
hadoop-0.X.Y-core.jar is not available to jni.

Given the above constraints, the two likely possibilities are that the -core
jar is unavailable or damaged, or that somehow the classloader being used
does not have access to the -core  jar.

A possible reason for the jar not being available is that the application is
running on a different machine, or as a different user and the jar is not
actually present or perhaps readable in the expected location.





Which way is your JNI, java application calling into a native shared
library, or a native application calling into a jvm that it instantiates via
libjvm calls?

Could you dump the classpath that is in effect before your failing jni call?
System.getProperty("java.class.path"), and for that matter,
"java.library.path", or getenv("CLASSPATH")
and provide an ls -l of the core.jar from the class path, run as the user
that owns the process, on the machine that the process is running on.



Or something bad is happening with a dependent library of the filesystem 
that is causing the reflection-based load to fail and die with the root 
cause being lost in the process. Sometimes putting an explicit reference 
to the class you are trying to load is a good way to force the problem 
to surface earlier, and fail with better error messages.
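
Something as blunt as this near the start of the Java entry point usually 
does it - a sketch, nothing more:

// Fail fast if the hadoop core jar isn't visible to this classloader:
// a missing jar shows up here as NoClassDefFoundError with a clean stack,
// rather than as a lost root cause inside the reflective FileSystem lookup.
Class<?> probe = org.apache.hadoop.hdfs.DistributedFileSystem.class;
System.out.println("Loaded " + probe.getName() + " from "
    + probe.getProtectionDomain().getCodeSource());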


Re: virtualization with hadoop

2009-03-30 Thread Steve Loughran

Oliver Fischer wrote:

Hello Vishal,

I did the same some weeks ago. The most important fact is, that it
works. But it is horrible slow if you not have enough ram and multiple
disks since all I/o-Operations go to the same disk.


they may go to separate disks underneath, but performance is bad as what 
the virtual OS thinks is a raw hard disk could be a badly fragmented bit 
of storage on the container OS.


Memory is another point of conflict; your VMs will swap out or block 
other vms.


0. Keep different VM virtual disks on different physical disks. Fast 
disks at that.

1. pre-allocate your virtual disks
2. defragment at both the VM and host OS levels.
3. Crank back the schedulers so that the VMs aren't competing too much 
for CPU time. One core for the host OS, one for each VM.
4. You can keep an eye on performance by looking at the clocks of the 
various machines: if they pause and get jittery then they are being 
swapped out.


Using multiple VMs on a single host is OK for testing, but not for hard 
work. You can use VM images to do work, but you need to have enough 
physical cores and RAM to match that of the VMs.


-steve







Re: Typical hardware configurations

2009-03-30 Thread Steve Loughran

Ryan Rawson wrote:


You should also be getting 64-bit systems and running a 64 bit distro on it
and a jvm that has -d64 available.


For the namenode yes. For the others, you will take a fairly big memory 
hit (1.5X object size) due to the longer pointers. JRockit has special 
compressed pointers, so will JDK 7, apparently.





Re: Using HDFS to serve www requests

2009-03-30 Thread Steve Loughran

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.

I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.



If someone adds an NIO bridge to hadoop filesystems then it would be 
easier; leaving you only with the performance issues.


Re: Persistent HDFS On EC2

2009-03-12 Thread Steve Loughran

Kris Jirapinyo wrote:

Why would you lose the locality of storage-per-machine if one EBS volume is
mounted to each machine instance?  When that machine goes down, you can just
restart the instance and re-mount the exact same volume.  I've tried this
idea before successfully on a 10 node cluster on EC2, and didn't see any
adverse performance effects--



I was thinking more of the S3 filesystem, which is remote-ish and has measurable write times.


and actually amazon claims that EBS I/O should
be even better than the instance stores.  


Assuming the transient filesystems are virtual disks (and not physical 
disks that get scrubbed, formatted and mounted on every VM 
instantiation), and also assuming that EBS disks are on a SAN in the 
same datacentre, this is probably true. Disk IO performance in virtual 
disks is currently pretty slow, as you are navigating through more than one 
filesystem and potentially seeking a lot, even on something that appears 
unfragmented at the VM level.





The only concerns I see are that

you need to pay for EBS storage regardless of whether you use that storage
or not.  So, if you have 10 EBS volumes of 1 TB each, and you're just
starting out with your cluster so you're using only 50GB on each EBS volume
so far for the month, you'd still have to pay for 10TB worth of EBS volumes,
and that could be a hefty price for each month.  Also, currently EBS needs
to be created in the same availability zone as your instances, so you need
to make sure that they are created correctly, as there is no direct
migration of EBS to different availability zones.


View EBS as renting space in SAN and it starts to make sense.


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Extending ClusterMapReduceTestCase

2009-03-12 Thread Steve Loughran

jason hadoop wrote:

I am having trouble reproducing this one. It happened in a very specific
environment that pulled in an alternate sax parser.

The bottom line is that jetty expects a parser with particular capabilities
and if it doesn't get one, odd things happen.

In a day or so I will have hopefully worked out the details, but it has been
have a year since I dealt with this last.

Unless you are forking, to run your junit tests, ant won't let you change
the class path for your unit tests - much chaos will ensue.


Even if you fork, unless you set includeantruntime=false then you get 
Ant's classpath, as the junit test listeners are in the 
ant-optional-junit.jar and you'd better pull them in somehow.


I can see why AElfred would cause problems for jetty; they need to 
handle web.xml and suchlike, and probably validate them against the 
schema to reduce support calls.




Re: using virtual slave machines

2009-03-12 Thread Steve Loughran

Karthikeyan V wrote:

There is no specific procedure for configuring virtual machine slaves.



make sure the following thing are done.



I've used these as the beginning of a page on this

http://wiki.apache.org/hadoop/VirtualCluster




Re: Persistent HDFS On EC2

2009-03-11 Thread Steve Loughran

Malcolm Matalka wrote:

If this is not the correct place to ask Hadoop + EC2 questions please
let me know.

 


I am trying to get a handle on how to use Hadoop on EC2 before
committing any money to it.  My question is, how do I maintain a
persistent HDFS between restarts of instances.  Most of the tutorials I
have found involve the cluster being wiped once all the instances are
shut down but in my particular case I will be feeding output of a
previous days run as the input of the current days run and this data
will get large over time.  I see I can use s3 as the file system, would
I just create an EBS  volume for each instance?  What are my options?


 EBS would cost you more; you'd lose the locality of storage-per-machine.

If you stick the output of some runs back into S3 then the next jobs 
have no locality and a higher startup overhead to pull the data down, but 
you don't pay for that download (just the time it takes).


Re: master trying fetch data from slave using localhost hostname :)

2009-03-09 Thread Steve Loughran

pavelkolo...@gmail.com wrote:
On Fri, 06 Mar 2009 14:41:57 -, jason hadoop 
jason.had...@gmail.com wrote:


I see that when the host name of the node is also on the localhost 
line in

/etc/hosts



I erased all records with localhost from all /etc/hosts files and 
all fine now :)

Thank you :)



what does /etc/hosts look like now?

I hit some problems with ubuntu and localhost last week; the hostname 
was set up in /etc/hosts not just to point to the loopback address, but 
to a different loopback address (127.0.1.1) from the normal value 
(127.0.0.1), so breaking everything.


http://www.1060.org/blogxter/entry?publicid=121ED68BB21DB8C060FE88607222EB52


Re: [ANNOUNCE] Hadoop release 0.19.1 available

2009-03-03 Thread Steve Loughran

Aviad sela wrote:

Nigel Thanks,
I have extracted the new project.

However, I am having problems building the project
I am using Eclipse 3.4
and ant 1.7

I receive an error compiling the core classes
*

compile-core-classes*:

BUILD FAILED

<jsp-compile
    uriroot="${src.webapps}/task"
    outputdir="${build.src}"
    package="org.apache.hadoop.mapred"
    webxml="${build.webapps}/task/WEB-INF/web.xml">
</jsp-compile>
*

D:\Work\AviadWork\workspace\cur\WSAD\Hadoop_Core_19_1\Hadoop\build.xml:302:
java.lang.ExceptionInInitializerError

*
it points to the the webxml tag



Try an ant -verbose and post the full log, we may be able to look at the 
problem more. Also, run an


ant -diagnostics

and include what it prints

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: the question about the common pc?

2009-02-23 Thread Steve Loughran

Tim Wintle wrote:

On Fri, 2009-02-20 at 13:07 +, Steve Loughran wrote:
I've been doing MapReduce work over small in-memory datasets 
using Erlang,  which works very well in such a context.


I've got some (mainly python) scripts (that will probably be run with
hadoop streaming eventually) that I run over multiple cpus/cores on a
single machine by opening the appropriate number of named pipes and
using tee and awk to split the workload

something like


mkfifo mypipe1
mkfifo mypipe2
awk '0 == NR % 2' < mypipe1 | ./mapper | sort > map_out_1 &

awk '0 == (NR+1) % 2' < mypipe2 | ./mapper | sort > map_out_2 &

./get_lots_of_data | tee mypipe1 > mypipe2


(wait until it's done... or send a signal from the get_lots_of_data
process on completion if it's a cronjob)


sort -m map_out* | ./reducer > reduce_out


works around the global interpreter lock in python quite nicely and
doesn't need people that write the scripts (who may not be programmers)
to understand multiple processes etc, just stdin and stdout.



Dumbo provides py support under Hadoop:
 http://wiki.github.com/klbostee/dumbo
 https://issues.apache.org/jira/browse/HADOOP-4304

as well as that, given Hadoop is java1.6+, there's no reason why it 
couldn't support the javax.script engine, with JavaScript working 
without extra JAR files, groovy and jython once their JARs were stuck on 
the classpath. Some work would probably be needed to make it easier to 
use these languages, and then there are the tests...


Re: GenericOptionsParser warning

2009-02-20 Thread Steve Loughran

Rasit OZDAS wrote:

Hi,
There is a JIRA issue about this problem, if I understand it correctly:
https://issues.apache.org/jira/browse/HADOOP-3743

Strange, that I searched all source code, but there exists only this control
in 2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
           "Applications should implement Tool for the same.");
}

Just an if block for logging, no extra controls.
Am I missing something?

If your class implements Tool, than there shouldn't be a warning.


OK, for my automated submission code I'll just set that switch and I 
won't get told off.
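
That is, one line on the JobConf being submitted (jobConf here) - it only 
silences the warning, it doesn't change how the arguments get parsed:

jobConf.setBoolean("mapred.used.genericoptionsparser", true);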


Re: the question about the common pc?

2009-02-20 Thread Steve Loughran

?? wrote:

Actually, there's a widespread misunderstanding of this "common PC". Common PC 
doesn't mean PCs in daily use; it means the performance of each node can be 
measured in terms of a common PC's computing power.

As a matter of fact, we don't use Gb ethernet for daily PCs' communication, we don't use 
Linux for our document processing, and most importantly, Hadoop cannot run effectively on 
those daily PCs.

Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs.

Hadoop for PCs? What a joke.


Hadoop is designed to build a high-throughput data-processing 
infrastructure from commodity PC parts: SATA rather than RAID or SAN, 
x86 + Linux rather than supercomputer hardware and OS. You can bring it up 
on lighter-weight systems, but it has a minimum overhead that is quite steep 
for small datasets. I've been doing MapReduce work over small in-memory 
datasets using Erlang, which works very well in such a context.


-you need a good network, with DNS working (fast), good backbone and 
switches

-the faster your disks, the better your throughput
-ECC memory makes a lot of sense
-you need a good cluster management setup unless you like SSH-ing to 20 
boxes to find out which one is playing up


Re: How to use Hadoop API to submit job?

2009-02-20 Thread Steve Loughran

Wu Wei wrote:

Hi,

I used to submit Hadoop jobs with the utility RunJar.main() on Hadoop 
0.18. On Hadoop 0.19, because the commandLineConfig of JobClient was 
null, I got a NullPointerException when RunJar.main() called 
GenericOptionsParser to get libJars (0.18 didn't make this call). I also 
tried the class JobShell to submit the job, but it catches all exceptions 
and sends them to stderr, so I can't handle the exceptions myself.

I noticed that if I could call JobClient's setCommandLineConfig method, 
everything would be easy. But this method has default (package) accessibility, 
so I can't see the method from outside the package org.apache.hadoop.mapred.

Any advice on using the Java APIs to submit a job?
Any advices on using Java APIs to submit job?

Wei


Looking at my code, the line that does the work is

JobClient jc = new JobClient(jobConf);
runningJob = jc.submitJob(jobConf);

My full (LGPL) code is here : http://tinyurl.com/djk6vj

there's more work with validating input and output directories, pulling 
back the results, handling timeouts if the job doesn't complete, etc., 
but that's feature creep.
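
For reference, the completion/timeout handling is roughly this shape - a 
sketch using the org.apache.hadoop.mapred API, assumed to sit inside a method 
that can throw IOException and InterruptedException; the one-hour deadline is 
just an example value:

JobClient jc = new JobClient(jobConf);
RunningJob runningJob = jc.submitJob(jobConf);
long deadline = System.currentTimeMillis() + 60 * 60 * 1000L;   // arbitrary example timeout
while (!runningJob.isComplete()) {
  if (System.currentTimeMillis() > deadline) {
    runningJob.killJob();
    throw new IOException("Job " + runningJob.getID() + " timed out");
  }
  Thread.sleep(5000);   // poll every few seconds
}
if (!runningJob.isSuccessful()) {
  throw new IOException("Job " + runningJob.getID() + " failed");
}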


Re: GenericOptionsParser warning

2009-02-18 Thread Steve Loughran

Sandhya E wrote:

Hi All

I prepare my JobConf object in a java class, by calling various set
apis in JobConf object. When I submit the jobconf object using
JobClient.runJob(conf), I'm seeing the warning:
Use GenericOptionsParser for parsing the arguments. Applications
should implement Tool for the same. From hadoop sources it looks like
setting mapred.used.genericoptionsparser will prevent this warning.
But if I set this flag to true, will it have some other side effects.

Thanks
Sandhya


Seen this message too - it annoys me; I haven't tracked it down.
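
The fix the warning is nudging you towards is the Tool route; a minimal 
skeleton looks like this (MyJob is a placeholder name, and the job setup 
itself is elided):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // start from getConf() so that -D, -libjars etc. parsed by ToolRunner are picked up
    JobConf conf = new JobConf(getConf(), MyJob.class);
    // ... set mapper, reducer, input and output paths here ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}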


Re: HDFS architecture based on GFS?

2009-02-17 Thread Steve Loughran

Amr Awadallah wrote:

I didn't understand the usage of malicious here,
but any process using HDFS api should first ask NameNode where the


Rasit,

  Matei is referring to the fact that a malicious piece of code can bypass the
Name Node and connect to any data node directly, or probe all data nodes for
that matter. There is no strong authentication for RPC at this layer of
HDFS, which is one of the current shortcomings that will be addressed in
hadoop 1.0.

-- amr


This shouldn't be a problem in a locked-down datacentre. Where it is a 
risk is when you host on EC2 or another VM-hosting service that doesn't 
set up private VPNs.


Re: datanode not being started

2009-02-17 Thread Steve Loughran

Sandy wrote:



Since I last used this machine, Parallels Desktop was installed by the
admin. I am currently suspecting that somehow this is interfering with the
function of Hadoop  (though Java_HOME still seems to be ok). Has anyone had
any experience with this being a cause of interference?



It could have added more than one virtual network adapter, and Hadoop is 
starting on the wrong adapter.


I don't think Hadoop handles this situation that well (yet), as you
 -need to be able to specify the adapter for every node
 -need to get rid of the notion of "I have a hostname" and move to "every 
network adapter has its own hostname"
I haven't done enough experiments to be sure. I do know that if you 
start a datanode with IP addresses for the filesystem, it works out the 
hostname and then complains if anyone tries to talk to it using the same 
IP-address URL it booted with.


-steve


Re: HADOOP-2536 supports Oracle too?

2009-02-17 Thread Steve Loughran

sandhiya wrote:

Hi,
I'm using postgresql and the driver is not getting detected. How do you run
it in the first place? I just typed 


bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar

at the terminal, without the quotes. I didn't copy any files from my local
drives into the Hadoop file system. I get an error like this :

java.lang.RuntimeException: java.lang.ClassNotFoundException:
org.postgresql.Driver

and then the complete stack trace

Am I doing something wrong?
I downloaded a jar file for postgresql jdbc support and included it in my
Libraries folder (I'm using NetBeans).
please help




JDBC drivers need to be (somehow) loaded before you can resolve the 
relevant jdbc: URLs; somehow your code needs to call 
Class.forName(jdbcDriverName), where that string is set to the 
relevant JDBC driver classname.
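
For the PostgreSQL case above, that means something like this in the job 
setup - a sketch; the connection URL and credentials are placeholders, and 
the driver JAR still has to be on the task classpath:

// preload the driver so DriverManager can resolve jdbc:postgresql: URLs later
Class.forName("org.postgresql.Driver");
DBConfiguration.configureDB(jobConf,
    "org.postgresql.Driver",
    "jdbc:postgresql://dbhost/mydb",    // placeholder connection URL
    "dbuser", "dbpassword");            // placeholder credentials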


Re: stable version

2009-02-16 Thread Steve Loughran

Anum Ali wrote:

The parser problem is related to jar files; it can be resolved and is not a bug.

Forwarding link for its solution


http://www.jroller.com/navanee/entry/unsupportedoperationexception_this_parser_does_not



That site is down; I can't see it.

It is a bug, because I view all operations problems as defects to be 
opened in the bug tracker, stack traces stuck in, the problem resolved. 
That's software or hardware - because that issue DB is your searchable 
history of what went wrong. Given that on my system I was seeing a 
ClassNotFoundException for loading FSConstants, there was no easy way to 
work out what went wrong, and it's cost me a couple of days' work.


furthermore, in the OSS world, every person who can't get your app to 
work is either going to walk away unhappy (=lost customer, lost 
developer and risk they compete with you), or they are going to get on 
the email list and ask questions, questions which may get answered, but 
it will cost them time.


Hence
* happyaxis.jsp: axis' diagnostics page, prints out useful stuff and 
warns if it knows it is unwell (and returns 500 error code so your 
monitoring tools can recognise this)
* ant -diagnostics: detailed look at your ant system including xml 
parser experiments.


Good open source tools have to be easy for people to get started with, 
and that means helpful error messages. If we left the code alone, 
knowing that the cause of a ClassNotFoundException was the fault of the 
user sticking the wrong XML parser on the classpath -and yet refusing to 
add the four lines of code needed to handle this- then we are letting 
down the users





On 2/13/09, Steve Loughran ste...@apache.org wrote:

Anum Ali wrote:

 This only occurs on Linux; on Windows it's fine.

do a java -version for me, and an ant -diagnostics, stick both on the bugrep

https://issues.apache.org/jira/browse/HADOOP-5254

It may be that XInclude only went live in java1.6u5; I'm running a
JRockit JVM which predates that and I'm seeing it (linux again);

I will also try sticking xerces on the classpath to see what happens next





Re: Namenode not listening for remote connections to port 9000

2009-02-13 Thread Steve Loughran

Michael Lynch wrote:

Hi,

As far as I can tell I've followed the setup instructions for a hadoop 
cluster to the letter,
but I find that the datanodes can't connect to the namenode on port 9000 
because it is only

listening for connections from localhost.

In my case, the namenode is called centos1, and the datanode is called 
centos2. They are

centos 5.1 servers with an unmodified sun java 6 runtime.


fs.default.name takes a URL to the filesystem, such as hdfs://centos1:9000/

If the machine is only binding to localhost, that may mean DNS fun. Try 
a fully qualified name instead.


Re: stable version

2009-02-13 Thread Steve Loughran

Anum Ali wrote:

yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote:


Anum Ali wrote:


Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread main java.lang.UnsupportedOperationException: This
parser does not support specification null version null  at

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.



You are using Java6, correct?






I'm seeing this too, filed the bug

https://issues.apache.org/jira/browse/HADOOP-5254

Any stack traces you can add on will help. Probable cause is 
https://issues.apache.org/jira/browse/HADOOP-4944


Re: stable version

2009-02-13 Thread Steve Loughran

Anum Ali wrote:

 This only occurs on Linux; on Windows it's fine.


do a java -version for me, and an ant -diagnostics, stick both on the bugrep

https://issues.apache.org/jira/browse/HADOOP-5254

It may be that XInclude only went live in java1.6u5; I'm running a 
JRockit JVM which predates that and I'm seeing it (linux again);


I will also try sticking xerces on the classpath to see what happens next


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread main java.lang.UnsupportedOperationException: This
parser does not support specification null version null  at
javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.




You are using Java6, correct?


Re: Best practices on spliltting an input line?

2009-02-12 Thread Steve Loughran

Stefan Podkowinski wrote:

I'm currently using OpenCSV which can be found at
http://opencsv.sourceforge.net/  but haven't done any performance
tests on it yet. In my case simply splitting strings would not work
anyways, since I need to handle quotes and separators within quoted
values, e.g. a,"a,b",c.


I've used it in the past; found it pretty reliable. Again, no perf 
tests, just reading in CSV files exported from spreadsheets


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote:


Anum Ali wrote:


Iam working on Hadoop SVN version 0.21.0-dev. Having some problems ,
regarding running its examples/file from eclipse.


It gives error for

Exception in thread main java.lang.UnsupportedOperationException: This
parser does not support specification null version null  at

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone reslove or give some idea about it.



You are using Java6, correct?





well, in that case something being passed down to setXIncludeAware may 
be picked up as invalid. More of a stack trace may help. Otherwise, now 
is your chance to learn your way around the hadoop codebase, and ensure 
that when the next version ships, your most pressing bugs have been fixed


Re: anybody knows an apache-license-compatible impl of Integer.parseInt?

2009-02-11 Thread Steve Loughran

Zheng Shao wrote:

We need to implement a version of Integer.parseInt/atoi from byte[] instead of 
String to avoid the high cost of creating a String object.

I wanted to take the open jdk code but the license is GPL:
http://www.docjar.com/html/api/java/lang/Integer.java.html

Does anybody know an implementation that I can use for hive (apache license)?


I also need to do it for Byte, Short, Long, and Double. Just don't want to go 
over all the corner cases.


Use the Apache Harmony code
http://svn.apache.org/viewvc/harmony/enhanced/classlib/branches/java6/modules/
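
The core of it is only a few lines; a minimal sketch for the int case, 
deliberately leaving out overflow handling and other corner cases - which 
are exactly the bits worth borrowing from tested code rather than rewriting:

// Parse a decimal int straight from a byte buffer, with no String allocation.
// Skips overflow checks and a lone '-' sign; a real implementation needs both.
public static int parseInt(byte[] bytes, int start, int length) {
  if (length <= 0) {
    throw new NumberFormatException("empty input");
  }
  int i = start;
  boolean negative = bytes[i] == '-';
  if (negative) {
    i++;
  }
  int result = 0;
  for (; i < start + length; i++) {
    int digit = bytes[i] - '0';
    if (digit < 0 || digit > 9) {
      throw new NumberFormatException("bad digit at offset " + i);
    }
    result = result * 10 + digit;
  }
  return negative ? -result : result;
}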


Re: File Transfer Rates

2009-02-11 Thread Steve Loughran

Brian Bockelman wrote:
Just to toss out some numbers (and because our users are making 
interesting numbers right now)


Here's our external network router: 
http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets 



Here's the application-level transfer graph: 
http://t2.unl.edu/phedex/graphs/quantity_rates?link=srcno_mss=trueto_node=Nebraska 



In a squeeze, we can move 20-50TB / day to/from other heterogenous 
sites.  Usually, we run out of free space before we can find the upper 
limit for a 24-hour period.


We use a protocol called GridFTP to move data back and forth between 
external (non-HDFS) clusters.  The other sites we transfer with use 
niche software you probably haven't heard of (Castor, DPM, and dCache) 
because, well, it's niche software.  I have no available data on 
HDFS-S3 systems, but I'd again claim it's mostly a function of the 
amount of hardware you throw at it and the size of your network pipes.


There are currently 182 datanodes; 180 are traditional ones of 3TB 
and 2 are big honking RAID arrays of 40TB.  Transfers are load-balanced 
amongst ~ 7 GridFTP servers which each have 1Gbps connection.




GridFTP is optimised for high-bandwidth network connections, with 
negotiated packet sizes and multiple TCP connections, so when TCP backs off 
after a dropped packet only a fraction of the transfer is affected. It is 
probably best-in-class for long-haul transfers over the big university 
backbones where someone else pays for your traffic. You would be very hard 
pressed to get even close to that on any other protocol.


I have no data on S3 xfers other than hearsay
 * write time to S3 can be slow as it doesn't return until the data is 
persisted somewhere. That's a better guarantee than a posix write 
operation.
 * you have to rely on other people on your rack not wanting all the 
traffic for themselves. That's an EC2 API issue: you don't get to 
request/buy bandwidth to/from S3


One thing to remember is that if you bring up a Hadoop cluster on any 
virtual server farm, disk IO is going to be way below physical IO rates. 
Even when the data is in HDFS, it will be slower to get at than 
dedicated high-RPM SCSI or SATA storage.


Re: settin JAVA_HOME...

2009-02-02 Thread Steve Loughran

haizhou zhao wrote:

hi Sandy,
Every time I change the conf, I have to do the following two things:
1. kill all hadoop processes
2. manually delete all the files under hadoop.tmp.dir
to make sure hadoop runs correctly; otherwise it won't work.

Is this caused by my using a JDK instead of Sun Java?



No, you need to do that to get configuration changes picked up. There 
are scripts in hadoop/bin to help you



and what do you mean
by sun-java, please?



Sandy means

* sun-java6-jdk: Sun's released JDK
* default-jdk: whichever JDK Ubuntu chooses. On 8.10, it is OpenJDK
* openjdk-6-jdk: the full open source version of the JDK. Worse font 
rendering code, but comes with more source


Others
* Oracle JRockit: good 64-bit memory management, based on the Sun JDK. Unsupported.
* IBM JVM: based on the Sun JDK. Unsupported.
* Apache Harmony: clean-room rewrite of everything. Unsupported.
* Kaffe: unsupported.
* Gcj: unsupported.

type java -version to get your java version


Sun

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b14, mixed mode)

JRockit:

java version 1.6.0_02
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
BEA JRockit(R) (build 
R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64, compiled mode)









2009/1/31 Sandy snickerdoodl...@gmail.com


Hi Zander,

Do not use jdk. Horrific things happen. You must use sun java in order to
use hadoop.





Re: tools for scrubbing HDFS data nodes?

2009-01-29 Thread Steve Loughran

Sriram Rao wrote:

Does this read every block of every file from all replicas and verify
that the checksums are good?

Sriram


The DataBlockScanner thread on every datanode does this for you 
automatically. You can tune the rate at which it reads, but it reads in all 
the local blocks, verifies them against their stored checksums, and deals 
with failures by reporting a list of them to the namenode after the scan. 
After that, it's the namenode's problem how to deal with the corrupt block. 
In an ideal system, at least one non-corrupt copy of the block is still live.


the configuration attribute dfs.datanode.scan.period.hours can tune the 
scan rate


Re: Cannot run program chmod: error=12, Not enough space

2009-01-29 Thread Steve Loughran

Andy Liu wrote:

I'm running Hadoop 0.19.0 on Solaris (SunOS 5.10 on x86) and many jobs are
failing with this exception:


This isn't disk space; this is a RAM/swap problem related to fork(). 
Search for that error string on the Hadoop JIRA; I think it's been seen 
before on Linux, though there the namenode was at fault.

Error initializing attempt_200901281655_0004_m_25_0:
java.io.IOException: Cannot run program chmod: error=12, Not enough space
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
at 
org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:532)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:274)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at 
org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:107)
at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:1803)
at 
org.apache.hadoop.mapred.TaskTracker$TaskInProgress.launchTask(TaskTracker.java:1884)
at 
org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:784)
at 
org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:778)
at 
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1636)
at 
org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:102)
at 
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1602)
Caused by: java.io.IOException: error=12, Not enough space
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 20 more

However, all the disks have plenty of disk space left (over 800 gigs).  Can
somebody point me in the right direction?

Thanks,
Andy




--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Zeroconf for hadoop

2009-01-27 Thread Steve Loughran

Edward Capriolo wrote:

Zeroconf is more focused on simplicity than security. One of the
original problems, which may have been fixed, is that any program can
announce any service, i.e. my laptop can announce that it is the DNS for
google.com etc.



-1 to zeroconf as it is way too chatty. Every DNS lookup is multicast; in a 
busy network a lot of CPU time is spent discarding requests. Nor does it 
handle failure that well. It's OK on a home LAN to find a music player, 
but not what you want for HA infrastructure in the datacentre.


Our LAN discovery tool, Anubis, uses multicast only to do the initial 
discovery; the nodes then vote to select a nominated server that everyone 
just unicasts to from that point on, and failure of that node or a network 
partition triggers a rebinding.


See: http://wiki.smartfrog.org/wiki/display/sf/Anubis ; the paper 
discusses some of the fun you have, though it doesn't cover the clock 
drift issues you can encounter when running Xen or VMware-hosted nodes.




I want to mention a related topic to the list. People are approaching
the auto-discovery in a number of ways jira. There are a few ways I
can think of to discover hadoop. A very simple way might be to publish
the configuration over a web interface. I use a network storage system
called gluster-fs. Gluster can be configured so the server holds the
configuration for each client. If the hadoop name node held the entire
configuration for all the nodes the namenode would only need to be
aware of the namenode and it could retrieve its configuration from it.

Having a central configuration management or a discovery system would
be very useful. HOD is what I think to be the closest thing it is more
of a top down deployment system.


Allen is a fan of a well managed cluster; he pushes out Hadoop as RPMs 
via PXE and Kickstart and uses LDAP as the central CM tool. I am 
currently exploring bringing up virtual clusters by
 * putting the relevant RPMs out to all nodes; same files/conf for 
every node,
 * having custom configs for Namenode and job tracker; everything else 
becomes a Datanode with a task tracker bound to the masters.
I will start worrying about discovery afterwards, because without the 
ability for the Job Tracker or Namenode to do failover to a fallback Job 
Tracker or Namenode, you don't really need so much in the way of dynamic 
cluster binding.


-steve


Re: HDFS - millions of files in one directory?

2009-01-26 Thread Steve Loughran

Philip (flip) Kromer wrote:

I ran in this problem, hard, and I can vouch that this is not a windows-only
problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
than a few hundred thousand files in the same directory. (The operation to
correct this mistake took a week to run.)  That is one of several hard
lessons I learned about don't write your scraper to replicate the path
structure of each document as a file on disk.


I've seen a fair few machines (one of the network store programs) top 
out at 65K files/dir; it shows why it is good to test your assumptions 
before you go live.




Re: running hadoop on heterogeneous hardware

2009-01-22 Thread Steve Loughran

Bill Au wrote:

Is hadoop designed to run on homogeneous hardware only, or does it work just
as well on heterogeneous hardware as well?  If the datanodes have different
disk capacities, does HDFS still spread the data blocks equally amount all
the datanodes, or will the datanodes with high disk capacity end up storing
more data blocks?  Similarily, if the tasktrackres have different numbers of
CPUs, is there a way to configure hadoop to run more tasks on those
tasktrackers that have more CPUs?  Is that simply a matter of setting
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum differently on the tasktrackers?

Bill



Life is simpler on homogeneous boxes; by setting the maximum tasks 
differently for the different machines, you do limit the amount of work 
that gets pushed out to those boxes. More troublesome are slower 
CPUs/HDDs: they aren't picked up directly, though speculative execution can 
handle some of this.


One interesting bit of research would be something adaptive; something 
to monitor throughput and tune those values based on performance; that 
would detect variations in a cluster and work with with it, rather than 
requiring you  to know the capabilities of every machine.


-steve


Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran

Amit k. Saha wrote:

On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
matthias.sche...@1und1.de wrote:

Hi all,

we've made our first steps in evaluating hadoop. The setup of 2 VMs as a
hadoop grid was very easy and works fine.

Now our operations team wonders why hadoop has to be able to connect to
the master and slaves via password-less ssh?! Can anyone give us an
answer to this question?


1. There has to be a way to connect to the remote hosts- slaves and a
secondary master, and SSH is the secure way to do it
2. It has to be password-less to enable automatic logins



SSH is *a* secure way to do it, but not the only way. Other management 
tools can bring up Hadoop clusters. Hadoop ships with scripted support 
for SSH as it is standard with Linux distros and generally the best way 
to bring up a remote console.


Matthias,
Your ops team should not be worrying about the SSH security, as long as 
they keep their keys under control.


(a) Key-based SSH is more secure than passworded SSH, as man-in-the-middle 
attacks are prevented. Passphrase-protected SSH keys on external USB 
keys are even better.


(b) once the cluster is up, that filesystem is pretty vulnerable to 
anything on the LAN. You do need to lock down your datacentre, or set up 
the firewall/routing of the servers so that only trusted hosts can talk 
to the FS. SSH becomes a detail at that point.





Re: AW: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Steve Loughran

Matthias Scherer wrote:

Hi Steve and Amit,

Thanks for your answers. I agree with you that key-based SSH is nothing to 
worry about. But I'm wondering what exactly - that means which grid 
administration tasks - Hadoop does via SSH. Does it restart crashed datanodes 
or task trackers on the slaves? Or does it transfer data over the grid with 
SSH access? How can I find a short description of what exactly Hadoop needs SSH 
for? The documentation says only that I have to configure it.

Thanks  Regards
Matthias



SSH is used by the various scripts in bin/ to start and stop clusters; 
slaves.sh does the work, and the other ones (like hadoop-daemons.sh) use it 
to run commands on the machines.


The EC2 scripts use SSH to talk to the machines brought up there; when 
you ask Amazon for machines, you give it a public key to be added to root's 
authorized keys list, and you use that to SSH in and run code.


There is currently no liveness/restarting built into the scripts; you 
need other things to do that. I am working on this, with  HADOOP-3628, 
https://issues.apache.org/jira/browse/HADOOP-3628


I will be showing some other management options at ApacheCon EU 2009, 
which being on the same continent and timezone is something you may want 
to consider attending; lots of Hadoop people will be there, with some 
all-day sessions on it.

http://eu.apachecon.com/c/aceu2009/sessions/227

One big problem with cluster management is not just recognising failed 
nodes, it's handling them. The actions you take are different with a 
VM-cluster like EC2 (fix: reboot, then kill that AMI and create a new 
one), from that of a VM-ware/Xen-managed cluster, to that of physical 
systems (Y!: phone Allen, us: email paolo). Once we have the health 
monitoring in there different people will need to apply their own policies.


-steve

--
Steve Loughran  http://www.1060.org/blogxter/publish/5



Re: Java RMI and Hadoop RecordIO

2009-01-20 Thread Steve Loughran

David Alves wrote:

Hi
I've been testing some different serialization techniques, to go 
along with a research project.
I know motivation behind hadoop serialization mechanism (e.g. 
Writable) and the enhancement of this feature through record I/O is not 
only performance, but also control of the input/output.
Still I've been running some simple tests and I've foud that plain 
RMi beats Hadoop RecordIO almost every time (14-16% faster).
In my test I have a simple java class that has 14 int fields and 1 
long field and I'm serializing aroung 35000 instances.
Am I doing anything wrong? are there ways to improve performance in 
RecordIO? Have I got the use case wrong?

Regards

David Alves



- Any speedups are welcome; people are looking at ProtocolBuffers and 
Thrift

- Are you also measuring packet size and deserialization costs?
- add a string or two
- and references to other instances
- then try pushing a few million round the network using the same 
serialization stream instance



I do use RMI a lot at work; once you come up with a plan to deal with 
its brittleness against change (we keep the code in the cluster up to 
date and make no guarantees about compatibility across versions), it is 
easy to use. But it has many, many problems, and if you hit one, as 
the code is deep in the JVM, it is very hard to deal with. One example: 
RMI sends an object graph over and likes to make sure it hasn't pushed a 
copy over earlier, so the longer you keep a serialization stream up, the 
slower it gets.
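
When you own the ObjectOutputStream yourself (RMI's internal streams are out 
of your hands), the usual mitigation for that last point is to reset the 
stream periodically - a sketch, with MyRecord and records standing in for 
whatever you are sending:

ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
for (MyRecord r : records) {
  out.writeObject(r);
  out.reset();   // drop the back-reference table so it doesn't grow without bound
}
out.flush();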


-steve


Re: RAID vs. JBOD

2009-01-15 Thread Steve Loughran

Runping Qi wrote:

Hi,

We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
performed better.

Gridmix tests:

Load: gridmix2
Cluster size: 190 nodes
Test results:

RAID0: 75 minutes
JBOD:  67 minutes
Difference: 10%

Tests on HDFS writes performances

We ran map only jobs writing data to dfs concurrently on different clusters.
The overall dfs write throughputs on the jbod cluster are 30% (with a 58
nodes cluster) and 50% (with an 18 nodes cluster) better than that on the
raid0 cluster, respectively.

To understand why, we did some file level benchmarking on both clusters.
We found that the file write throughput on a JBOD machine is 30% higher than
that on a comparable machine with RAID0. This performance difference may be
explained by the fact that the throughputs of different disks can vary 30%
to 50%. With such variations, the overall throughput of a raid0 system may
be bottlenecked by the slowest disk.
 


-- Runping



This is really interesting. Thank you for sharing these results!

Presumably the servers were all set up with nominally homogeneous 
hardware? And yet the variations still existed. That would be something 
to experiment with on new versus old clusters, to see if it gets worse 
over time.


Here we have a batch of desktop workstations all bought at the same 
time, to the same spec, but one of them, "lucky", is more prone to race 
conditions than any of the others. We don't know why, and assume it's to do 
with the (multiple) Xeon CPU chips being at different ends of the bell 
curve or something. All we know is: test on that box before shipping, to 
find race conditions early.


-steve


Re: Indexed Hashtables

2009-01-15 Thread Steve Loughran

Sean Shanny wrote:

Delip,

So far we have had pretty good luck with memcached.  We are building a 
hadoop based solution for data warehouse ETL on XML based log files that 
represent click stream data on steroids.


We process about 34 million records or about 70 GB data a day.  We have 
to process dimensional data in our warehouse and then load the surrogate 
keyvalue pairs in memcached so we can traverse the XML files once 
again to perform the substitutions.  We are using the memcached solution 
because is scales out just like hadoop.  We will have code that allows 
us to fall back to the DB if the memcached lookup fails but that should 
not happen to often.




LinkedIn have just opened up something they run internally, Project 
Voldemort:


http://highscalability.com/product-project-voldemort-distributed-database
http://project-voldemort.com/

It's a DHT, Java based. I haven't played with it yet, but it looks like 
a good part of the portfolio.




Re: issues with hadoop in AIX

2009-01-05 Thread Steve Loughran

Allen Wittenauer wrote:

On 12/27/08 12:18 AM, Arun Venugopal arunvenugopa...@gmail.com wrote:

Yes, I was able to run this on AIX as well with a minor change to the
DF.java code. But this was more of a proof of concept than on a
production system.


There are lots of places where Hadoop (esp. in contrib) interprets the
output of Unix command line utilities. Changes like this are likely going to
be required for AIX and other Unix systems that aren't being used by a
committer. :(



that aren't being used in the test process, equally importantly


I think hudson runs on Solaris, doesn't it?

