Re: Sorting the OutputCollector
On Apr 8, 2008, at 4:54 AM, Aayush Garg wrote: I construct this type of key/value pairs in the output collector of the reduce phase. Now I want to SORT this output collector in decreasing order of the frequency value in my Custom class. Could someone suggest a possible way to do this?

In order to re-sort the output of the reduce, you need to run a second job that takes the first job's output as its input. -- Owen
Lucene on Hadoop
Hi, From this URL http://www.mail-archive.com/[EMAIL PROTECTED]/msg00998.html I see that Hadoop is not suitable for incremental updates if the inverted index is built on it. What's more, Nutch has adopted Hadoop, which means the incremental-update ability provided by Lucene will not work in Nutch. It is also suggested to experiment with HBase, which is modeled after BigTable (itself built on GFS). Since HBase is based on Hadoop, what is the difference if HBase is used for incremental indexing? Thanks a lot for your attention. Best Yingfeng
RE: Reduce Sort
Ted, I am using IntWritable; the default, as you mentioned, is an ascending-order sort. Do you know how to set this to sort in descending order? I checked the API for IntWritable, LongWritable and JobConf but couldn't find any methods. Thanks, Senthil

-----Original Message----- From: Ted Dunning [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 08, 2008 11:53 AM To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]' Subject: Re: Reduce Sort

There are two ways to do this. Both of them assume that you have counted the addresses using map-reduce and that the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively small, simply sorting the results with a conventional sort is probably as good as it gets. This takes just a few lines of scripting code:

  base=http://your-nameserver-here/data
  wget $base/your-results-directory/part-0 --output-document=a
  wget $base/your-results-directory/part-1 --output-document=b
  sort -k2,2nr a b > where-you-want-the-output

(The sort key is the second field, the count, sorted numerically in reverse.) It would be convenient if there were a URL that allowed you to retrieve the concatenation of a wild-carded list of files, but the method above isn't bad. You may be unhappy at the perceived impurity of this approach, but I would ask you to think about why one might use Hadoop at all. The best reason is to get high performance on large problems. The sorting part of this problem is not that big a deal, and a conventional sort is probably the most effective approach here.

You can also do the sorting using Hadoop. Just use a mapper that moves the count into the key and keeps the IP as the value. If you use an IntWritable or LongWritable as the key, the default sorting gives you ascending order. You can also define the sort order so that you get descending order. Make sure you set the number of reducers to 1 so that you only get a single output file.
If you have fewer than 10 million values, the conventional sort is likely to be faster simply because of Hadoop's startup time.

On 4/8/08 8:37 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote: Hi, I am new to MapReduce. After slightly modifying the wordcount example to count IP addresses, I have two files, part-0 and part-1, with contents something like:

  IP Add      Count
  1.2.5.42    27
  2.8.6.6     24
  7.9.24.13   8
  7.9.6.9     201

I want to sort it by IP address count in descending order, i.e. I would expect to see:

  7.9.6.9     201
  1.2.5.42    27
  2.8.6.6     24
  7.9.24.13   8

Could you please suggest how to do this? Also, to merge both partitions (part-0 and part-1) into one output file, are there any functions already available in the MapReduce framework, or do we need to use Java IO for this? Thanks, Senthil
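Ted's second approach, defining the sort order so that the largest counts come out first, comes down to a comparator that reverses the natural order of the count key. In a real job this logic would live in a WritableComparator subclass registered on the JobConf; the sketch below (class name and sample counts are illustrative, not from the thread) shows just the reversal in plain Java:

```java
import java.util.Arrays;
import java.util.Comparator;

// Reverses the natural (ascending) order of integer counts. In a Hadoop
// job, the same two-line comparison would sit inside a WritableComparator
// subclass so the shuffle sorts IntWritable keys largest-first.
public class DescendingCount implements Comparator<Integer> {
    public int compare(Integer a, Integer b) {
        return b.compareTo(a); // flip the operands to flip the order
    }

    public static void main(String[] args) {
        Integer[] counts = { 27, 24, 8, 201 };
        Arrays.sort(counts, new DescendingCount());
        System.out.println(Arrays.toString(counts)); // [201, 27, 24, 8]
    }
}
```

With one reducer, as Ted suggests, the single output file then holds all records in descending count order.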
Re: Sorting the OutputCollector
But the problem is that I need to sort according to freq, which is part of my value field... Any inputs? Could you provide a small piece of code illustrating your thought?

On Wed, Apr 9, 2008 at 9:45 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Apr 8, 2008, at 4:54 AM, Aayush Garg wrote: I construct this type of key/value pairs in the output collector of the reduce phase. Now I want to SORT this output collector in decreasing order of the frequency value in my Custom class. Could someone suggest a possible way to do this? In order to re-sort the output of the reduce, you need to run a second job that takes the first job's output as its input. -- Owen

-- Aayush Garg, Phone: +41 76 482 240
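The freq-inside-the-value problem reduces to the swap Owen and Ted describe: the second job's mapper emits (freq, original key) so the framework sorts on freq. Below is a minimal plain-Java simulation of that pass (record format, class and method names are mine, not from the thread); in the real second job the loop body would be the Mapper and the explicit sort would be Hadoop's shuffle:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortByFrequency {
    // Input lines mimic the first job's output: "term<TAB>freq".
    // The "mapper" swaps freq into the key slot; the explicit sort stands
    // in for the shuffle, using descending numeric order on the new key.
    public static List<String> resort(List<String> firstJobOutput) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String line : firstJobOutput) {
            String[] kv = line.split("\t");           // kv[0] = term, kv[1] = freq
            pairs.add(new String[] { kv[1], kv[0] }); // freq becomes the key
        }
        pairs.sort((a, b) ->
            Integer.compare(Integer.parseInt(b[0]), Integer.parseInt(a[0])));
        List<String> out = new ArrayList<String>();
        for (String[] p : pairs) {
            out.add(p[1] + "\t" + p[0]);              // emit "term<TAB>freq" again
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("apple\t3", "pear\t27", "plum\t8");
        System.out.println(resort(in)); // pear (27) first, then plum (8), then apple (3)
    }
}
```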
need help
Hi, I started using Hadoop very recently and I am stuck with the basic example. When I try to run

  bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

I get this output:

08/04/09 21:23:12 INFO mapred.FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://localhost:9000/user/Administrator/input/conf
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
        at org.apache.hadoop.examples.WordCount.run(WordCount.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:153)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

Where are these libraries residing? How should I configure this? Thanks Regards, Krishna.
Has anyone tried to build Hadoop?
Has anyone tried to build Hadoop? Thanks Regards, Krishna. Meet people who discuss and share your passions. Go to http://in.promos.yahoo.com/groups/bestofyahoo/
Re: Has anyone tried to build Hadoop?
The Ant script works well also. Jean-Daniel

2008/4/9, Khalil Honsali [EMAIL PROTECTED]: Hi, With Eclipse it's easy; you just have to add it as a new project and make sure you add all the libraries in the lib folder, and it should compile fine. There is also an Eclipse plugin for running Hadoop jobs directly from Eclipse on an installed Hadoop.

On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote: Has anyone tried to build Hadoop? Thanks Regards, Krishna.
Re: Has anyone tried to build Hadoop?
Mr. Jean-Daniel, where is the Ant script, please?

On 10/04/2008, Jean-Daniel Cryans [EMAIL PROTECTED] wrote: The Ant script works well also. Jean-Daniel 2008/4/9, Khalil Honsali [EMAIL PROTECTED]: Hi, With Eclipse it's easy; you just have to add it as a new project and make sure you add all the libraries in the lib folder, and it should compile fine. There is also an Eclipse plugin for running Hadoop jobs directly from Eclipse on an installed Hadoop. On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote: Has anyone tried to build Hadoop? Thanks Regards, Krishna.
Hadoop vs. MogileFS
Hi! I'm new to DFS and wondering what the advantages and disadvantages of Hadoop and MogileFS are, from anyone who has experienced both. Thanks, Garri
Hadoop performance on EC2?
Hey all, We've got a job that we're running in both a development environment and out on EC2. I've been rather displeased with the performance on EC2, and was curious if the results we've been seeing are similar to other people's, or if I've got something misconfigured. ;)

In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we've got RAM to blow), with very little performance difference. Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. The portion that takes all the time is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPUs between our dev environment and EC2, and they are about equivalent. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that. Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on Amazon before, and I switched to local storage)
- MySQL database running on the master node
- Xen VMs in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPUs (Xeon E5335 2GHz)
- 3GB memory
- 4GB swap

Development slave node configuration:
- 3x VCPUs (Xeon E5335 2GHz)
- 2GB memory
- 4GB swap

EC2 configuration (Large instance type):
- 2x VCPUs (Opteron 2GHz)
- 8GB memory
- 4GB swap
- All nodes running in the same availability zone

| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
| depriving some poor village of its idiot since 1981 |
hdfs 100T?
Hello list, I was unable to access the archives for this list, as http://hadoop.apache.org/mail/core-user/ returns a 403. I am interested in using HDFS for storage, and in map/reduce only tangentially. I see clusters mentioned in the docs with many, many nodes and 9TB of disk. Is HDFS expected to scale to 100TB? Does it require massive parallelism to scale to many files? For instance, do you think it would slow down drastically in a 2-node, 32TB config? The workload is file serving at about 100 Mbit/s, 20 req/sec. Any input is appreciated :) -- Todd Troxell http://rapidpacket.com/~xtat
Re: Hadoop performance on EC2?
A few things... Make sure all nodes are running in the same 'availability zone' (http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347) and that you are using the new Xen kernels:

  http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
  http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101

Also, make sure each node is addressing its peers via the EC2 private addresses, not the public ones. There is a patch in JIRA for the ec2/contrib scripts that addresses these issues: https://issues.apache.org/jira/browse/HADOOP-2410. If you use those scripts, you will be able to see a Ganglia display showing utilization on the machines. 8/7 map/reducers sounds like a lot. YMMV.

On Apr 9, 2008, at 7:07 PM, Nate Carlson wrote: Hey all, We've got a job that we're running in both a development environment and out on EC2. I've been rather displeased with the performance on EC2, and was curious if the results we've been seeing are similar to other people's, or if I've got something misconfigured. ;) In both environments, the load on the master node is around 1-1.5, and the load on the slave nodes is around 8-10. I have also tried cranking up the JVM memory on the EC2 nodes (since we've got RAM to blow), with very little performance difference. Basically, the job takes about 3.5 hours on development, but takes 15 hours on EC2. The portion that takes all the time is not dependent on any external hosts - just the MySQL server on the master node. I benchmarked the VCPUs between our dev environment and EC2, and they are about equivalent. I would expect EC2 to take 1.5x as long, since there is one less CPU per slave, but it's taking much longer than that. Appreciate any tips!
Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
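Chris's point that 8 maps and 7 reduces per slave is a lot for a two-VCPU instance maps onto the per-tasktracker caps in hadoop-site.xml. A sketch of the relevant fragment, assuming the 0.16-era property names; the values are illustrative starting points (roughly the VCPU count), not tuned numbers:

```xml
<!-- hadoop-site.xml on each EC2 slave: cap concurrent tasks near the
     number of VCPUs instead of the 8 maps / 7 reduces used on dev -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

The tasktrackers read these at startup, so the daemons on the slaves need a restart after the change.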