Re: Sorting the OutputCollector

2008-04-09 Thread Owen O'Malley

On Apr 8, 2008, at 4:54 AM, Aayush Garg wrote:

I construct this type of key/value pairs in the OutputCollector of the
reduce phase. Now I want to SORT this output in decreasing order of the
frequency value in my Custom class.
Could someone suggest a possible way to do this?


In order to re-sort the output of the reduce, you need to run a second
job that takes the first job's output as its input.
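
A minimal sketch of what that second job's mapper might look like, assuming
the first job writes a SequenceFile of (Text, Custom) pairs and that the
Custom class exposes a getFreq() accessor (both assumptions; the Custom
class itself is not shown in this thread):

   import java.io.IOException;

   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.Mapper;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reporter;

   // Re-keys each record by its frequency so the framework sorts on it.
   // Custom and getFreq() are hypothetical stand-ins for the poster's
   // value class.
   public class ReSortMapper extends MapReduceBase
       implements Mapper<Text, Custom, IntWritable, Text> {
     public void map(Text word, Custom value,
                     OutputCollector<IntWritable, Text> out, Reporter reporter)
         throws IOException {
       out.collect(new IntWritable(value.getFreq()), word);
     }
   }

Paired with an identity reducer and a descending key comparator (see the
Reduce Sort thread below), this yields output in decreasing order of
frequency.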


-- Owen


Lucene on Hadoop

2008-04-09 Thread Yingfeng Zhang
Hi,
From this URL
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00998.html
I see that Hadoop is not well suited for incremental updates when the
inverted files are stored on it; what's more, Nutch has adopted Hadoop,
which means the incremental-update ability provided by Lucene will not work
in Nutch.

It is also suggested there to experiment with HBase, a BigTable-like system
(BigTable itself is built on GFS). But since HBase is based on Hadoop, what
difference would using HBase for incremental indexing make? Thanks a lot
for your attention.


Best
Yingfeng


RE: Reduce Sort

2008-04-09 Thread Natarajan, Senthil
Ted,
I am using IntWritable; the default, as you mentioned, is an ascending-order
sort. Do you know how to set this to sort in descending order?
I checked the API for IntWritable, LongWritable, and JobConf, but I couldn't
find any relevant methods.
Thanks,
Senthil
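
Assuming the 0.16-era API, one possible answer: the comparator hook lives on
JobConf rather than on the Writable classes, and LongWritable, unlike
IntWritable, ships with a built-in decreasing comparator:

   // On the JobConf of the sorting job; assumes LongWritable keys.
   // Sorts keys in descending order.
   conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);

Ted's reply below sketches the surrounding job.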

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 11:53 AM
To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]'
Subject: Re: Reduce Sort


There are two ways to do this.  Both of them assume that you have counted
the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively
small, simply sorting the results using conventional sort is probably as
good as it gets.  This will take just a few lines of scripting code:

   base=http://your-nameserver-here/data
   wget $base/your-results-directory/part-0 --output-document=a
   wget $base/your-results-directory/part-1 --output-document=b
   sort -k1nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve
the concatenation of a wild-carded list of files, but the method I show
above isn't bad.

You are likely to be unhappy at the perceived impurity of this approach, but
I would ask you to think about why one might use hadoop at all.  The best reason
is to get high performance on large problems.  The sorting part of this
problem is not all that big a deal and using a conventional sort is probably
the most effective approach here.

You can also do the sorting using hadoop.  Just use a mapper that moves the
count to the key and keeps the IP as the value.  I think that if you use an
IntWritable or LongWritable as the key then the default sorting would give
you ascending order.  You can also define the sort order so that you get
descending order.  Make sure you set the number of reducers to 1 so that you
only get a single output file.
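
A minimal sketch of that second job, assuming the 0.16-era API and that the
first job's output lines look like "ip<TAB>count"; the class name and path
handling here are illustrative, not from the original thread:

   import java.io.IOException;

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.JobClient;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.Mapper;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reporter;
   import org.apache.hadoop.mapred.lib.IdentityReducer;

   public class SortByCount {

     // Moves the count into the key so the framework sorts on it.
     public static class SwapMapper extends MapReduceBase
         implements Mapper<LongWritable, Text, LongWritable, Text> {
       public void map(LongWritable offset, Text line,
                       OutputCollector<LongWritable, Text> out, Reporter reporter)
           throws IOException {
         String[] parts = line.toString().split("\t");
         out.collect(new LongWritable(Long.parseLong(parts[1].trim())),
                     new Text(parts[0]));
       }
     }

     public static void main(String[] args) throws IOException {
       JobConf conf = new JobConf(SortByCount.class);
       conf.setJobName("sort-by-count");
       conf.setMapperClass(SwapMapper.class);
       conf.setReducerClass(IdentityReducer.class); // pass records through sorted
       conf.setOutputKeyClass(LongWritable.class);
       conf.setOutputValueClass(Text.class);
       // Built-in comparator that inverts LongWritable's natural order.
       conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
       conf.setNumReduceTasks(1);                   // one sorted output file
       conf.setInputPath(new Path(args[0]));
       conf.setOutputPath(new Path(args[1]));
       JobClient.runJob(conf);
     }
   }

The single reducer yields one totally ordered part file, which also answers
the merge question below: with one reducer there is nothing left to merge.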

If you have fewer than 10 million values, the conventional sort is likely to
be faster simply because of hadoop's startup time.


On 4/8/08 8:37 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote:

 Hi,
 I am new to MapReduce.

 After slightly modifying the wordcount example to count IP addresses,
 I have two files, part-0 and part-1, with contents something like:

 IP Add       Count
 1.2. 5. 42   27
 2.8. 6. 6    24
 7.9.24.13    8
 7.9. 6. 9    201

 I want to sort it by IP address count in descending order, i.e. I would
 expect to see:

 7.9. 6. 9    201
 1.2. 5. 42   27
 2.8. 6. 6    24
 7.9.24.13    8

 Could you please suggest how to do this?
 Also, to merge both partitions (part-0 and part-1) into one output file,
 are there any functions already available in the MapReduce framework, or
 do we need to use Java I/O for this?
 Thanks,
 Senthil



Re: Sorting the OutputCollector

2008-04-09 Thread Aayush Garg
But the problem is that I need to sort according to freq, which is part of
my value field. Any inputs? Could you provide a small piece of code
illustrating your idea?


On Wed, Apr 9, 2008 at 9:45 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

 On Apr 8, 2008, at 4:54 AM, Aayush Garg wrote:

  I construct this type of key/value pairs in the OutputCollector of the
  reduce phase. Now I want to SORT this output in decreasing order of the
  frequency value in my Custom class.
  Could someone suggest a possible way to do this?
 

 In order to re-sort the output of the reduce, you need to run a second
 job that takes the first job's output as its input.

 -- Owen




-- 
Aayush Garg,
Phone: +41 76 482 240


need help

2008-04-09 Thread krishna prasanna
Hi, I started using Hadoop very recently and I am stuck with the basic
example. When I try to run

bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

I get the following output:

08/04/09 21:23:12 INFO mapred.FileInputFormat: Total input paths to process : 2
java.io.IOException: Not a file: hdfs://localhost:9000/user/Administrator/input/conf
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
        at org.apache.hadoop.examples.WordCount.run(WordCount.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:153)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:49)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

Where are these libraries residing? How should I configure this?

Thanks & Regards,
Krishna.



Has anyone tried to build Hadoop?

2008-04-09 Thread krishna prasanna

Has anyone tried to build Hadoop?

Thanks & Regards,
Krishna.



Re: Has anyone tried to build Hadoop?

2008-04-09 Thread Jean-Daniel Cryans
The Ant script works well also.

Jean-Daniel

2008/4/9, Khalil Honsali [EMAIL PROTECTED]:

 Hi,
 With Eclipse it's easy: you just have to add it as a new project and make
 sure you add all the libraries in the lib folder, and it should compile fine.
 There is also an Eclipse plugin for running Hadoop jobs directly from
 Eclipse on an installed Hadoop.


 On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
 
 
  Has anyone tried to build Hadoop?
 
  Thanks & Regards,
  Krishna.
 
 



Re: Has anyone tried to build Hadoop?

2008-04-09 Thread Khalil Honsali
Mr. Jean-Daniel,

Where is the Ant script, please?

On 10/04/2008, Jean-Daniel Cryans [EMAIL PROTECTED] wrote:

 The Ant script works well also.

 Jean-Daniel

 2008/4/9, Khalil Honsali [EMAIL PROTECTED]:

 
  Hi,
  With Eclipse it's easy: you just have to add it as a new project and make
  sure you add all the libraries in the lib folder, and it should compile fine.
  There is also an Eclipse plugin for running Hadoop jobs directly from
  Eclipse on an installed Hadoop.
 
 
  On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
  
  
   Has anyone tried to build Hadoop?
  
   Thanks & Regards,
   Krishna.
  
  
 






Hadoop VS MogileFS

2008-04-09 Thread Garri Santos
Hi!

I'm new to DFS and am wondering what the advantages and disadvantages of
Hadoop and MogileFS are, from anyone who has experienced both.


Thanks,
Garri


Hadoop performance on EC2?

2008-04-09 Thread Nate Carlson

Hey all,

We've got a job that we're running in both a development environment, and 
out on EC2.  I've been rather displeased with the performance on EC2, and 
was curious if the results that we've been seeing are similar to other 
people's, or if I've got something misconfigured.  ;)  In both 
environments, the load on the master node is around 1-1.5, and the load on
the slave nodes is around 8-10. I have also tried cranking up the JVM
memory on the EC2 nodes (since we got RAM to blow), with very little 
performance difference.


Basically, the job takes about 3.5 hours in development, but takes 15
hours on EC2. The portion that takes all the time is not dependent on any
external hosts - just the MySQL server on the master node.  I benchmarked
the VCPUs between our dev environment and EC2, and they are about
equivalent. I would expect EC2 to take 1.5x as long, since there is
one less CPU per slave, but it's taking much longer than that.


Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the
  slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on amazon before, and I switched to
  local storage)
- MySQL database running on the master node
- Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPU's (Xeon E5335 2ghz)
- 3gb memory
- 4gb swap

Development slave nodes configuration:
- 3x VCPU's (Xeon E5335 2ghz)
- 2gb memory
- 4gb swap

EC2 Configuration (Large instance type):
- 2x VCPU's (Opteron 2ghz)
- 8gb memory
- 4gb swap
- All nodes running in the same availability zone


| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



hdfs 100T?

2008-04-09 Thread Todd Troxell
Hello list,

I was unable to access the archives for this list as
http://hadoop.apache.org/mail/core-user/ returns 403.

I am interested in using HDFS for storage, and for map/reduce only
tangentially.  I see clusters mentioned in the docs with many many nodes and
9TB of disk.

Is HDFS expected to scale to > 100TB?

Does it require massive parallelism to scale to many files?  For instance, do you
think it would slow down drastically in a 2 node 32T config?

The workload is file serving at about 100 Mbit/s and 20 req/sec.

Any input is appreciated :)

-- 
Todd Troxell
http://rapidpacket.com/~xtat


Re: Hadoop performance on EC2?

2008-04-09 Thread Chris K Wensel

a few things..

make sure all nodes are running in the same 'availability zone',
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101

also, make sure each node is addressing its peers via the ec2 private  
addresses, not the public ones.


there is a patch in jira for the ec2/contrib scripts that addresses
these issues.

https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display
showing utilization on the machines. 8/7 map/reducers sounds like a lot.


ymmv

On Apr 9, 2008, at 7:07 PM, Nate Carlson wrote:

Hey all,

We've got a job that we're running in both a development  
environment, and out on EC2.  I've been rather displeased with the  
performance on EC2, and was curious if the results that we've been  
seeing are similar to other people's, or if I've got something  
misconfigured.  ;)  In both environments, the load on the master  
node is around 1-1.5, and the load on the slave nodes is around
8-10. I have also tried cranking up the JVM memory on the EC2 nodes  
(since we got RAM to blow), with very little performance difference.


Basically, the job takes about 3.5 hours in development, but takes
15 hours on EC2. The portion that takes all the time is not
dependent on any external hosts - just the MySQL server on the
master node.  I benchmarked the VCPUs between our dev environment and
EC2, and they are about equivalent. I would expect EC2 to take 1.5x
as long, since there is one less CPU per slave, but it's taking much
longer than that.


Appreciate any tips!

Similarities between the environments:
- 1 master node, 2 slave nodes
- 1 mapper and reducer on the master, 8 mappers and 7 reducers on the
 slaves
- Hadoop 0.16.2
- Local HDFS storage (we were using S3 on amazon before, and I switched to
  local storage)
- MySQL database running on the master node
- Xen VM's in both environments (our own Xen for dev, Amazon's for EC2)
- Debian Etch 64-bit OS; 64-bit JVM

Development master node configuration:
- 4x VCPU's (Xeon E5335 2ghz)
- 3gb memory
- 4gb swap

Development slave nodes configuration:
- 3x VCPU's (Xeon E5335 2ghz)
- 2gb memory
- 4gb swap

EC2 Configuration (Large instance type):
- 2x VCPU's (Opteron 2ghz)
- 8gb memory
- 4gb swap
- All nodes running in the same availability zone


| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/