Re: Does any one tried to build Hadoop..

2008-04-11 Thread Khalil Honsali
I now understand your problem; I replicated it.
If you load the build.xml from Eclipse and go to Properties > Build
Path > Libraries, you'll find a JRE_LIB entry; remove that one and add the
JRE System Library instead.
Hope it solves it.

On 12/04/2008, Khalil Honsali <[EMAIL PROTECTED]> wrote:
>
> my guess it's an import problem..
> how about changing 2) to version 6 for compiler version?
>
> On 12/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
> >
> > Java version
> > java version "1.6.0_05"
> > Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
> > Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
> >
> > Steps that i did :
> > 1) Opened a new java project in Eclipse. (From existing directory path).
> > 2) Modified Java compiler version as 5 in project properties in order
> > solve (source level 5 error).
> > 3) I found that package javax.net.SocketFactory is not resolved then i
> > downloaded that package and add to external jars.
> >
> > then i got error mentioned below.
> >
> >
> > Thanks & Regards,
> > Krishna
> >
> > - Original Message 
> >
> > From: Khalil Honsali <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> >
> > Sent: Friday, 11 April, 2008 6:54:46 PM
> > Subject: Re: Does any one tried to build Hadoop..
> >
> > what is your java version? also please describe exactly what you've done
> >
> > On 11/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
> > >
> > > I Tried in both ways i am still i am getting some errors
> > >
> > > --- import org.apache.tools.ant.BuildException; (error: cannot be
> > > resolved..)
> > > --- public Socket createSocket() throws IOException {
> > > --- s = socketFactory.createSocket(); (error:  incorrect parameters)
> > >
> > > earlier it failed to resolve this package (javax.net.SocketFactory;)
> > then
> > > i add that jar file in project.
> > >
> > > Thanks & Regards,
> > > Krishna.
> > >
> > >
> > > - Original Message 
> > > From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> > > To: [EMAIL PROTECTED]
> > > Sent: Thursday, 10 April, 2008 4:07:34 PM
> > > Subject: Re: Does any one tried to build Hadoop..
> > >
> > > At the root of the source and it's called build.xml
> > >
> > > Jean-Daniel
> > >
> > > 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> > > >
> > > > Mr. Jean-Daniel,
> > > >
> > > > where is the ant script please?
> > > >
> > > >
> > > > On 10/04/2008, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > The ANT script works well also.
> > > > >
> > > > > Jean-Daniel
> > > > >
> > > > > 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> > > > >
> > > > > >
> > > > > > Hi,
> > > > > > With eclise it's easy, you just have to add it as a new project,
> > > make
> > > > > sure
> > > > > > you add all libraries in folder lib and should compile fine
> > > > > > There is also an eclipse plugin for running hadoop jobs directly
> > > from
> > > > > > eclipse on an installed hadoop .
> > > > > >
> > > > > >
> > > > > > On 10/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Does any one tried to build Hadoop ?
> > > > > > >
> > > > > > > Thanks & Regards,
> > > > > > > Krishna.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>


--




Re: "could only be replicated to 0 nodes, instead of 1"

2008-04-11 Thread Raghu Angadi

jerrro wrote:


I couldn't find much information about this error, but I did manage to see
somewhere it might mean that there are no datanodes running. But as I said,
start-all does not give any errors. Any ideas what could be problem?


start-all returning does not mean the datanodes are OK. Did you check whether
any datanodes are alive? You can check from http://namenode:50070/.


Raghu.



Re: Hadoop performance on EC2?

2008-04-11 Thread Chris K Wensel

What does ganglia show for load and network?

You should also be able to see GC stats (count and time). Might help
as well.


FYI, running

> hadoop-ec2 proxy

will both set up a SOCKS tunnel and list the available URLs you can
cut/paste into your browser. One of the URLs is for the Ganglia interface.


On Apr 11, 2008, at 2:01 PM, Nate Carlson wrote:

On Wed, 9 Apr 2008, Chris K Wensel wrote:

make sure all nodes are running in the same 'availability zone', 
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347


check!


and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101


check!

also, make sure each node is addressing its peers via the ec2  
private addresses, not the public ones.


check!

there is a patch in jira for the ec2/contrib scripts that address  
these issues.

https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display  
showing utilization on the machines. 8/7 map/reducers sounds like  
alot.


Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor  
performance on EC2 compared to my development environment.  ;(


I'll be capturing some more extensive stats over the weekend, and  
see if I can glean anything useful...



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Hadoop performance on EC2?

2008-04-11 Thread Nate Carlson

On Wed, 9 Apr 2008, Chris K Wensel wrote:
make sure all nodes are running in the same 'availability zone', 
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347


check!


and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101


check!

also, make sure each node is addressing its peers via the ec2 private 
addresses, not the public ones.


check!

there is a patch in jira for the ec2/contrib scripts that address these 
issues.

https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display 
showing utilization on the machines. 8/7 map/reducers sounds like alot.


Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor performance on 
EC2 compared to my development environment.  ;(


I'll be capturing some more extensive stats over the weekend, and see if I 
can glean anything useful...



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



RE: Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread Devaraj Das
Which hadoop version are you on? 

> -Original Message-
> From: bhupesh bansal [mailto:[EMAIL PROTECTED] 
> Sent: Friday, April 11, 2008 11:21 PM
> To: [EMAIL PROTECTED]
> Subject: Mapper OutOfMemoryError Revisited !!
> 
> 
> Hi Guys, I need to restart discussion around 
> http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html
> 
>  I saw the same OOM error in my map-reduce job in the map phase. 
> 
> 1. I tried changing mapred.child.java.opts (bumped to 600M) 
> 2. io.sort.mb was kept at 100MB. 
> 
> I see the same errors still. 
> 
> I checked with debug the size of "keyValBuffer" in collect(), 
> that is always less than io.sort.mb and is spilled to disk properly.
> 
> I tried changing the map.task number to a very high number so 
> that the input is split into smaller chunks. It helps for a 
> while as the map job went a bit far (56% from 5%) but still 
> see the problem.
> 
>  I tried bumping mapred.child.java.opts to 1000M , still got 
> the same error. 
> 
> I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] 
> value in opts to get the gc.log but didnt got any log??
> 
>  I tried using 'jmap -histo pid' to see the heap information, 
> it didnt gave me any meaningful or obvious problem point. 
> 
> What are the other possible memory hog during mapper phase ?? 
> Is the input file chunk kept fully in memory ?? 
> 
> Application: 
> 
> My map-reduce job is running with about 2G of input. in the 
> Mapper phase I read each line and output [5-500] (key,value) 
> pair. so the intermediate data should be really blown up.  
> will that be a problem. 
> 
> The Error file is attached
> http://www.nabble.com/file/p16628181/error.txt error.txt
> --
> View this message in context: 
> http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21
> -tp16628181p16628181.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
> 
> 



Re: RE: Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Pete Wyckoff

Yes, and as such we've found better load balancing when the number of reducers
is a prime number, although String.hashCode isn't great for short strings.


On 4/11/08 4:16 AM, "Zhang, jian" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Please read this, you need to implement partitioner.
> It controls which key is sent to which reducer, if u want to get unique key
> result, you need to implement partitioner and the compareTO function should
> work properly. 
> [WIKI]
> Partitioner
> 
> Partitioner partitions the key space.
> 
> Partitioner controls the partitioning of the keys of the intermediate
> map-outputs. The key (or a subset of the key) is used to derive the partition,
> typically by a hash function. The total number of partitions is the same as
> the number of reduce tasks for the job. Hence this controls which of the m
> reduce tasks the intermediate key (and hence the record) is sent to for
> reduction.
> 
> HashPartitioner is the default Partitioner.
> 
> 
> 
> Best Regards
> 
> Jian Zhang
> 
> 
> -----Original Message-----
> From: Harish Mallipeddi [mailto:[EMAIL PROTECTED]
> Sent: 11 April 2008 19:06
> To: core-user@hadoop.apache.org
> Subject: Problem with key aggregation when number of reduce tasks is more than 1
> 
> Hi all,
> 
> I wrote a custom key class (implements WritableComparable) and implemented
> the compareTo() method inside this class. Everything works fine when I run
> the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
> correctly in the output files.
> 
> But when I increase the number of reduce tasks, keys don't get aggregated
> properly; same keys seem to end up in separate output files
> (output/part-0, output/part-1, etc). This should not happen because
> right before reduce() gets called, all (k,v) pairs from all map outputs with
> the same 'k' are aggregated and the reduce function just iterates over the
> values (v1, v2, etc)?
> 
> Do I need to implement anything else inside my custom key class other than
> compareTo? I also tried implementing equals() but that didn't help either.
> Then I came across setOutputKeyComparator(). So I added a custom Comparator
> class inside the key class and tried setting this on the JobConf object. But
> that didn't work either. What could be wrong?
> 
> Cheers,



Re: Using NFS without HDFS

2008-04-11 Thread slitz
Thank you for the file:/// tip; I was not including it in the paths.
I'm running the example with this line -> bin/hadoop jar
hadoop-*-examples.jar grep file:///home/slitz/warehouse/input
file:///home/slitz/warehouse/output 'dfs[a-z.]+'

But I'm getting the same error as before:

org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist :
/home/slitz/hadoop-0.15.3/grep-temp-1030179831
at
org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...stack continues...)

I think the problem may be the input path; it should be pointing to some
path in the NFS share, right?

The grep-temp-* dir is being created in the HADOOP_HOME of Box A (192.168.2.3).

slitz

On Fri, Apr 11, 2008 at 4:06 PM, Luca <[EMAIL PROTECTED]> wrote:

> slitz wrote:
>
> > I've read in the archive that it should be possible to use any
> > distributed
> > filesystem since the data is available to all nodes, so it should be
> > possible to use NFS, right?
> > I've also read somewere in the archive that this shoud be possible...
> >
> >
> As far as I know, you can refer to any file on a mounted file system
> (visible from all compute nodes) using the prefix file:// before the full
> path, unless another prefix has been specified.
>
> Cheers,
> Luca
>
>
>
> > slitz
> >
> >
> > On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]
> > >
> > wrote:
> >
> >  Hello ,
> > >
> > > To execute Hadoop Map-Reduce job input data should be on HDFS not on
> > > NFS.
> > >
> > > Thanks
> > >
> > > ---
> > > Peeyush
> > >
> > >
> > >
> > > On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
> > >
> > >  Hello,
> > > > I'm trying to assemble a simple setup of 3 nodes using NFS as
> > > >
> > > Distributed
> > >
> > > > Filesystem.
> > > >
> > > > Box A: 192.168.2.3, this box is either the NFS server and working as
> > > > a
> > > >
> > > slave
> > >
> > > > node
> > > > Box B: 192.168.2.30, this box is only JobTracker
> > > > Box C: 192.168.2.31, this box is only slave
> > > >
> > > > Obviously all three nodes can access the NFS shared, and the path to
> > > > the
> > > > share is /home/slitz/warehouse in all three.
> > > >
> > > > My hadoop-site.xml file were copied over all nodes and looks like
> > > > this:
> > > >
> > > > 
> > > >
> > > > 
> > > >
> > > > fs.default.name
> > > >
> > > >  local
> > > >
> > > > 
> > > >
> > > >  The name of the default file system. Either the literal string
> > > >
> > > > "local" or a host:port for NDFS.
> > > >
> > > >  
> > > >
> > > > 
> > > >
> > > >  
> > > >
> > > > mapred.job.tracker
> > > >
> > > >  192.168.2.30:9001
> > > >
> > > > 
> > > >
> > > >  The host and port that the MapReduce job
> > > >
> > > > tracker runs at. If "local", then jobs are
> > > >
> > > >  run in-process as a single map and reduce task.
> > > >
> > > > 
> > > >
> > > >  
> > > >
> > > > 
> > > >
> > > > mapred.system.dir
> > > >
> > > >  /home/slitz/warehouse/hadoop_service/system
> > > >
> > > > omgrotfcopterlol.
> > > >
> > > >  
> > > >
> > > > 
> > > >
> > > >
> > > > As one can see, i'm not using HDFS at all.
> > > > (Because all the free space i have is located in only one node, so
> > > > using
> > > > HDFS would be unnecessary overhead)
> > > >
> > > > I've copied the input folder from hadoop to
> > > > /home/slitz/warehouse/input.
> > > > When i try to run the example line
> > > >
> > > > bin/hadoop jar hadoop-*-examples.jar grep
> > > > /home/slitz/warehouse/input/
> > > > /home/slitz/warehouse/output 'dfs[a-z.]+'
> > > >
> > > > the job starts and finish okay but at the end i get this error:
> > > >
> > > > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't
> > > > exist
> > > >
> > > :
> > >
> > > > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > > > at
> > > >
> > > > org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > >
> > > > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > > > (...the error stack continues...)
> > > >
> > > > i don't know why the input path being looked is in the local path
> > > > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> > > >
> > > > Maybe something is missing in my hadoop-site.xml?
> > > >
> > > >
> > > >
> > > > slitz
> > > >
> > >
> >
>
>


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Ted Dunning

Just call addInputFile multiple times after filtering.  (or is it
addInputPath... Don't have documentation handy)
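
For example, a minimal sketch of that approach (the FileSystem.listStatus call
and the 0.16-era JobConf.addInputPath method are assumptions about the API
version in use; newer releases add inputs via FileInputFormat.addInputPath(conf, path)):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class FilteredInputSetup {
  // List the input folder once, keep only files whose names start with the
  // given prefix, and register each surviving file as its own input path.
  public static void addMatchingInputs(JobConf conf, Path inputDir, String prefix)
      throws IOException {
    FileSystem fs = inputDir.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (status.getPath().getName().startsWith(prefix)) {
        conf.addInputPath(status.getPath());
      }
    }
  }
}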


On 4/11/08 6:33 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> Hi
> I have a general purpose input folder that it is used as input in a
> Map/Reduce task. That folder contains files grouped by names.
> 
> I want to configure the JobConf in a way I can filter the files that
> have to be processed from that pass (ie  files which name starts by
> Elementary, or Source etc)  So the task function will only process
> those files.  So if the folder contains 1000 files and only 50 start
> by Elementary. Only those 50 will be processed by my task.
> 
> I could set up different input folders and those containing the
> different files, but I cannot do that.
> 
> 
> Any idea?
> 
> thanks



Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi Guys, I need to restart discussion around 
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

 I saw the same OOM error in my map-reduce job in the map phase. 

1. I tried changing mapred.child.java.opts (bumped to 600M) 
2. io.sort.mb was kept at 100MB. 

I see the same errors still. 

I checked with debugging the size of "keyValBuffer" in collect(); it is always
less than io.sort.mb and is spilled to disk properly.

I tried changing the map.task number to a very high number so that the input
is split into smaller chunks. It helps for a while, as the map job went a bit
further (56% from 5%), but I still see the problem.

I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] value in the
opts to get the gc.log, but didn't get any log.

I tried using 'jmap -histo pid' to see the heap information; it didn't give me
any meaningful or obvious problem point.

What are the other possible memory hogs during the mapper phase? Is the input
file chunk kept fully in memory?

Application:

My map-reduce job is running with about 2G of input. In the mapper phase I
read each line and output [5-500] (key, value) pairs, so the intermediate data
is really blown up. Will that be a problem?

The Error file is attached
http://www.nabble.com/file/p16628181/error.txt error.txt 
-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628181p16628181.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Does any one tried to build Hadoop..

2008-04-11 Thread krishna prasanna
Java version 
java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)

Steps that I did:
1) Opened a new Java project in Eclipse (from an existing directory path).
2) Changed the Java compiler version to 5 in the project properties in order to
solve the "source level 5" error.
3) I found that the package javax.net.SocketFactory was not resolved, so I
downloaded that package and added it to the external JARs.

Then I got the error mentioned below.

Thanks & Regards,
Krishna

- Original Message 
From: Khalil Honsali <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, 11 April, 2008 6:54:46 PM
Subject: Re: Does any one tried to build Hadoop..

what is your java version? also please describe exactly what you've done

On 11/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
>
> I Tried in both ways i am still i am getting some errors
>
> --- import org.apache.tools.ant.BuildException; (error: cannot be
> resolved..)
> --- public Socket createSocket() throws IOException {
> --- s = socketFactory.createSocket(); (error:  incorrect parameters)
>
> earlier it failed to resolve this package (javax.net.SocketFactory;) then
> i add that jar file in project.
>
> Thanks & Regards,
> Krishna.
>
>
> - Original Message 
> From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, 10 April, 2008 4:07:34 PM
> Subject: Re: Does any one tried to build Hadoop..
>
> At the root of the source and it's called build.xml
>
> Jean-Daniel
>
> 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> >
> > Mr. Jean-Daniel,
> >
> > where is the ant script please?
> >
> >
> > On 10/04/2008, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > >
> > > The ANT script works well also.
> > >
> > > Jean-Daniel
> > >
> > > 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> > >
> > > >
> > > > Hi,
> > > > With eclise it's easy, you just have to add it as a new project,
> make
> > > sure
> > > > you add all libraries in folder lib and should compile fine
> > > > There is also an eclipse plugin for running hadoop jobs directly
> from
> > > > eclipse on an installed hadoop .
> > > >
> > > >
> > > > On 10/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >
> > > > > Does any one tried to build Hadoop ?
> > > > >
> > > > > Thanks & Regards,
> > > > > Krishna.
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> >
> > --
> >
>
>
>







Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi Guys, 

I need to restart discussion around 
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

I saw the same OOM error in my map-reduce job in the map phase.

1. I tried changing mapred.child.java.opts (bumped to 600M)
2. io.sort.mb was kept at 100MB.
I see the same errors still.
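
(For reference, a minimal sketch of setting the two knobs mentioned above from
job-setup code, using only the values quoted in this thread:)

import org.apache.hadoop.mapred.JobConf;

public class MemoryTuning {
  // Apply the per-task memory settings discussed above to a job configuration.
  public static void apply(JobConf conf) {
    conf.set("mapred.child.java.opts", "-Xmx600m"); // heap for each child task JVM
    conf.setInt("io.sort.mb", 100);                 // MB of buffer used to sort map output
  }
}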

I checked with debugging the size of "keyValBuffer" in collect(); it is always
less than io.sort.mb and is spilled to disk properly.

I tried changing the map.task number to a very high number so that the input
is split into smaller chunks. It helps for a while, as the map job went a bit
further (56% from 5%), but I still see the problem.

I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] value in the
opts to get the gc.log, but didn't get any log.

I tried using 'jmap -histo pid' to see the heap information; it didn't give me
any meaningful or obvious problem point.


What are the other possible memory hogs during the mapper phase? Is the input
file chunk kept fully in memory?


task_200804110926_0004_m_000239_0: java.lang.OutOfMemoryError: Java heap space
task_200804110926_0004_m_000239_0:  at java.util.Arrays.copyOf(Arrays.java:2786)
task_200804110926_0004_m_000239_0:  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.write(DataOutputStream.java:90)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:384)
task_200804110926_0004_m_000239_0:  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.DataObjects.SearchTrackingJoinValue.write(SearchTrackingJoinValue.java:117)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:350)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.readSearchJoinResultsObject(SearchClickJoinMapper.java:131)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.map(SearchClickJoinMapper.java:54)
task_200804110926_0004_m_000239_0:  at com.linkedin.Hadoop.Mapper.SearchClickJoinMapper.map(SearchClickJoinMapper.java:31)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
task_200804110926_0004_m_000239_0:  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804)


-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628173p16628173.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Arun C Murthy


On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:

A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,  
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on  
PathFilter interface.


+1, although FileInputFormat.setInputPathFilter is available only in  
hadoop-0.17 and above... like Amar mentioned previously, you'd have  
to have a custom InputFormat prior to hadoop-0.17.


Arun


Amar
Alfonso Olias Sanz wrote:

Hi
I have a general purpose input folder that it is used as input in a
Map/Reduce task. That folder contains files grouped by names.

I want to configure the JobConf in a way I can filter the files that
have to be processed from that pass (ie  files which name starts by
Elementary, or Source etc)  So the task function will only process
those files.  So if the folder contains 1000 files and only 50 start
by Elementary. Only those 50 will be processed by my task.

I could set up different input folders and those containing the
different files, but I cannot do that.


Any idea?

thanks







Re: mailing list archive broken?

2008-04-11 Thread Nathan Fiedler
Yes, it's been like that for days. Hopefully someone in Apache can fix
it. In the meantime, you can use the Nabble site:
http://www.nabble.com/Hadoop-core-user-f30590.html

n


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf, 
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on 
PathFilter interface.
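
For example, a minimal sketch against that 0.17+ API (the "Elementary" prefix
is just the example from the original question):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only input files whose names start with "Elementary".
public class ElementaryFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().startsWith("Elementary");
  }
}

It would then be registered during job setup with
FileInputFormat.setInputPathFilter(conf, ElementaryFilter.class).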

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general purpose input folder that it is used as input in a
Map/Reduce task. That folder contains files grouped by names.

I want to configure the JobConf in a way I can filter the files that
have to be processed from that pass (ie  files which name starts by
Elementary, or Source etc)  So the task function will only process
those files.  So if the folder contains 1000 files and only 50 start
by Elementary. Only those 50 will be processed by my task.

I could set up different input folders and those containing the
different files, but I cannot do that.


Any idea?

thanks
  




Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
One way to do this is to write your own (file) input format. See 
src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to 
override listPaths() in order to have selectivity amongst the files in 
the input folder.
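
A rough sketch of that idea is below; the listPaths() signature shown is
assumed from the pre-0.17 API being described here and may differ between
releases, so treat it as an outline rather than a drop-in class:

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Input format that keeps only files whose names start with "Elementary".
public class ElementaryInputFormat extends TextInputFormat {
  protected Path[] listPaths(JobConf job) throws IOException {
    ArrayList<Path> kept = new ArrayList<Path>();
    for (Path p : super.listPaths(job)) {      // start from the default listing
      if (p.getName().startsWith("Elementary")) {
        kept.add(p);
      }
    }
    return kept.toArray(new Path[kept.size()]);
  }
}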

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general purpose input folder that it is used as input in a
Map/Reduce task. That folder contains files grouped by names.

I want to configure the JobConf in a way I can filter the files that
have to be processed from that pass (ie  files which name starts by
Elementary, or Source etc)  So the task function will only process
those files.  So if the folder contains 1000 files and only 50 start
by Elementary. Only those 50 will be processed by my task.

I could set up different input folders and those containing the
different files, but I cannot do that.


Any idea?

thanks
  




MiniDFSCluster error on windows.

2008-04-11 Thread Edward J. Yoon
It occurs only on Windows systems (Cygwin).
Does anyone have a solution?


Testcase: testCosine took 0.708 sec
Caused an ERROR
Address family not supported by protocol family: bind
java.net.SocketException: Address family not supported by protocol family: bind
at sun.nio.ch.Net.bind(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
at org.apache.hadoop.ipc.Server.bind(Server.java:182)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:243)
at org.apache.hadoop.ipc.Server.<init>(Server.java:963)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:393)
at org.apache.hadoop.ipc.RPC.getServer(RPC.java:355)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:122)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:177)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:163)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:866)
at org.apache.hadoop.dfs.MiniDFSCluster.<init>(MiniDFSCluster.java:264)
at org.apache.hadoop.dfs.MiniDFSCluster.<init>(MiniDFSCluster.java:113)

-- 
B. Regards,
Edward J. Yoon


Re: Using NFS without HDFS

2008-04-11 Thread Luca

slitz wrote:

I've read in the archive that it should be possible to use any distributed
filesystem since the data is available to all nodes, so it should be
possible to use NFS, right?
I've also read somewere in the archive that this shoud be possible...



As far as I know, you can refer to any file on a mounted file system 
(visible from all compute nodes) using the prefix file:// before the 
full path, unless another prefix has been specified.


Cheers,
Luca



slitz


On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:


Hello ,

To execute Hadoop Map-Reduce job input data should be on HDFS not on
NFS.

Thanks

---
Peeyush



On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:


Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as

Distributed

Filesystem.

Box A: 192.168.2.3, this box is either the NFS server and working as a

slave

node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave

Obviously all three nodes can access the NFS shared, and the path to the
share is /home/slitz/warehouse in all three.

My hadoop-site.xml file were copied over all nodes and looks like this:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
    <description>The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.2.30:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/slitz/warehouse/hadoop_service/system</value>
    <description>omgrotfcopterlol.</description>
  </property>
</configuration>


As one can see, i'm not using HDFS at all.
(Because all the free space i have is located in only one node, so using
HDFS would be unnecessary overhead)

I've copied the input folder from hadoop to /home/slitz/warehouse/input.
When i try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finish okay but at the end i get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist

:

/home/slitz/hadoop-0.15.3/grep-temp-141595661
at


org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

i don't know why the input path being looked is in the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)

Maybe something is missing in my hadoop-site.xml?



slitz







Re: Hadoop performance on EC2?

2008-04-11 Thread Nate Carlson

On Thu, 10 Apr 2008, Ted Dziuba wrote:
I have seen EC2 be slower than a comparable system in development, but 
not by the factors that you're experiencing.  One thing about EC2 that 
has concerned me - you are not guaranteed that your "/mnt" disk is an 
uncontested spindle. Early on, this was the case, but Amazon made no 
promises.


Interesting! My understand was that it was. We were using S3 for storage 
before, and switched to HDFS, and saw similar performance on both for our 
needs.. we're more CPU intensive than I/O intensive.


Also, and this may be a stupid question, are you sure that you're using 
the same JVM in EC2 and development?  GCJ is much slower than Sun's JVM.


Yeah - our code actually requires Sun's Java6u5 JVM.. it won't run on gcj. 
;)



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



Re: Using NFS without HDFS

2008-04-11 Thread Owen O'Malley


On Apr 11, 2008, at 7:43 AM, slitz wrote:

I've read in the archive that it should be possible to use any  
distributed

filesystem since the data is available to all nodes, so it should be
possible to use NFS, right?
I've also read somewere in the archive that this shoud be possible...


It is possible. The performance will be much lower on large clusters,  
but it will work. Just use file:///path/to/my/data/input as the input  
path. It also works for output paths. Note that this assumes that the  
nfs file system has consistent names across the cluster.
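
A small sketch of what that looks like in job-setup code, using the paths from
this thread; the JobConf.setInputPath/setOutputPath calls are the pre-0.19 API,
and later releases use FileInputFormat.setInputPaths and
FileOutputFormat.setOutputPath instead:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class LocalFsJobSetup {
  // Point a job at directories on a mounted (e.g. NFS) file system
  // by using file:// URIs instead of HDFS paths.
  public static void setPaths(JobConf conf) {
    conf.setInputPath(new Path("file:///home/slitz/warehouse/input"));
    conf.setOutputPath(new Path("file:///home/slitz/warehouse/output"));
  }
}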


-- Owen


Re: Using NFS without HDFS

2008-04-11 Thread slitz
I've read in the archive that it should be possible to use any distributed
filesystem since the data is available to all nodes, so it should be
possible to use NFS, right?
I've also read somewere in the archive that this shoud be possible...


slitz


On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:

> Hello ,
>
> To execute Hadoop Map-Reduce job input data should be on HDFS not on
> NFS.
>
> Thanks
>
> ---
> Peeyush
>
>
>
> On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
>
> > Hello,
> > I'm trying to assemble a simple setup of 3 nodes using NFS as
> Distributed
> > Filesystem.
> >
> > Box A: 192.168.2.3, this box is either the NFS server and working as a
> slave
> > node
> > Box B: 192.168.2.30, this box is only JobTracker
> > Box C: 192.168.2.31, this box is only slave
> >
> > Obviously all three nodes can access the NFS shared, and the path to the
> > share is /home/slitz/warehouse in all three.
> >
> > My hadoop-site.xml file were copied over all nodes and looks like this:
> >
> > <configuration>
> >   <property>
> >     <name>fs.default.name</name>
> >     <value>local</value>
> >     <description>The name of the default file system. Either the literal string
> >     "local" or a host:port for NDFS.</description>
> >   </property>
> >   <property>
> >     <name>mapred.job.tracker</name>
> >     <value>192.168.2.30:9001</value>
> >     <description>The host and port that the MapReduce job tracker runs at.
> >     If "local", then jobs are run in-process as a single map and reduce task.</description>
> >   </property>
> >   <property>
> >     <name>mapred.system.dir</name>
> >     <value>/home/slitz/warehouse/hadoop_service/system</value>
> >     <description>omgrotfcopterlol.</description>
> >   </property>
> > </configuration>
> >
> >
> > As one can see, i'm not using HDFS at all.
> > (Because all the free space i have is located in only one node, so using
> > HDFS would be unnecessary overhead)
> >
> > I've copied the input folder from hadoop to /home/slitz/warehouse/input.
> > When i try to run the example line
> >
> > bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> > /home/slitz/warehouse/output 'dfs[a-z.]+'
> >
> > the job starts and finish okay but at the end i get this error:
> >
> > org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist
> :
> > /home/slitz/hadoop-0.15.3/grep-temp-141595661
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> > (...the error stack continues...)
> >
> > i don't know why the input path being looked is in the local path
> > /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> >
> > Maybe something is missing in my hadoop-site.xml?
> >
> >
> >
> > slitz
>


[HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Alfonso Olias Sanz
Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf in a way that lets me filter the files that
have to be processed in that pass (i.e. files whose names start with
Elementary, Source, etc.), so the task function will only process those
files. So if the folder contains 1000 files and only 50 start with
Elementary, only those 50 will be processed by my task.

I could set up different input folders, each containing the different
files, but I cannot do that.


Any idea?

thanks


Re: Does any one tried to build Hadoop..

2008-04-11 Thread Khalil Honsali
What is your Java version? Also, please describe exactly what you've done.

On 11/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
>
> I Tried in both ways i am still i am getting some errors
>
> --- import org.apache.tools.ant.BuildException; (error: cannot be
> resolved..)
> --- public Socket createSocket() throws IOException {
> --- s = socketFactory.createSocket(); (error:  incorrect parameters)
>
> earlier it failed to resolve this package (javax.net.SocketFactory;) then
> i add that jar file in project.
>
> Thanks & Regards,
> Krishna.
>
>
> - Original Message 
> From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, 10 April, 2008 4:07:34 PM
> Subject: Re: Does any one tried to build Hadoop..
>
> At the root of the source and it's called build.xml
>
> Jean-Daniel
>
> 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> >
> > Mr. Jean-Daniel,
> >
> > where is the ant script please?
> >
> >
> > On 10/04/2008, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> > >
> > > The ANT script works well also.
> > >
> > > Jean-Daniel
> > >
> > > 2008/4/9, Khalil Honsali <[EMAIL PROTECTED]>:
> > >
> > > >
> > > > Hi,
> > > > With eclise it's easy, you just have to add it as a new project,
> make
> > > sure
> > > > you add all libraries in folder lib and should compile fine
> > > > There is also an eclipse plugin for running hadoop jobs directly
> from
> > > > eclipse on an installed hadoop .
> > > >
> > > >
> > > > On 10/04/2008, krishna prasanna <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >
> > > > > Does any one tried to build Hadoop ?
> > > > >
> > > > > Thanks & Regards,
> > > > > Krishna.
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> >
> > --
> >
>
>
>


RE: What's the proper way to use hadoop task side-effect files?

2008-04-11 Thread Runping Qi


Looks like you use your reducer class as the combiner.
The combiner will be called from mappers, potentially multiple times.

If you want to create side files in the reducer, you cannot use that class
as the combiner.

Runping
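
In job-setup terms that means something like the sketch below; the only point
is that the combiner is registered as its own class, not as the reducer that
opens SideFile.txt:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;

public class SideFileJobSetup {
  // Wire the job so the combiner is a separate class; only genuine reduce
  // tasks will then run the reducer code that creates the side file.
  public static void wire(JobConf conf,
                          Class<? extends Mapper> mapperClass,
                          Class<? extends Reducer> combinerClass,
                          Class<? extends Reducer> reducerClass) {
    conf.setMapperClass(mapperClass);
    conf.setCombinerClass(combinerClass); // not the side-file-creating reducer
    conf.setReducerClass(reducerClass);
  }
}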


> -Original Message-
> From: Zhang, jian [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 10, 2008 11:17 PM
> To: core-user@hadoop.apache.org
> Subject: What's the proper way to use hadoop task side-effect files?
> 
> Hi,
> 
> I was new to hadoop. Sorry for my novice question.
> I got some problem while I was trying to use task side-effect files.
> Since there is no code example in wiki, I tried this way:
> 
> I override cofigure method in reducer to create a side file,
> 
>  public void configure(JobConf conf){
>  logger.info("Tring to create sideFiles inside reducer.!");
> 
>  Path workpath=conf.getOutputPath();
>  Path sideFile= new Path(workpath,"SideFile.txt");
>  try {
>FileSystem fs = FileSystem.get(conf);
>out= fs.create(sideFile);
>  } catch (IOException e) {
>logger.error("Failed to create sidefile!");
>  }
>  }
> And try to use it in reducer.
> 
> But I got some strange problems,
> Even If the method is in reducer Class, mapper tasks are creating the
> side files.
> Mapper tasks hang because there are tring to recreate the file.
> 
> org.apache.hadoop.dfs.AlreadyBeingCreatedException:
>  failed to create file
>
/data/input/MID06/_temporary/_task_200804112315_0001_m_08_0/SideFile
> .txt for DFSClient_task_200804112315_0001_m_08_0 on client
> 192.168.0.203 because current leaseholder is trying to recreate file.
>  at
>
org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:9
> 74)
>  at
> org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931)
>  at org.apache.hadoop.dfs.NameNode.create(NameNode.java:281)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
>  at
>
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
> a:39)
>  at
>
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
> Impl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:585)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:899)
> 
> 
> Can anybody help me on this, how to use side-effect files?
> 
> 
> 
> Best Regards
> 
> Jian Zhang



Re: Using NFS without HDFS

2008-04-11 Thread Peeyush Bishnoi
Hello ,

To execute Hadoop Map-Reduce job input data should be on HDFS not on
NFS. 

Thanks

---
Peeyush



On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:

> Hello,
> I'm trying to assemble a simple setup of 3 nodes using NFS as Distributed
> Filesystem.
> 
> Box A: 192.168.2.3, this box is either the NFS server and working as a slave
> node
> Box B: 192.168.2.30, this box is only JobTracker
> Box C: 192.168.2.31, this box is only slave
> 
> Obviously all three nodes can access the NFS shared, and the path to the
> share is /home/slitz/warehouse in all three.
> 
> My hadoop-site.xml file were copied over all nodes and looks like this:
> 
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>local</value>
>     <description>The name of the default file system. Either the literal string
>     "local" or a host:port for NDFS.</description>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>192.168.2.30:9001</value>
>     <description>The host and port that the MapReduce job tracker runs at.
>     If "local", then jobs are run in-process as a single map and reduce task.</description>
>   </property>
>   <property>
>     <name>mapred.system.dir</name>
>     <value>/home/slitz/warehouse/hadoop_service/system</value>
>     <description>omgrotfcopterlol.</description>
>   </property>
> </configuration>
> 
> 
> As one can see, i'm not using HDFS at all.
> (Because all the free space i have is located in only one node, so using
> HDFS would be unnecessary overhead)
> 
> I've copied the input folder from hadoop to /home/slitz/warehouse/input.
> When i try to run the example line
> 
> bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
> /home/slitz/warehouse/output 'dfs[a-z.]+'
> 
> the job starts and finish okay but at the end i get this error:
> 
> org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
> /home/slitz/hadoop-0.15.3/grep-temp-141595661
> at
> org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
> (...the error stack continues...)
> 
> i don't know why the input path being looked is in the local path
> /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
> 
> Maybe something is missing in my hadoop-site.xml?
> 
> 
> 
> slitz


Hadoop performance in PC cluster

2008-04-11 Thread Yingyuan Cheng

Does anyone run Hadoop on a PC cluster?

I just tested WordCount on a PC cluster, and my first impressions are as follows:

***

Number of PCs: 7(512M RAM, 2.8G CPU, 100M NIC, CentOS 5.0, Handoop
0.16.1, Sun jre 1.6)
Master(Namenode): 1
Master(Jobtracker): 1
Slaves(Datanode & Tasktracker): 5

1. Writing to HDFS
--

File size: 4,295,341,065 bytes(4.1G)
Time elapsed putting file into HDFS: 7m57.757s
Average rate: 8,990,583 bytes/sec
Average bandwidth usage: 68.59%

I also tested libhdfs, it's just as fine as java.


2. Map/Reduce with Java
--

Time elapsed: 19mins, 56sec
Bytes/time rate: 3,591,422 bytes/sec

Job Counters:
Launched map tasks 67
Launched reduce tasks 7
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,923,360
Map input bytes 4,295,341,065
Map output bytes 6,504,944,565
Combine input records 697,923,360
Combine output records 2,330,048
Reduce input groups 5,201
Reduce input records 2,330,048
Reduce output records 5,201

It's acceptable. The main bottleneck was CPU, keeping 100% usage.


3. Map/Reduce with C++ Pipe(No combiner)
--

Time elapsed: 1hrs, 2mins, 47sec
Bytes/time rate: 1,140,255 bytes/sec

Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191

My first impression is that the C++ pipes interface is slower than Java. If I
add a C++ pipes combiner, the result becomes even worse: the main bottleneck is
RAM, with a great deal of swap space used, processes blocked, and the CPU kept
waiting...

Adding more RAM may improve performance, but it would still be slower than
Java, I think.


4. Map/Reduce with Python streaming(No combiner)
--

Time elapsed: 1hrs, 48mins, 53sec
Bytes/time rate: 657,483 bytes/sec

Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191

As you can see, the result is not as good as with the C++ pipes interface.
Maybe Python is slower; I didn't test other cases.

Are there any suggestions for improving this situation?



--
yingyuan



Using NFS without HDFS

2008-04-11 Thread slitz
Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as Distributed
Filesystem.

Box A: 192.168.2.3, this box is either the NFS server and working as a slave
node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave

Obviously all three nodes can access the NFS shared, and the path to the
share is /home/slitz/warehouse in all three.

My hadoop-site.xml file was copied to all nodes and looks like this:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
    <description>The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.2.30:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/slitz/warehouse/hadoop_service/system</value>
    <description>omgrotfcopterlol.</description>
  </property>
</configuration>


As one can see, I'm not using HDFS at all.
(Because all the free space I have is located on only one node, using
HDFS would be unnecessary overhead.)

I've copied the input folder from Hadoop to /home/slitz/warehouse/input.
When I try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finishes okay, but at the end I get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-141595661
at
org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

I don't know why the input path being looked up is under the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)

Maybe something is missing in my hadoop-site.xml?



slitz


mailing list archive broken?

2008-04-11 Thread Adrian Woodhead

I've noticed that the mailing list archives seem to be broken here:

http://hadoop.apache.org/mail/core-user/

I get a 403 forbidden. Any idea what's going on?

Regards,

Adrian



RE: Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Zhang, jian
Hi,

Please read this: you need to implement a Partitioner.
It controls which key is sent to which reducer. If you want each key to end up
with a single reducer, you need to implement a Partitioner, and the compareTo
function should work properly.
[WIKI]
Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate 
map-outputs. The key (or a subset of the key) is used to derive the partition, 
typically by a hash function. The total number of partitions is the same as the 
number of reduce tasks for the job. Hence this controls which of the m reduce 
tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
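
A hedged sketch of such a partitioner follows; MyKey, MyValue and getId() are
hypothetical stand-ins for the custom key/value classes, and the generic
mapred.Partitioner interface shown is the one in later 0.1x releases.
Partitioning on the same fields that compareTo() compares guarantees that
equal keys meet in a single reduce.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// MyKey/MyValue are hypothetical stand-ins for the custom key and value types.
public class MyKeyPartitioner implements Partitioner<MyKey, MyValue> {
  public void configure(JobConf job) { }

  public int getPartition(MyKey key, MyValue value, int numReduceTasks) {
    // Hash the same field(s) compareTo() uses; mask the sign bit so the
    // partition index is never negative.
    return (key.getId().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would be registered with conf.setPartitionerClass(MyKeyPartitioner.class).
Equivalently, overriding hashCode() in the key class so that keys which compare
equal also hash equal lets the default HashPartitioner route them to the same
reducer.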



Best Regards

Jian Zhang


-----Original Message-----
From: Harish Mallipeddi [mailto:[EMAIL PROTECTED]
Sent: 11 April 2008 19:06
To: core-user@hadoop.apache.org
Subject: Problem with key aggregation when number of reduce tasks is more than 1

Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; same keys seem to end up in separate output files
(output/part-0, output/part-1, etc). This should not happen because
right before reduce() gets called, all (k,v) pairs from all map outputs with
the same 'k' are aggregated and the reduce function just iterates over the
values (v1, v2, etc)?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Harish Mallipeddi
Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; the same keys seem to end up in separate output files
(output/part-0, output/part-1, etc.). This should not happen because,
right before reduce() gets called, all (k,v) pairs from all map outputs with
the same 'k' should be aggregated, and the reduce function just iterates over
the values (v1, v2, etc.), right?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/