Re: Hadoop performance on EC2?

2008-04-11 Thread Ted Dziuba
I have seen EC2 be slower than a comparable system in development, but 
not by the factors that you're experiencing.  One thing about EC2 that 
has concerned me - you are not guaranteed that your /mnt disk is an 
uncontested spindle.  Early on, this was the case, but Amazon made no 
promises.


Also, and this may be a stupid question, are you sure that you're using 
the same JVM in EC2 and development?  GCJ is much slower than Sun's JVM.


Ted

Nate Carlson wrote:

On Thu, 10 Apr 2008, Ted Dunning wrote:

Are you trying to read from mySQL?


No, we're outputting to MySQL. I've also verified that the MySQL 
server is hardly seeing any load, isn't waiting on slow queries, etc.


If so, it isn't very surprising that you could get lower performance 
with more readers.


Indeed!


| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|





What's the proper way to use hadoop task side-effect files?

2008-04-11 Thread Zhang, jian
Hi,

I am new to Hadoop, so please excuse the novice question.
I ran into a problem while trying to use task side-effect files.
Since there is no code example in the wiki, I tried it this way:

I overrode the configure method in my reducer to create a side file:

 public void configure(JobConf conf){
     logger.info("Trying to create side files inside reducer!");

     Path workpath = conf.getOutputPath();
     Path sideFile = new Path(workpath, "SideFile.txt");
     try {
       FileSystem fs = FileSystem.get(conf);
       out = fs.create(sideFile);
     } catch (IOException e) {
       logger.error("Failed to create side file!");
     }
 }
and then tried to use it in the reducer.

But I got some strange problems:
Even though the method is in the reducer class, mapper tasks are creating the
side files.
The mapper tasks hang because they are trying to recreate the file.

org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file
/data/input/MID06/_temporary/_task_200804112315_0001_m_08_0/SideFile.txt
for DFSClient_task_200804112315_0001_m_08_0 on client 192.168.0.203
because current leaseholder is trying to recreate file.
 at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:974)
 at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931)
 at org.apache.hadoop.dfs.NameNode.create(NameNode.java:281)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:899)


Can anybody help me with this? What is the proper way to use side-effect files?



Best Regards

Jian Zhang



Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Harish Mallipeddi
Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; the same key seems to end up in separate output files
(output/part-0, output/part-1, etc.). This should not happen, because
right before reduce() gets called, all (k,v) pairs from all map outputs with
the same 'k' should be aggregated, and the reduce function should just iterate
over the values (v1, v2, etc.), right?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


RE: Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Zhang, jian
Hi,

Please read this: you need to implement a Partitioner.
The Partitioner controls which key is sent to which reducer. If you want each
key to end up in exactly one reducer's output, you need to implement a
Partitioner, and your compareTo function must work properly.
[WIKI]
Partitioner

Partitioner partitions the key space.

Partitioner controls the partitioning of the keys of the intermediate 
map-outputs. The key (or a subset of the key) is used to derive the partition, 
typically by a hash function. The total number of partitions is the same as the 
number of reduce tasks for the job. Hence this controls which of the m reduce 
tasks the intermediate key (and hence the record) is sent to for reduction.

HashPartitioner is the default Partitioner.
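For illustration, a minimal custom Partitioner against the old mapred API could look roughly like this (the key class, its getGroupField() accessor, and the hashing scheme are placeholders, and the exact interface signatures and generics vary a little between Hadoop versions):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch only: route records by one field of a (hypothetical) composite key,
// so every record sharing that field lands on the same reduce task.
public class MyKeyPartitioner implements Partitioner<MyCompositeKey, Text> {

  public void configure(JobConf conf) {
    // nothing to configure in this sketch
  }

  public int getPartition(MyCompositeKey key, Text value, int numReduceTasks) {
    // Partition on the grouping field only, and keep the result non-negative.
    return (key.getGroupField().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would then be registered on the job with something like conf.setPartitionerClass(MyKeyPartitioner.class).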



Best Regards

Jian Zhang


-----Original Message-----
From: Harish Mallipeddi [mailto:[EMAIL PROTECTED]
Sent: 11 April 2008 19:06
To: core-user@hadoop.apache.org
Subject: Problem with key aggregation when number of reduce tasks is more than 1

Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; same keys seem to end up in separate output files
(output/part-0, output/part-1, etc). This should not happen because
right before reduce() gets called, all (k,v) pairs from all map outputs with
the same 'k' are aggregated and the reduce function just iterates over the
values (v1, v2, etc)?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

-- 
Harish Mallipeddi
circos.com : poundbang.in/blog/


mailing list archive broken?

2008-04-11 Thread Adrian Woodhead

I've noticed that the mailing lists archives seem to be broken here:

http://hadoop.apache.org/mail/core-user/

I get a 403 forbidden. Any idea what's going on?

Regards,

Adrian



Using NFS without HDFS

2008-04-11 Thread slitz
Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as Distributed
Filesystem.

Box A: 192.168.2.3, this box is both the NFS server and a slave
node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave

Obviously all three nodes can access the NFS share, and the path to the
share is /home/slitz/warehouse on all three.

My hadoop-site.xml file was copied to all nodes and looks like this:

<configuration>

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.2.30:9001</value>
  <description>
    The host and port that the MapReduce job
    tracker runs at. If "local", then jobs are
    run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/slitz/warehouse/hadoop_service/system</value>
  <description>omgrotfcopterlol.</description>
</property>

</configuration>


As one can see, I'm not using HDFS at all
(because all the free space I have is located on only one node, so using
HDFS would be unnecessary overhead).

I've copied the input folder from Hadoop to /home/slitz/warehouse/input.
When I try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finishes okay, but at the end I get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-141595661
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

I don't know why the input path being looked up is under the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...).

Maybe something is missing in my hadoop-site.xml?



slitz


Hadoop performance in PC cluster

2008-04-11 Thread Yingyuan Cheng

Does anyone run Hadoop on a PC cluster?

I just tested WordCount on a PC cluster, and my first impressions are as follows:

***

Number of PCs: 7 (512MB RAM, 2.8GHz CPU, 100Mbps NIC, CentOS 5.0, Hadoop
0.16.1, Sun JRE 1.6)
Master (Namenode): 1
Master (Jobtracker): 1
Slaves (Datanode & Tasktracker): 5

1. Writing to HDFS
--

File size: 4,295,341,065 bytes(4.1G)
Time elapsed putting file into HDFS: 7m57.757s
Average rate: 8,990,583 bytes/sec
Average bandwidth usage: 68.59%

I also tested libhdfs; it performed about as well as the Java API.


2. Map/Reduce with Java
--

Time elapsed: 19mins, 56sec
Bytes/time rate: 3,591,422 bytes/sec

Job Counters:
Launched map tasks 67
Launched reduce tasks 7
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,923,360
Map input bytes 4,295,341,065
Map output bytes 6,504,944,565
Combine input records 697,923,360
Combine output records 2,330,048
Reduce input groups 5,201
Reduce input records 2,330,048
Reduce output records 5,201

This is acceptable. The main bottleneck was the CPU, which stayed at 100% usage.


3. Map/Reduce with C++ Pipe(No combiner)
--

Time elapsed: 1hrs, 2mins, 47sec
Bytes/time rate: 1,140,255 bytes/sec

Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191

My first impression is that the C++ Pipes interface is slower than Java. If I
add a C++ Pipes combiner, the result becomes even worse: the main bottleneck is
RAM, with a great deal of swap space used, processes blocked, and the CPU
stuck waiting...

Adding more RAM might improve performance, but I think it would still be
slower than Java.


4. Map/Reduce with Python streaming(No combiner)
--

Time elapsed: 1hrs, 48mins, 53sec
Bytes/time rate: 657,483 bytes/sec

Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64

Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191

As you can see, the result is not as good as with the C++ Pipes interface.
Maybe Python itself is slower; I didn't test other cases.

Are there any suggestions for improving this situation?



--
yingyuan



RE: What's the proper way to use hadoop task side-effect files?

2008-04-11 Thread Runping Qi


Looks like you are using your reducer class as the combiner.
The combiner will be called from mappers, potentially multiple
times.

If you want to create side files in the reducer, you cannot use that class
as the combiner.
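In other words, the combiner must be a class that is safe to run inside map tasks. A rough sketch of such a job setup (all class names here are placeholders):

JobConf conf = new JobConf(MyJob.class);

conf.setMapperClass(MyMapper.class);
// The combiner runs inside map tasks, possibly several times,
// so it must not create reducer-only side files.
conf.setCombinerClass(MySideEffectFreeCombiner.class);
// Only the reducer opens the side file, in its configure() method.
conf.setReducerClass(MySideFileReducer.class);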

Runping


 -Original Message-
 From: Zhang, jian [mailto:[EMAIL PROTECTED]
 Sent: Thursday, April 10, 2008 11:17 PM
 To: core-user@hadoop.apache.org
 Subject: What's the proper way to use hadoop task side-effect files?
 
 Hi,
 
 I was new to hadoop. Sorry for my novice question.
 I got some problem while I was trying to use task side-effect files.
 Since there is no code example in wiki, I tried this way:
 
 I override cofigure method in reducer to create a side file,
 
  public void configure(JobConf conf){
      logger.info("Trying to create side files inside reducer!");

      Path workpath = conf.getOutputPath();
      Path sideFile = new Path(workpath, "SideFile.txt");
      try {
        FileSystem fs = FileSystem.get(conf);
        out = fs.create(sideFile);
      } catch (IOException e) {
        logger.error("Failed to create side file!");
      }
  }
 And try to use it in reducer.
 
 But I got some strange problems,
 Even If the method is in reducer Class, mapper tasks are creating the
 side files.
 Mapper tasks hang because there are tring to recreate the file.
 
 org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file
 /data/input/MID06/_temporary/_task_200804112315_0001_m_08_0/SideFile.txt
 for DFSClient_task_200804112315_0001_m_08_0 on client 192.168.0.203
 because current leaseholder is trying to recreate file.
  at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:974)
  at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931)
  at org.apache.hadoop.dfs.NameNode.create(NameNode.java:281)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:585)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:899)
 
 
 Can anybody help me on this, how to use side-effect files?
 
 
 
 Best Regards
 
 Jian Zhang



[HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Alfonso Olias Sanz
Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf in a way that lets me filter which files
have to be processed in that pass (i.e. files whose names start with
Elementary, Source, etc.), so the task only processes those files. For
example, if the folder contains 1000 files and only 50 start with
Elementary, only those 50 will be processed by my task.

I could set up different input folders, each containing the different groups
of files, but I cannot do that.


Any idea?

thanks


Re: Using NFS without HDFS

2008-04-11 Thread slitz
I've read in the archive that it should be possible to use any distributed
filesystem, since the data would be available to all nodes, so it should be
possible to use NFS, right?
I've also read somewhere in the archive that this should be possible...


slitz


On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi [EMAIL PROTECTED]
wrote:

 Hello ,

 To execute Hadoop Map-Reduce job input data should be on HDFS not on
 NFS.

 Thanks

 ---
 Peeyush



 On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:

  Hello,
  I'm trying to assemble a simple setup of 3 nodes using NFS as
 Distributed
  Filesystem.
 
  Box A: 192.168.2.3, this box is either the NFS server and working as a
 slave
  node
  Box B: 192.168.2.30, this box is only JobTracker
  Box C: 192.168.2.31, this box is only slave
 
  Obviously all three nodes can access the NFS shared, and the path to the
  share is /home/slitz/warehouse in all three.
 
  My hadoop-site.xml file were copied over all nodes and looks like this:
 
   <configuration>
  
   <property>
     <name>fs.default.name</name>
     <value>local</value>
     <description>
       The name of the default file system. Either the literal string
       "local" or a host:port for NDFS.
     </description>
   </property>
  
   <property>
     <name>mapred.job.tracker</name>
     <value>192.168.2.30:9001</value>
     <description>
       The host and port that the MapReduce job
       tracker runs at. If "local", then jobs are
       run in-process as a single map and reduce task.
     </description>
   </property>
  
   <property>
     <name>mapred.system.dir</name>
     <value>/home/slitz/warehouse/hadoop_service/system</value>
     <description>omgrotfcopterlol.</description>
   </property>
  
   </configuration>
 
 
  As one can see, i'm not using HDFS at all.
  (Because all the free space i have is located in only one node, so using
  HDFS would be unnecessary overhead)
 
  I've copied the input folder from hadoop to /home/slitz/warehouse/input.
  When i try to run the example line
 
  bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
  /home/slitz/warehouse/output 'dfs[a-z.]+'
 
  the job starts and finish okay but at the end i get this error:
 
  org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist
 :
  /home/slitz/hadoop-0.15.3/grep-temp-141595661
  at
 
 org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
  (...the error stack continues...)
 
  i don't know why the input path being looked is in the local path
  /home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
 
  Maybe something is missing in my hadoop-site.xml?
 
 
 
  slitz



Re: Hadoop performance on EC2?

2008-04-11 Thread Nate Carlson

On Thu, 10 Apr 2008, Ted Dziuba wrote:
I have seen EC2 be slower than a comparable system in development, but 
not by the factors that you're experiencing.  One thing about EC2 that 
has concerned me - you are not guaranteed that your /mnt disk is an 
uncontested spindle. Early on, this was the case, but Amazon made no 
promises.


Interesting! My understanding was that it was. We were using S3 for storage
before and switched to HDFS, and saw similar performance on both for our
needs... we're more CPU-intensive than I/O-intensive.


Also, and this may be a stupid question, are you sure that you're using 
the same JVM in EC2 and development?  GCJ is much slower than Sun's JVM.


Yeah - our code actually requires Sun's Java6u5 JVM.. it won't run on gcj. 
;)



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



Re: Using NFS without HDFS

2008-04-11 Thread Luca

slitz wrote:

I've read in the archive that it should be possible to use any distributed
filesystem since the data is available to all nodes, so it should be
possible to use NFS, right?
I've also read somewere in the archive that this shoud be possible...



As far as I know, you can refer to any file on a mounted file system 
(visible from all compute nodes) using the prefix file:// before the 
full path, unless another prefix has been specified.


Cheers,
Luca



slitz


On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi [EMAIL PROTECTED]
wrote:


Hello ,

To execute Hadoop Map-Reduce job input data should be on HDFS not on
NFS.

Thanks

---
Peeyush



On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:


Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as

Distributed

Filesystem.

Box A: 192.168.2.3, this box is either the NFS server and working as a

slave

node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave

Obviously all three nodes can access the NFS shared, and the path to the
share is /home/slitz/warehouse in all three.

My hadoop-site.xml file were copied over all nodes and looks like this:

<configuration>

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.2.30:9001</value>
  <description>
    The host and port that the MapReduce job
    tracker runs at. If "local", then jobs are
    run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/slitz/warehouse/hadoop_service/system</value>
  <description>omgrotfcopterlol.</description>
</property>

</configuration>


As one can see, i'm not using HDFS at all.
(Because all the free space i have is located in only one node, so using
HDFS would be unnecessary overhead)

I've copied the input folder from hadoop to /home/slitz/warehouse/input.
When i try to run the example line

bin/hadoop jar hadoop-*-examples.jar grep /home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'

the job starts and finish okay but at the end i get this error:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist

:

/home/slitz/hadoop-0.15.3/grep-temp-141595661
at


org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)

i don't know why the input path being looked is in the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)

Maybe something is missing in my hadoop-site.xml?



slitz







MiniDFSCluster error on windows.

2008-04-11 Thread Edward J. Yoon
It occurs only on Windows systems (Cygwin).
Does anyone have a solution?


Testcase: testCosine took 0.708 sec
Caused an ERROR
Address family not supported by protocol family: bind
java.net.SocketException: Address family not supported by protocol family: bind
at sun.nio.ch.Net.bind(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
at org.apache.hadoop.ipc.Server.bind(Server.java:182)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:243)
at org.apache.hadoop.ipc.Server.<init>(Server.java:963)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:393)
at org.apache.hadoop.ipc.RPC.getServer(RPC.java:355)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:122)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:177)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:163)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:866)
at org.apache.hadoop.dfs.MiniDFSCluster.<init>(MiniDFSCluster.java:264)
at org.apache.hadoop.dfs.MiniDFSCluster.<init>(MiniDFSCluster.java:113)

-- 
B. Regards,
Edward J. Yoon


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
One way to do this is to write your own (file) input format. See 
src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to 
override listPaths() in order to have selectivity amongst the files in 
the input folder.
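A rough sketch of that approach against the 0.16-era API (the "Elementary" prefix is just the example from the original question, and listPaths() was later replaced by listStatus(), so check which method your version exposes):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: keep only input files whose names start with "Elementary".
public class ElementaryInputFormat extends TextInputFormat {
  protected Path[] listPaths(JobConf job) throws IOException {
    Path[] all = super.listPaths(job);
    List<Path> kept = new ArrayList<Path>();
    for (Path p : all) {
      if (p.getName().startsWith("Elementary")) {
        kept.add(p);
      }
    }
    return kept.toArray(new Path[kept.size()]);
  }
}

The job would then pick it up via conf.setInputFormat(ElementaryInputFormat.class).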

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general purpose input folder that it is used as input in a
Map/Reduce task. That folder contains files grouped by names.

I want to configure the JobConf in a way I can filter the files that
have to be processed from that pass (ie  files which name starts by
Elementary, or Source etc)  So the task function will only process
those files.  So if the folder contains 1000 files and only 50 start
by Elementary. Only those 50 will be processed by my task.

I could set up different input folders and those containing the
different files, but I cannot do that.


Any idea?

thanks
  




Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf, 
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on 
PathFilter interface.
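For example, a filter along these lines (the "Elementary" prefix is just an illustration, and whether setInputPathFilter is available depends on the Hadoop version):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch: accept only paths whose file names start with "Elementary".
public class ElementaryPathFilter implements PathFilter {
  public boolean accept(Path path) {
    return path.getName().startsWith("Elementary");
  }
}

It would be registered on the job with FileInputFormat.setInputPathFilter(conf, ElementaryPathFilter.class), as suggested above.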

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general purpose input folder that it is used as input in a
Map/Reduce task. That folder contains files grouped by names.

I want to configure the JobConf in a way I can filter the files that
have to be processed from that pass (ie  files which name starts by
Elementary, or Source etc)  So the task function will only process
those files.  So if the folder contains 1000 files and only 50 start
by Elementary. Only those 50 will be processed by my task.

I could set up different input folders and those containing the
different files, but I cannot do that.


Any idea?

thanks
  




Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi Guys, I need to restart discussion around 
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

 I saw the same OOM error in my map-reduce job in the map phase. 

1. I tried changing mapred.child.java.opts (bumped to 600M) 
2. io.sort.mb was kept at 100MB. 

I see the same errors still. 
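For reference, this is roughly how those settings are applied through the JobConf (a sketch only; the job class name is a placeholder and the values are illustrative):

JobConf conf = new JobConf(MyJob.class);      // placeholder job class

// Per-task child JVM heap (item 1 above).
conf.set("mapred.child.java.opts", "-Xmx600m");
// Map-side sort buffer, in MB (item 2 above).
conf.set("io.sort.mb", "100");
// Hint for more, smaller map tasks / input splits.
conf.setNumMapTasks(200);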

I checked, with a debugger, the size of keyValBuffer in collect(); it is always
less than io.sort.mb and is spilled to disk properly.

I tried changing the number of map tasks to a very high number so that the input
is split into smaller chunks. It helped for a while, as the map phase got a bit
further (56% instead of 5%), but I still see the problem.

 I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried adding -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] to the opts to
get the GC log, but I didn't get any log.

 I tried using 'jmap -histo <pid>' to see the heap information; it didn't give
me any meaningful or obvious problem point.

What are the other possible memory hogs during the map phase? Is the input
file chunk kept fully in memory?

Application: 

My map-reduce job runs with about 2GB of input. In the map phase I read each
line and output 5-500 (key, value) pairs, so the intermediate data gets blown
up considerably. Will that be a problem?

The Error file is attached
http://www.nabble.com/file/p16628181/error.txt error.txt 
-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628181p16628181.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Using NFS without HDFS

2008-04-11 Thread slitz
Thank you for the file:/// tip, I was not including it in the paths.
I'm running the example with this line: bin/hadoop jar
hadoop-*-examples.jar grep file:///home/slitz/warehouse/input
file:///home/slitz/warehouse/output 'dfs[a-z.]+'

But I'm getting the same error as before:

org.apache.hadoop.mapred.InvalidInputException: Input path doesn't exist :
/home/slitz/hadoop-0.15.3/grep-temp-1030179831
at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...stack continues...)

I think the problem may be the input path; it should be pointing to some
path in the NFS share, right?

The grep-temp-* dir is being created in the HADOOP_HOME of Box A
(192.168.2.3).

slitz

On Fri, Apr 11, 2008 at 4:06 PM, Luca [EMAIL PROTECTED] wrote:

 slitz wrote:

  I've read in the archive that it should be possible to use any
  distributed
  filesystem since the data is available to all nodes, so it should be
  possible to use NFS, right?
  I've also read somewere in the archive that this shoud be possible...
 
 
 As far as I know, you can refer to any file on a mounted file system
 (visible from all compute nodes) using the prefix file:// before the full
 path, unless another prefix has been specified.

 Cheers,
 Luca



  slitz
 
 
  On Fri, Apr 11, 2008 at 1:43 PM, Peeyush Bishnoi [EMAIL PROTECTED]
  
  wrote:
 
   Hello ,
  
   To execute Hadoop Map-Reduce job input data should be on HDFS not on
   NFS.
  
   Thanks
  
   ---
   Peeyush
  
  
  
   On Fri, 2008-04-11 at 12:40 +0100, slitz wrote:
  
Hello,
I'm trying to assemble a simple setup of 3 nodes using NFS as
   
   Distributed
  
Filesystem.
   
Box A: 192.168.2.3, this box is either the NFS server and working as
a
   
   slave
  
node
Box B: 192.168.2.30, this box is only JobTracker
Box C: 192.168.2.31, this box is only slave
   
Obviously all three nodes can access the NFS shared, and the path to
the
share is /home/slitz/warehouse in all three.
   
My hadoop-site.xml file were copied over all nodes and looks like
this:
   
 <configuration>
    
 <property>
   <name>fs.default.name</name>
   <value>local</value>
   <description>
     The name of the default file system. Either the literal string
     "local" or a host:port for NDFS.
   </description>
 </property>
    
 <property>
   <name>mapred.job.tracker</name>
   <value>192.168.2.30:9001</value>
   <description>
     The host and port that the MapReduce job
     tracker runs at. If "local", then jobs are
     run in-process as a single map and reduce task.
   </description>
 </property>
    
 <property>
   <name>mapred.system.dir</name>
   <value>/home/slitz/warehouse/hadoop_service/system</value>
   <description>omgrotfcopterlol.</description>
 </property>
    
 </configuration>
   
   
As one can see, i'm not using HDFS at all.
(Because all the free space i have is located in only one node, so
using
HDFS would be unnecessary overhead)
   
I've copied the input folder from hadoop to
/home/slitz/warehouse/input.
When i try to run the example line
   
bin/hadoop jar hadoop-*-examples.jar grep
/home/slitz/warehouse/input/
/home/slitz/warehouse/output 'dfs[a-z.]+'
   
the job starts and finish okay but at the end i get this error:
   
org.apache.hadoop.mapred.InvalidInputException: Input path doesn't
exist
   
   :
  
/home/slitz/hadoop-0.15.3/grep-temp-141595661
at
   
org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
  
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:508)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
(...the error stack continues...)
   
i don't know why the input path being looked is in the local path
/home/slitz/hadoop(...) instead of /home/slitz/warehouse/(...)
   
Maybe something is missing in my hadoop-site.xml?
   
   
   
slitz
   
  
 




Re: RE: Problem with key aggregation when number of reduce tasks is more than 1

2008-04-11 Thread Pete Wyckoff

Yes, and as such we've found better load balancing when the number of reduces is a
prime number, although String.hashCode isn't great for short strings.
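For context, the default HashPartitioner boils down to roughly the following (a paraphrase, not the exact source, and the generics may differ by Hadoop version), which is why both the hash quality and the reduce count matter for balancing:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Roughly what the default HashPartitioner does: mask off the sign bit,
// then take the key's hash modulo the number of reduce tasks.
public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf conf) {}

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}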


On 4/11/08 4:16 AM, Zhang, jian [EMAIL PROTECTED] wrote:

 Hi,
 
 Please read this, you need to implement partitioner.
 It controls which key is sent to which reducer, if u want to get unique key
 result, you need to implement partitioner and the compareTO function should
 work properly. 
 [WIKI]
 Partitioner
 
 Partitioner partitions the key space.
 
 Partitioner controls the partitioning of the keys of the intermediate
 map-outputs. The key (or a subset of the key) is used to derive the partition,
 typically by a hash function. The total number of partitions is the same as
 the number of reduce tasks for the job. Hence this controls which of the m
 reduce tasks the intermediate key (and hence the record) is sent to for
 reduction.
 
 HashPartitioner is the default Partitioner.
 
 
 
 Best Regards
 
 Jian Zhang
 
 
 -----Original Message-----
 From: Harish Mallipeddi [mailto:[EMAIL PROTECTED]
 Sent: 11 April 2008 19:06
 To: core-user@hadoop.apache.org
 Subject: Problem with key aggregation when number of reduce tasks is more than 1
 
 Hi all,
 
 I wrote a custom key class (implements WritableComparable) and implemented
 the compareTo() method inside this class. Everything works fine when I run
 the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
 correctly in the output files.
 
 But when I increase the number of reduce tasks, keys don't get aggregated
 properly; same keys seem to end up in separate output files
 (output/part-0, output/part-1, etc). This should not happen because
 right before reduce() gets called, all (k,v) pairs from all map outputs with
 the same 'k' are aggregated and the reduce function just iterates over the
 values (v1, v2, etc)?
 
 Do I need to implement anything else inside my custom key class other than
 compareTo? I also tried implementing equals() but that didn't help either.
 Then I came across setOutputKeyComparator(). So I added a custom Comparator
 class inside the key class and tried setting this on the JobConf object. But
 that didn't work either. What could be wrong?
 
 Cheers,



Re: Hadoop performance on EC2?

2008-04-11 Thread Nate Carlson

On Wed, 9 Apr 2008, Chris K Wensel wrote:
make sure all nodes are running in the same 'availability zone', 
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347


check!


and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101


check!

also, make sure each node is addressing its peers via the ec2 private 
addresses, not the public ones.


check!

there is a patch in jira for the ec2/contrib scripts that address these 
issues.

https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display 
showing utilization on the machines. 8/7 map/reducers sounds like alot.


Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor performance on 
EC2 compared to my development environment.  ;(


I'll be capturing some more extensive stats over the weekend, and see if I 
can glean anything useful...



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|



Re: Hadoop performance on EC2?

2008-04-11 Thread Chris K Wensel

What does ganglia show for load and network?

You should also be able to see gc stats (count and time). Might help  
as well.


FYI, running

 hadoop-ec2 proxy <cluster-name>

will both set up a SOCKS tunnel and list the available URLs you can cut/paste
into your browser. One of the URLs is for the Ganglia interface.


On Apr 11, 2008, at 2:01 PM, Nate Carlson wrote:

On Wed, 9 Apr 2008, Chris K Wensel wrote:

make sure all nodes are running in the same 'availability zone', 
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347


check!


and that you are using the new xen kernels.
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1353&categoryID=101
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1354&categoryID=101


check!

also, make sure each node is addressing its peers via the ec2  
private addresses, not the public ones.


check!

there is a patch in jira for the ec2/contrib scripts that address  
these issues.

https://issues.apache.org/jira/browse/HADOOP-2410

if you use those scripts, you will be able to see a ganglia display  
showing utilization on the machines. 8/7 map/reducers sounds like  
alot.


Reduced - I dropped it to 3/2 for testing.

I am using these scripts now, and am still seeing very poor  
performance on EC2 compared to my development environment.  ;(


I'll be capturing some more extensive stats over the weekend, and  
see if I can glean anything useful...



| nate carlson | [EMAIL PROTECTED] | http://www.natecarlson.com |
|   depriving some poor village of its idiot since 1981|




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: could only be replicated to 0 nodes, instead of 1

2008-04-11 Thread Raghu Angadi

jerrro wrote:


I couldn't find much information about this error, but I did manage to see
somewhere it might mean that there are no datanodes running. But as I said,
start-all does not give any errors. Any ideas what could be problem?


start-all returning does not mean the datanodes are OK. Did you check whether
any datanodes are alive? You can check from http://namenode:50070/.


Raghu.



Re: Does any one tried to build Hadoop..

2008-04-11 Thread Khalil Honsali
My guess is that it's an import problem.
How about changing step 2) to compiler version 6?

On 12/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:

 Java version
 java version 1.6.0_05
 Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
 Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)

 Steps that i did :
 1) Opened a new java project in Eclipse. (From existing directory path).
 2) Modified Java compiler version as 5 in project properties in order
 solve (source level 5 error).
 3) I found that package javax.net.SocketFactory is not resolved then i
 downloaded that package and add to external jars.

 then i got error mentioned below.


 Thanks  Regards,
 Krishna

 - Original Message 

 From: Khalil Honsali [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org

 Sent: Friday, 11 April, 2008 6:54:46 PM
 Subject: Re: Does any one tried to build Hadoop..

 what is your java version? also please describe exactly what you've done

 On 11/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
 
  I Tried in both ways i am still i am getting some errors
 
  --- import org.apache.tools.ant.BuildException; (error: cannot be
  resolved..)
  --- public Socket createSocket() throws IOException {
  --- s = socketFactory.createSocket(); (error:  incorrect parameters)
 
  earlier it failed to resolve this package (javax.net.SocketFactory;)
 then
  i add that jar file in project.
 
  Thanks  Regards,
  Krishna.
 
 
  - Original Message 
  From: Jean-Daniel Cryans [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Thursday, 10 April, 2008 4:07:34 PM
  Subject: Re: Does any one tried to build Hadoop..
 
  At the root of the source and it's called build.xml
 
  Jean-Daniel
 
  2008/4/9, Khalil Honsali [EMAIL PROTECTED]:
  
   Mr. Jean-Daniel,
  
   where is the ant script please?
  
  
   On 10/04/2008, Jean-Daniel Cryans [EMAIL PROTECTED] wrote:
   
The ANT script works well also.
   
Jean-Daniel
   
2008/4/9, Khalil Honsali [EMAIL PROTECTED]:
   

 Hi,
 With eclise it's easy, you just have to add it as a new project,
  make
sure
 you add all libraries in folder lib and should compile fine
 There is also an eclipse plugin for running hadoop jobs directly
  from
 eclipse on an installed hadoop .


 On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
 
 
  Does any one tried to build Hadoop ?
 
  Thanks  Regards,
  Krishna.
 
 

   
  
  
  
  
   --
  
 
 
 









Re: Does any one tried to build Hadoop..

2008-04-11 Thread Khalil Honsali
I now understand your problem; I replicated it.
If you load the build.xml from Eclipse and go to Properties > Build
Path > Libraries, you'll find a JRE_LIB entry; remove that one and add the JRE System
Library.
Hope it solves it.

On 12/04/2008, Khalil Honsali [EMAIL PROTECTED] wrote:

 my guess it's an import problem..
 how about changing 2) to version 6 for compiler version?

 On 12/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
 
  Java version
  java version 1.6.0_05
  Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
  Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
 
  Steps that i did :
  1) Opened a new java project in Eclipse. (From existing directory path).
  2) Modified Java compiler version as 5 in project properties in order
  solve (source level 5 error).
  3) I found that package javax.net.SocketFactory is not resolved then i
  downloaded that package and add to external jars.
 
  then i got error mentioned below.
 
 
  Thanks  Regards,
  Krishna
 
  - Original Message 
 
  From: Khalil Honsali [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
 
  Sent: Friday, 11 April, 2008 6:54:46 PM
  Subject: Re: Does any one tried to build Hadoop..
 
  what is your java version? also please describe exactly what you've done
 
  On 11/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
  
   I Tried in both ways i am still i am getting some errors
  
   --- import org.apache.tools.ant.BuildException; (error: cannot be
   resolved..)
   --- public Socket createSocket() throws IOException {
   --- s = socketFactory.createSocket(); (error:  incorrect parameters)
  
   earlier it failed to resolve this package (javax.net.SocketFactory;)
  then
   i add that jar file in project.
  
   Thanks  Regards,
   Krishna.
  
  
   - Original Message 
   From: Jean-Daniel Cryans [EMAIL PROTECTED]
   To: [EMAIL PROTECTED]
   Sent: Thursday, 10 April, 2008 4:07:34 PM
   Subject: Re: Does any one tried to build Hadoop..
  
   At the root of the source and it's called build.xml
  
   Jean-Daniel
  
   2008/4/9, Khalil Honsali [EMAIL PROTECTED]:
   
Mr. Jean-Daniel,
   
where is the ant script please?
   
   
On 10/04/2008, Jean-Daniel Cryans [EMAIL PROTECTED] wrote:

 The ANT script works well also.

 Jean-Daniel

 2008/4/9, Khalil Honsali [EMAIL PROTECTED]:

 
  Hi,
  With eclise it's easy, you just have to add it as a new project,
   make
 sure
  you add all libraries in folder lib and should compile fine
  There is also an eclipse plugin for running hadoop jobs directly
   from
  eclipse on an installed hadoop .
 
 
  On 10/04/2008, krishna prasanna [EMAIL PROTECTED] wrote:
  
  
   Does any one tried to build Hadoop ?
  
   Thanks  Regards,
   Krishna.
  
  
 

   
   
   
   
--
   
  
  
  
 
 
 
 
 
 
 






--