Re: Handling bad records

2012-02-28 Thread madhu phatak
Hi Mohit,
 A and B refer to two different named outputs (the multi-name part of the output file name). The file
names will be seq-A* and seq-B*. It's similar to the "r" in part-r-00000.
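
For readers of the archive, a minimal sketch of how that fits together with the old mapred API. The named output "seq" and the multi-names "A"/"B" come from the MultipleOutputs javadoc example linked below in the thread; the reducer class, key/value types and the bad-record branch are illustrative assumptions, not code from this thread.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class BadRecordReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    // The driver must have declared the named output beforehand, e.g.:
    //   MultipleOutputs.addMultiNamedOutput(conf, "seq",
    //       SequenceFileOutputFormat.class, Text.class, Text.class);
    mos = new MultipleOutputs(job);
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      // good records go to the regular part-* files
      output.collect(key, value);
      // a bad record could instead be diverted to named output "seq",
      // multi-name "A", which ends up in its own seq-A* files:
      // mos.getCollector("seq", "A", reporter).collect(key, value);
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();  // flush the extra named outputs
  }
}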

On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 Thanks that's helpful. In that example what is A and B referring to? Is
 that the output file name?

 mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
 mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));


 On Mon, Feb 27, 2012 at 9:53 PM, Harsh J ha...@cloudera.com wrote:

  Mohit,
 
  Use the MultipleOutputs API:
 
 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
  to have a named output of bad records. There is an example of use
  detailed on the link.
 
  On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
   What's the best way to write records to a different file? I am doing xml
   processing and during processing I might come across invalid xml format.
   Currently I have it under a try/catch block and am writing to log4j. But I
   think it would be better to just write it to an output file that just
   contains errors.
 
 
 
  --
  Harsh J
 




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: dfs.block.size

2012-02-28 Thread madhu phatak
You can use FileSystem.getFileStatus(Path p) which gives you the block size
specific to a file.
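
To make both answers in this thread concrete, here is a small sketch: reading the block size recorded for an existing file via getFileStatus(), and writing a new file with a per-file block size. The paths and the 128 MB value are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // 1) Block size actually recorded for an existing file
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    System.out.println(args[0] + ": blockSize=" + status.getBlockSize()
        + " len=" + status.getLen());

    // 2) Write a new file with its own block size (here 128 MB),
    //    overriding the cluster-wide dfs.block.size default
    long blockSize = 128L * 1024 * 1024;
    FSDataOutputStream out = fs.create(new Path(args[1]), true,
        conf.getInt("io.file.buffer.size", 4096),
        fs.getDefaultReplication(), blockSize);
    out.writeUTF("hello");
    out.close();
  }
}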

On Tue, Feb 28, 2012 at 2:50 AM, Kai Voigt k...@123.org wrote:

 hadoop fsck filename -blocks is something that I think of quickly.

 http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck has
 more details

 Kai

 On 28.02.2012 at 02:30, Mohit Anchlia wrote:

  How do I verify the block size of a given file? Is there a command?
 
  On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria j...@cloudera.com
 wrote:
 
  dfs.block.size can be set per job.
 
  mapred.tasktracker.map.tasks.maximum is per tasktracker.
 
  -Joey
 
  On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia mohitanch...@gmail.com
 
  wrote:
  Can someone please suggest if parameters like dfs.block.size,
  mapred.tasktracker.map.tasks.maximum are only cluster wide settings or
  can
  these be set per client job configuration?
 
  On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
  If I want to change the block size then can I use Configuration in
  mapreduce job and set it when writing to the sequence file or does it
  need
  to be cluster wide setting in .xml files?
 
   Also, is there a way to check the block size of a given file?
 
 
 
 
  --
  Joseph Echeverria
  Cloudera, Inc.
  443.305.9434
 

 --
 Kai Voigt
 k...@123.org







-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Handling bad records

2012-02-28 Thread Subir S
Can multiple output be used with Hadoop Streaming?

On Tue, Feb 28, 2012 at 2:07 PM, madhu phatak phatak@gmail.com wrote:

 Hi Mohit ,
  A and B refers to two different output files (multipart name). The file
 names will be seq-A* and seq-B*.  Its similar to r in part-r-0

 On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Thanks that's helpful. In that example what is A and B referring to?
 Is
  that the output file name?
 
   mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
   mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
 
 
  On Mon, Feb 27, 2012 at 9:53 PM, Harsh J ha...@cloudera.com wrote:
 
   Mohit,
  
   Use the MultipleOutputs API:
  
  
 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
   to have a named output of bad records. There is an example of use
   detailed on the link.
  
   On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch...@gmail.com
 
   wrote:
What's the best way to write records to a different file? I am doing
  xml
processing and during processing I might come accross invalid xml
  format.
Current I have it under try catch block and writing to log4j. But I
  think
it would be better to just write it to an output file that just
  contains
errors.
  
  
  
   --
   Harsh J
  
 



 --
 Join me at http://hadoopworkshop.eventbrite.com/



Re: ClassNotFoundException: -libjars not working?

2012-02-28 Thread madhu phatak
Hi,
 -libjars doesn't always work. A better way is to create a runnable jar with
all the dependencies (if the number of dependencies is small), or you have to put
the jars into Hadoop's lib folder on all machines.
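
One frequent reason -libjars appears not to work (not stated in this thread, so treat it as an aside): the option is only honoured when the driver goes through GenericOptionsParser, i.e. ToolRunner. A minimal sketch of that pattern; the class name, job name and paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LibJarsDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already carries whatever -libjars / -files / -D options
    // GenericOptionsParser extracted from the command line.
    JobConf conf = new JobConf(getConf(), LibJarsDriver.class);
    conf.setJobName("libjars-example");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before handing the rest to run().
    System.exit(ToolRunner.run(new Configuration(), new LibJarsDriver(), args));
  }
}

Invoked as: hadoop jar myjob.jar LibJarsDriver -libjars dep1.jar,dep2.jar in out. If the driver reads args directly instead of going through ToolRunner, -libjars is silently ignored.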

On Wed, Feb 22, 2012 at 8:13 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

 Hello,

 I'm trying to run a map-reduce job and I get ClassNotFoundException, but I
 have the class submitted with -libjars. What's wrong with how I do things?
 Please help.

 I'm running hadoop-0.20.2-cdh3u1, and I have everything on the -libjars
 line. The job is submitted via a java app like:

  exec /usr/lib/jvm/java-6-sun/bin/java -Dproc_jar -Xmx200m -server
 -Dhadoop.log.dir=/opt/ui/var/log/mailsearch
 -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop
 -Dhadoop.id.str=hbase -Dhadoop.root.logger=INFO,console
 -Dhadoop.policy.file=hadoop-policy.xml -classpath
 '/usr/lib/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.
 jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-core-0.20.2-
 cdh3u1.jar:/usr/lib/hadoop/lib/ant-contrib-1.0b3.jar:/
 usr/lib/hadoop/lib/apache-log4j-extras-1.1.jar:/usr/lib/
 hadoop/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop/lib/
 aspectjtools-1.6.5.jar:/usr/lib/hadoop/lib/commons-cli-1.
 2.jar:/usr/lib/hadoop/lib/commons-codec-1.4.jar:/usr/
 lib/hadoop/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop/lib/
 commons-el-1.0.jar:/usr/lib/hadoop/lib/commons-httpclient-
 3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/
 usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/
 lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/
 core-3.1.1.jar:/usr/lib/hadoop/lib/hadoop-fairscheduler-0.20.2-cdh3u1.
 jar:/usr/lib/hadoop/lib/hsqldb-1.8.0.10.jar:/usr/lib/
 hadoop/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop/lib/
 jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop/lib/jasper-
 compiler-5.5.12.jar:/usr/lib/hadoop/lib/jasper-runtime-5.5.
 12.jar:/usr/lib/hadoop/lib
 /jcl-over-slf4j-1.6.1.jar:/usr/lib/hadoop/lib/jets3t-0.6.
 1.jar:/usr/lib/hadoop/lib/jetty-6.1.26.jar:/usr/lib/
 hadoop/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/
 hadoop/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop/lib/jsch-
 0.1.42.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/
 hadoop/lib/kfs-0.2.2.jar:/usr/lib/hadoop/lib/log4j-1.2.15.
 jar:/usr/lib/hadoop/lib/mockito-all-1.8.2.jar:/usr/
 lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-
 api-2.5-20081211.jar:/usr/lib/hadoop/lib/servlet-api-2.5-6.
 1.14.jar:/usr/lib/hadoop/lib/slf4j-api-1.6.1.jar:/usr/lib/
 hadoop/lib/slf4j-log4j12-1.6.1.jar:/usr/lib/hadoop/lib/
 xmlenc-0.52.jar:/usr/lib/hadoop/lib/jsp-2.1/jsp-2.1.
 jar:/usr/lib/hadoop/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/
 mailbox-convertor/lib/*:/usr/lib/hadoop/contrib/capacity-
 scheduler/hadoop-capacity-scheduler-0.20.2-cdh3u1.jar:/
 usr/lib/hbase/lib/hadoop-lzo-0.4.13.jar:/usr/lib/hbase/
 hbase.jar:/etc/hbase/conf:/usr/lib/hbase/lib:/usr/lib/
 zookeeper/zookeeper.jar:/usr/lib/hadoop/contrib
 /capacity-scheduler/hadoop-capacity-scheduler-0.20.2-
 cdh3u1.jar:/usr/lib/hbase/lib/hadoop-lzo-0.4.13.jar:/usr/
 lib/hbase/hbase.jar:/etc/hbase/conf:/usr/lib/hbase/lib:
 /usr/lib/zookeeper/zookeeper.jar' org.apache.hadoop.util.RunJar
 /usr/share/mailbox-convertor/mailbox-convertor-0.1-SNAPSHOT.jar
 -libjars=/usr/share/mailbox-convertor/lib/antlr-2.7.7.jar,
 /usr/share/mailbox-convertor/lib/aopalliance-1.0.jar,/usr/
 share/mailbox-convertor/lib/asm-3.1.jar,/usr/share/
 mailbox-convertor/lib/backport-util-concurrent-3.1.
 jar,/usr/share/mailbox-convertor/lib/cglib-2.2.jar,/
 usr/share/mailbox-convertor/lib/hadoop-ant-3.0-u1.pom,/
 usr/share/mailbox-convertor/lib/speed4j-0.9.jar,/usr/
 share/mailbox-convertor/lib/jamm-0.2.2.jar,/usr/share/
 mailbox-convertor/lib/uuid-3.2.0.jar,/usr/share/mailbox-
 convertor/lib/high-scale-lib-1.1.1.jar,/usr/share/mailbox-
 convertor/lib/jsr305-1.3.9.jar,/usr/share/mailbox-
 convertor/lib/guava-11.0.1.jar,/usr/share/mailbox-
 convertor/lib/protobuf-java-2.4.0a.jar,/usr/share/mailbox-
 convertor/lib/concurrentlinkedhashmap-lru-1.1.jar,/usr/share/mailbox-
 convertor/lib/json-simple-1.1.jar,/usr/share/mailbox-
 convertor/lib/itext-2.1.5.jar,/usr/share/mailbox-convertor/
 lib/jmxtools-1.2.1.jar,/usr/share/mailbox-convertor/lib/
 jersey-client-1.4.jar,/usr/share/mailbox-converto
 r/lib/jersey-core-1.4.jar,/usr/share/mailbox-convertor/
 lib/jersey-json-1.4.jar,/usr/share/mailbox-convertor/lib/
 jersey-server-1.4.jar,/usr/share/mailbox-convertor/lib/
 jmxri-1.2.1.jar,/usr/share/mailbox-convertor/lib/jaxb-
 impl-2.1.12.jar,/usr/share/mailbox-convertor/lib/xstream-
 1.2.2.jar,/usr/share/mailbox-convertor/lib/commons-metrics-
 1.3.jar,/usr/share/mailbox-convertor/lib/commons-
 monitoring-2.9.1.jar,/usr/share/mailbox-convertor/lib/
 

Re: Setting eclipse for map reduce using maven

2012-02-28 Thread madhu phatak
Hi,
 Find the Maven definitions for the Hadoop core jars here:
http://search.maven.org/#browse|-856937612

On Tue, Feb 21, 2012 at 10:48 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am trying to search for dependencies that would help me get started with
 developing map reduce in eclipse and I prefer to use maven for this.

 Could someone point me in the right direction?




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Subir S
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

Read this link; your options below are wrong.



On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath austi...@gmail.com wrote:

 When I am using more than one reducer in hadoop streaming, where I am using
 my custom separator rather than the tab, it looks like the hadoop shuffling
 process is not happening as it should.

 This is the reducer output when I am using '\t' to separate my key value
 pair that is output from the mapper.

 *output from reducer 1:*
 10321,22
 23644,37
 41231,42
 23448,20
 12325,39
 71234,20
 *output from reducer 2:*
 24123,43
 33213,46
 11321,29
 21232,32

 The above output is as expected: the first column is the key and the second
 value is the count. There are 10 unique keys; 6 of them are in the output of
 the first reducer and the remaining 4 in the second reducer's output.

 But now when I use a custom separator for my key value pair output from my
 mapper. Here I am using '*' as the separator:
 -D stream.mapred.output.field.separator=*
 -D mapred.reduce.tasks=2

 *output from reducer 1:*
 10321,5
 21232,19
 24123,16
 33213,28
 23644,21
 41231,12
 23448,18
 11321,29
 12325,24
 71234,9
 *output from reducer 2:*
 10321,17
 21232,13
 33213,18
 23644,16
 41231,30
 23448,2
 24123,27
 12325,15
 71234,11

 Now both reducers are getting all the keys; part of the values for a key go to
 reducer 1 and part go to reducer 2.
 Why is it behaving like this when I am using a custom separator? Shouldn't
 each reducer get its own distinct set of keys after the shuffle?
 I am using Hadoop 0.20.205.0 and below is the command that I am using to
 run hadoop streaming. Are there more options that I should specify for
 hadoop streaming to work properly when I am using a custom separator?

 hadoop jar
 $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
 -D stream.mapred.output.field.separator=*
 -D mapred.reduce.tasks=2
 -mapper ./map.py
 -reducer ./reducer.py
 -file ./map.py
 -file ./reducer.py
 -input /user/inputdata
 -output /user/outputdata
 -verbose


 Any help is much appreciated,
 Thanks,
 Austin



Re: Difference between hdfs dfs and hdfs fs

2012-02-28 Thread madhu phatak
Hi Mohit,
 fs is a generic filesystem command which can point to any file system
(LocalFileSystem, HDFS, etc.), but dfs is specific to HDFS. So when you use fs it
can copy from the local file system to HDFS, but when you specify dfs the source
file has to be on HDFS.

On Tue, Feb 21, 2012 at 10:46 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 What's the difference between the hdfs dfs and hdfs fs commands? When I run hdfs
 dfs -copyFromLocal /assa . and use pig it can't find it, but when I use hdfs
 fs pig is able to find the file.




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: HDFS problem in hadoop 0.20.203

2012-02-28 Thread madhu phatak
Hi,
 Did you format the HDFS?

On Tue, Feb 21, 2012 at 7:40 PM, Shi Yu sh...@uchicago.edu wrote:

 Hi Hadoopers,

 We are experiencing a strange problem on Hadoop 0.20.203

 Our cluster has 58 nodes, everything is started from a fresh
 HDFS (we deleted all local folders on datanodes and
 reformatted the namenode). After running some small jobs, the
 HDFS starts behaving abnormally and the jobs become very
 slow. The namenode log is swamped by gigabytes of errors like
 this:

 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_4524177823306792294 is added
 to invalidSet of 10.105.19.31:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_4524177823306792294 is added
 to invalidSet of 10.105.19.18:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_4524177823306792294 is added
 to invalidSet of 10.105.19.32:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_2884522252507300332 is added
 to invalidSet of 10.105.19.35:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_2884522252507300332 is added
 to invalidSet of 10.105.19.27:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_2884522252507300332 is added
 to invalidSet of 10.105.19.33:50010
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.21:50010 is added to blk_-
 6843171124277753504_2279882 size 124490
 2012-02-21 00:00:38,632 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.allocateBlock:
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000313_0/result_stem-m-00313. blk_-
 6379064588594672168_2279890
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.26:50010 is added to blk_5338983375361999760_2279887
 size 1476
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.29:50010 is added to blk_-977828927900581074_2279887
 size 13818
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: DIR*
 NameSystem.completeFile: file
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000364_0/result_stem-m-00364 is closed by
 DFSClient_attempt_201202202043_0013_m_000364_0
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.23:50010 is added to blk_5338983375361999760_2279887
 size 1476
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.20:50010 is added to blk_5338983375361999760_2279887
 size 1476
 2012-02-21 00:00:38,633 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.allocateBlock:
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000364_0/result_suffix-m-00364.
 blk_1921685366929756336_2279890
 2012-02-21 00:00:38,634 INFO
 org.apache.hadoop.hdfs.StateChange: DIR*
 NameSystem.completeFile: file
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000279_0/result_suffix-m-00279 is closed by
 DFSClient_attempt_201202202043_0013_m_000279_0
 2012-02-21 00:00:38,635 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_495061820035691700 is added
 to invalidSet of 10.105.19.20:50010
 2012-02-21 00:00:38,635 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_495061820035691700 is added
 to invalidSet of 10.105.19.25:50010
 2012-02-21 00:00:38,635 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addToInvalidates: blk_495061820035691700 is added
 to invalidSet of 10.105.19.33:50010
 2012-02-21 00:00:38,635 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.allocateBlock:
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000284_0/result_stem-m-00284.
 blk_8796188324642771330_2279891
 2012-02-21 00:00:38,638 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.34:50010 is added to blk_-977828927900581074_2279887
 size 13818
 2012-02-21 00:00:38,638 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.allocateBlock:
 /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
 043_0013_m_000296_0/result_stem-m-00296. blk_-
 6800409224007034579_2279891
 2012-02-21 00:00:38,638 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated:
 10.105.19.29:50010 is added to blk_1921685366929756336_2279890
 size 1511
 2012-02-21 00:00:38,638 INFO
 org.apache.hadoop.hdfs.StateChange: BLOCK*
 

Re: PathFilter File Glob

2012-02-28 Thread Idris Ali
Hi,

Why not just use:
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(path+filter));

Thanks,
-Idris

On Mon, Feb 27, 2012 at 1:06 PM, Harsh J ha...@cloudera.com wrote:

 Hi Simon,

 You need to implement your custom PathFilter derivative class, and
 then set it via your {File}InputFormat class using setInputPathFilter:


 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPathFilter(org.apache.hadoop.mapred.JobConf,%20java.lang.Class)

 (TextInputFormat is a derivative of FileInputFormat, and hence has the
 same method.)
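
 A minimal sketch of what that looks like with the old mapred API; the regex, input path and class names are made up for illustration.

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.fs.PathFilter;
 import org.apache.hadoop.mapred.FileInputFormat;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.TextInputFormat;

 public class FilteredInputExample {

   // Accepts only file names matching a regex. The class must be public with a
   // no-arg constructor so the framework can instantiate it by reflection.
   public static class RegexPathFilter implements PathFilter {
     public boolean accept(Path path) {
       return path.getName().matches("events-.*\\.log");
     }
   }

   public static void configure(JobConf conf) {
     conf.setInputFormat(TextInputFormat.class);
     FileInputFormat.setInputPaths(conf, new Path("/input"));
     // Applied when the input format lists the contents of the input directories.
     FileInputFormat.setInputPathFilter(conf, RegexPathFilter.class);
   }
 }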

 HTH.

 2012/2/23 Heeg, Simon s.h...@telekom.de:
  Hello,
 
  I would like to use a PathFilter to filter, with a regular expression, the files
 that are read by the TextInputFormat, but I don't know how to
 apply the filter. I cannot find a setter. Unfortunately google was not my
 friend with this issue and the Definitive Guide does not help that much.
  I am using Hadoop 0.20.2-cdh3u3.
 

 --
 Harsh J



Re: hadoop streaming : need help in using custom key value separator

2012-02-28 Thread Austin Chungath
Thanks Subir,

-D stream.mapred.output.field.separator=* is not an available option, my bad.
What I should have done is:

-D stream.map.output.field.separator=*
On Tue, Feb 28, 2012 at 2:36 PM, Subir S subir.sasiku...@gmail.com wrote:


 http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs

 Read this link, your options are wrong below.



 On Tue, Feb 28, 2012 at 1:13 PM, Austin Chungath austi...@gmail.com
 wrote:

  When I am using more than one reducer in hadoop streaming where I am
 using
  my custom separater rather than the tab, it looks like the hadoop
 shuffling
  process is not happening as it should.
 
  This is the reducer output when I am using '\t' to separate my key value
  pair that is output from the mapper.
 
  *output from reducer 1:*
  10321,22
  23644,37
  41231,42
  23448,20
  12325,39
  71234,20
  *output from reducer 2:*
  24123,43
  33213,46
  11321,29
  21232,32
 
  the above output is as expected the first column is the key and the
 second
  value is the count. There are 10 unique keys and 6 of them are in output
 of
  the first reducer and the remaining 4 int the second reducer output.
 
  But now when I use a custom separater for my key value pair output from
 my
  mapper. Here I am using '*' as the separator
  -D stream.mapred.output.field.separator=*
  -D mapred.reduce.tasks=2
 
  *output from reducer 1:*
  10321,5
  21232,19
  24123,16
  33213,28
  23644,21
  41231,12
  23448,18
  11321,29
  12325,24
  71234,9
  * *
  *output from reducer 2:*
   10321,17
  21232,13
  33213,18
  23644,16
  41231,30
  23448,2
  24123,27
  12325,15
  71234,11
 
  Now both the reducers are getting all the keys and part of the values go
 to
  reducer 1 and part of the reducer go to reducer 2.
  Why is it behaving like this when I am using a custom separator,
 shouldn't
  each reducer get a unique key after the shuffling?
  I am using Hadoop 0.20.205.0 and below is the command that I am using to
  run hadoop streaming. Is there some more options that I should specify
 for
  hadoop streaming to work properly if I am using a custom separator?
 
  hadoop jar
  $HADOOP_PREFIX/contrib/streaming/hadoop-streaming-0.20.205.0.jar
  -D stream.mapred.output.field.separator=*
  -D mapred.reduce.tasks=2
  -mapper ./map.py
  -reducer ./reducer.py
  -file ./map.py
  -file ./reducer.py
  -input /user/inputdata
  -output /user/outputdata
  -verbose
 
 
  Any help is much appreciated,
  Thanks,
  Austin
 



LZO exception decompressing (returned -8)

2012-02-28 Thread Marc Sturlese
Hey there,
I've been running a cluster for over a year and was getting an lzo
decompression exception less than once a month. Suddenly it happens almost
once per day. Any ideas what could be causing it? I'm on hadoop 0.20.2.
I've thought of moving to snappy, but I would like to know why this happens
more often now.

The exception happens always when the reducer gets data from the map and
looks like:

Error: java.lang.InternalError: lzo1x_decompress returned: -8
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
Method)
at
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)

Thanks in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783652.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Need help on hadoop eclipse plugin

2012-02-28 Thread praveenesh kumar
So I made the above changes using WinRAR; it embedded those changes inside
the jar itself, so I didn't need to extract the jar contents and construct a new
jar again.
I then replaced the old jar with this new jar and restarted Eclipse with
eclipse -clean.
I am now able to run the hadoop eclipse plugin without any error in Eclipse
Helios 3.6.2.

However, now I am looking to use the same plugin in IBM RAD 8.0.
I am getting the following error in the .log:

!ENTRY org.eclipse.core.jobs 4 2 2012-02-28 05:26:12.056
!MESSAGE An internal error occurred during: Connecting to DFS lxe9700.
!STACK 0
java.lang.NoClassDefFoundError:
org.apache.hadoop.security.UserGroupInformation (initialization failure)
at java.lang.J9VMInternals.initialize(Unknown Source)
at org.apache.hadoop.fs.FileSystem$Cache$Key.init(Unknown Source)
at org.apache.hadoop.fs.FileSystem$Cache.get(Unknown Source)
at org.apache.hadoop.fs.FileSystem.get(Unknown Source)
at org.apache.hadoop.fs.FileSystem.get(Unknown Source)
at org.apache.hadoop.eclipse.server.HadoopServer.getDFS(Unknown Source)
at org.apache.hadoop.eclipse.dfs.DFSPath.getDFS(Unknown Source)
at
org.apache.hadoop.eclipse.dfs.DFSFolder.loadDFSFolderChildren(Unknown
Source)
at org.apache.hadoop.eclipse.dfs.DFSFolder$1.run(Unknown Source)
at org.eclipse.core.internal.jobs.Worker.run(Unknown Source)

I downloaded the Oracle JDK and changed IBM RAD to use Oracle JDK 1.7, but
I am still seeing the above error.
Can anyone help me debug this issue?

Thanks,
Praveenesh

On Tue, Feb 28, 2012 at 1:12 PM, praveenesh kumar praveen...@gmail.com wrote:

 Hi all,

 I am trying to use the hadoop eclipse plugin on my windows machine to connect
 to my remote hadoop cluster. I am currently using putty to log in to the
 cluster, so ssh is enabled and my windows machine can reach my
 hadoop cluster.

 I am using hadoop 0.20.205, hadoop-eclipse plugin -0.20.205.jar . eclipse
 helios Version: 3.6.2,  Oracle JDK 1.7

 If I use the original eclipse-plugin.jar by putting it inside my
 $ECLIPSE_HOME/dropins or /plugins folder, I am able to see the Hadoop
 map-reduce perspective.

 But after specifying the hadoop NN / JT connections, I see the following
 error whenever I try to access HDFS.

 An internal error occurred during: Connecting to DFS lxe9700.
 org/apache/commons/configuration/Configuration

 Connecting to DFS lxe9700' has encountered a problem.
 An internal error occured during  Connecting to DFS

 Looking at the .log file, I see the following lines:

 !MESSAGE An internal error occurred during: Connecting to DFS lxe9700.
 !STACK 0
 java.lang.NoClassDefFoundError:
 org/apache/commons/configuration/Configuration
 at
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:37)
 at
 org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.clinit(DefaultMetricsSystem.java:34)
 at
 org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
 at
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:196)
 at
 org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
 at
 org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
 at
 org.apache.hadoop.security.KerberosName.clinit(KerberosName.java:83)
 at
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:189)
 at
 org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
 at
 org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
 at
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
 at
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
 at
 org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1436)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1337)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:244)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:122)
 at
 org.apache.hadoop.eclipse.server.HadoopServer.getDFS(HadoopServer.java:469)
 at org.apache.hadoop.eclipse.dfs.DFSPath.getDFS(DFSPath.java:146)
 at
 org.apache.hadoop.eclipse.dfs.DFSFolder.loadDFSFolderChildren(DFSFolder.java:61)
 at org.apache.hadoop.eclipse.dfs.DFSFolder$1.run(DFSFolder.java:178)
 at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.commons.configuration.Configuration
 at
 org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:506)
 at
 org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
 at
 org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
 at
 

Re: ClassNotFoundException: -libjars not working?

2012-02-28 Thread Ioan Eugen Stan

On 28.02.2012 10:58, madhu phatak wrote:

Hi,
  -libjars doesn't always work.Better way is to create a runnable jar with
all dependencies ( if no of dependency is less) or u have to keep the jars
into the lib folder of the hadoop in all machines.



Thanks for the reply Madhu,

I adopted the second solution, as explained in [1]. From what I found
browsing the net it seems that -libjars is broken in hadoop version  
0.18. I didn't get time to check the code yet. The Cloudera-released hadoop
sources are packaged a bit oddly and NetBeans doesn't seem to play well
with that, which really saps my will to try to fix the problem.


-libjars is a nice feature that permits the use of skinny jars and 
would help system admins do better packaging. It also allows better 
control over the classpath. Too bad it didn't work.



[1] 
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/


Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Which version of the Hadoop LZO library are you using? It looks like something 
I'm pretty sure was fixed in a newer version. 

-Joey



On Feb 28, 2012, at 4:58, Marc Sturlese marc.sturl...@gmail.com wrote:

 Hey there,
 I've been running a cluster for over a year and was getting a lzo
 decompressing exception less than once a month. Suddenly it happens almost
 once per day. Any ideas what could be causing it? I'm with hadoop 0.20.2
 I've thought in moving to snappy but would like to know why this happens
 more often now
 
 The exception happens always when the reducer gets data from the map and
 looks like:
 
 Error: java.lang.InternalError: lzo1x_decompress returned: -8
at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
 Method)
at
 com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
at
 org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
at
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)
 
 Thanks in advance.
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783652.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Marc Sturlese
I'm with 0.4.9 (think is the latest)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783927.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-02-28 Thread Joey Echeverria
Try 0.4.15. You can get it from here:

https://github.com/toddlipcon/hadoop-lzo

Sent from my iPhone

On Feb 28, 2012, at 6:49, Marc Sturlese marc.sturl...@gmail.com wrote:

 I'm with 0.4.9 (think is the latest)
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3783927.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Handling bad records

2012-02-28 Thread Harsh J
Subir,

No, not unless you use a specialized streaming library (pydoop, dumbo,
etc. for python, for example).

On Tue, Feb 28, 2012 at 2:19 PM, Subir S subir.sasiku...@gmail.com wrote:
 Can multiple output be used with Hadoop Streaming?

 On Tue, Feb 28, 2012 at 2:07 PM, madhu phatak phatak@gmail.com wrote:

 Hi Mohit ,
  A and B refers to two different output files (multipart name). The file
 names will be seq-A* and seq-B*.  Its similar to r in part-r-0

 On Tue, Feb 28, 2012 at 11:37 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Thanks that's helpful. In that example what is A and B referring to?
 Is
  that the output file name?
 
   mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
   mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));
 
 
  On Mon, Feb 27, 2012 at 9:53 PM, Harsh J ha...@cloudera.com wrote:
 
   Mohit,
  
   Use the MultipleOutputs API:
  
  
 
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
   to have a named output of bad records. There is an example of use
   detailed on the link.
  
   On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch...@gmail.com
 
   wrote:
What's the best way to write records to a different file? I am doing
  xml
processing and during processing I might come accross invalid xml
  format.
Current I have it under try catch block and writing to log4j. But I
  think
it would be better to just write it to an output file that just
  contains
errors.
  
  
  
   --
   Harsh J
  
 



 --
 Join me at http://hadoopworkshop.eventbrite.com/




-- 
Harsh J


Should splittable Gzip be a core hadoop feature?

2012-02-28 Thread Niels Basjes
Hi,

Some time ago I had an idea and implemented it.

Normally you can only run a single gzipped input file through a single
mapper and thus only on a single CPU core.
What I created makes it possible to process a Gzipped file in such a way
that it can run on several mappers in parallel.

I've put the javadoc I created on my homepage so you can read more about
the details.
http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec

Now the question that was raised by one of the people reviewing this code
was: should this implementation be part of the core Hadoop feature set?
The main reason given was that this needs a bit more understanding
of what is happening and as such cannot be enabled by default.

I would like to hear from the Hadoop Core/Map reduce users what you think.

Should this be
- a part of the default Hadoop feature set so that anyone can simply enable
it by setting the right configuration?
- a separate library?
- a nice idea I had fun building but that no one needs?
- ... ?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Spilled Records

2012-02-28 Thread Jie Li
Hello Dan,

The fact that the spilled records are double the output records means
the map task produced more than one spill file; these spill files are then
read, merged and written out to a single file, so each record is spilled
twice.
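
Not part of the explanation above, but for reference: in 0.20.x the map-side spill behaviour is controlled by a few job properties, so the extra spill/merge pass can often be avoided by giving the sort buffer more room. A sketch; the values are purely illustrative and must fit inside the task heap.

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static void tune(JobConf conf) {
    // In-memory map output buffer size in MB; a larger buffer means fewer spills.
    conf.setInt("io.sort.mb", 256);
    // Fraction of the buffer that may fill before a background spill starts.
    conf.setFloat("io.sort.spill.percent", 0.90f);
    // Number of spill segments merged at once in the final merge.
    conf.setInt("io.sort.factor", 25);
  }
}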

I can't infer anything from the numbers of the two tasks. Could you provide
more info, such as what the application is doing?

If you like, you can also try our tool Starfish to see what's going on
behind.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish


On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista 
daniel.bapti...@performgroup.com wrote:

 Hi All,

 I am trying to improve the performance of my hadoop cluster and would like
 to get some feedback on a couple of numbers that I am seeing.

 Below is the output from a single task (1 of 16) that took 3 mins 40
 Seconds

 FileSystemCounters
 FILE_BYTES_READ 214,653,748
 HDFS_BYTES_READ 67,108,864
 FILE_BYTES_WRITTEN 429,278,388

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,221,478
 Spilled Records 4,442,956
 Map output bytes 210,196,148
 Combine input records 0
 Map output records 2,221,478

 And another task in the same job (16 of 16) that took 7 minutes and 19
 seconds

 FileSystemCounters
 FILE_BYTES_READ 199,003,192
 HDFS_BYTES_READ 58,434,476
 FILE_BYTES_WRITTEN 397,975,310

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,086,789
 Spilled Records 4,173,578
 Map output bytes 194,813,958
 Combine input records 0
 Map output records 2,086,789

 Can anybody determine anything from these figures?

 The first task is twice as quick as the second yet the input and output
 are comparable (certainly not double). In all of the tasks (in this and
 other jobs) the spilled records are always double the output records; that
 can't be 'normal', can it?

 Am I clutching at straws (it feels like I am).

 Thanks in advance, Dan.




RE: Spilled Records

2012-02-28 Thread Daniel Baptista
Hi Jie,

To be honest I don't think I understand enough of what our job is doing to be 
able to explain it. 

Thanks for the response though, I had figured that I was grasping at straws.

I have looked at Starfish, however all our jobs are submitted via Apache Pig so 
I don't know if it would be much good.

Thanks again, Dan. 

-Original Message-
From: Jie Li [mailto:ji...@cs.duke.edu] 
Sent: 28 February 2012 16:35
To: common-user@hadoop.apache.org
Subject: Re: Spilled Records

Hello Dan,

The fact that the spilled records are double as the output records means
the map task produces more than one spill file, and these spill files are
read, merged and written to a single file, thus each record is spilled
twice.

I can't infer anything from the numbers of the two tasks. Could you provide
more info, such as what the application is doing?

If you like, you can also try our tool Starfish to see what's going on
behind.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish


On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista 
daniel.bapti...@performgroup.com wrote:

 Hi All,

 I am trying to improve the performance of my hadoop cluster and would like
 to get some feedback on a couple of numbers that I am seeing.

 Below is the output from a single task (1 of 16) that took 3 mins 40
 Seconds

 FileSystemCounters
 FILE_BYTES_READ 214,653,748
 HDFS_BYTES_READ 67,108,864
 FILE_BYTES_WRITTEN 429,278,388

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,221,478
 Spilled Records 4,442,956
 Map output bytes 210,196,148
 Combine input records 0
 Map output records 2,221,478

 And another task in the same job (16 of 16) that took 7 minutes and 19
 seconds

 FileSystemCounters
 FILE_BYTES_READ 199,003,192
 HDFS_BYTES_READ 58,434,476
 FILE_BYTES_WRITTEN 397,975,310

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,086,789
 Spilled Records 4,173,578 Map output bytes
 194,813,958
 Combine input records 0 Map output records 2,086,789

 Can anybody determine anything from these figures?

 The first task is twice as quick as the second yet the input and output
 are comparable (certainly not double). In all of the tasks (in this and
 other jobs) the spilled records are always double the output records, this
 can't be 'normal'?

 Am I clutching at straws (it feels like I am).

 Thanks in advance, Dan.




Hadoop and Hibernate

2012-02-28 Thread Geoffry Roberts
All,

I am trying to use Hibernate within my reducer and it goeth not well.  Has
anybody ever successfully done this?

I have a java package that contains my Hadoop driver, mapper, and reducer
along with a persistence class.  I call Hibernate from the cleanup() method
in my reducer class.  It complains that it cannot find the persistence
class.  The class is in the same package as the reducer and this all would
work outside of Hadoop. The error is thrown when I attempt to begin a
transaction.

The error:

org.hibernate.MappingException: Unknown entity: qq.mob.depart.EpiState

The code:

protected void cleanup(Context ctx) throws IOException,
   InterruptedException {
...
org.hibernate.cfg.Configuration cfg = new org.hibernate.cfg.Configuration();
SessionFactory sessionFactory =
cfg.configure("hibernate.cfg.xml").buildSessionFactory();
cfg.addAnnotatedClass(EpiState.class); // This class is in the same
package as the reducer.
Session session = sessionFactory.openSession();
Transaction tx = session.getTransaction();
tx.begin(); //Error is thrown here.
...
}

If I create an executable jar file that contains all dependencies required
by the MR job do all said dependencies get distributed to all nodes?

If I specify but one reducer, which node in the cluster will the reducer
run on?

Thanks
-- 
Geoffry Roberts


Re: Spilled Records

2012-02-28 Thread Jie Li
Hi Dan,

You might want to post your Pig script to the Pig user mailing list.
Previously I did some experiments on Pig and Hive and I'll also be
interested in looking into your script.

Yeah Starfish now only supports Hadoop job-level tuning, and supporting
workflow like Pig and Hive is our top priority. We'll let you know once
we're ready.

Thanks,
Jie

On Tue, Feb 28, 2012 at 11:57 AM, Daniel Baptista 
daniel.bapti...@performgroup.com wrote:

 Hi Jie,

 To be honest I don't think I understand enough of what our job is doing to
 be able to explain it.

 Thanks for the response though, I had figured that I was grasping at
 straws.

 I have looped at Starfish however all our jobs are submitted via Apache
 Pig so I don't know if it would be much good.

 Thanks again, Dan.

 -Original Message-
 From: Jie Li [mailto:ji...@cs.duke.edu]
 Sent: 28 February 2012 16:35
 To: common-user@hadoop.apache.org
 Subject: Re: Spilled Records

 Hello Dan,

 The fact that the spilled records are double as the output records means
 the map task produces more than one spill file, and these spill files are
 read, merged and written to a single file, thus each record is spilled
 twice.

 I can't infer anything from the numbers of the two tasks. Could you provide
 more info, such as what the application is doing?

 If you like, you can also try our tool Starfish to see what's going on
 behind.

 Thanks,
 Jie
 --
 Starfish is an intelligent performance tuning tool for Hadoop.
 Homepage: www.cs.duke.edu/starfish/
 Mailing list: http://groups.google.com/group/hadoop-starfish


 On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista 
 daniel.bapti...@performgroup.com wrote:

  Hi All,
 
  I am trying to improve the performance of my hadoop cluster and would
 like
  to get some feedback on a couple of numbers that I am seeing.
 
  Below is the output from a single task (1 of 16) that took 3 mins 40
  Seconds
 
  FileSystemCounters
  FILE_BYTES_READ 214,653,748
  HDFS_BYTES_READ 67,108,864
  FILE_BYTES_WRITTEN 429,278,388
 
  Map-Reduce Framework
  Combine output records 0
  Map input records 2,221,478
  Spilled Records 4,442,956
  Map output bytes 210,196,148
  Combine input records 0
  Map output records 2,221,478
 
  And another task in the same job (16 of 16) that took 7 minutes and 19
  seconds
 
  FileSystemCounters
  FILE_BYTES_READ 199,003,192
  HDFS_BYTES_READ 58,434,476
  FILE_BYTES_WRITTEN 397,975,310
 
  Map-Reduce Framework
  Combine output records 0
  Map input records 2,086,789
  Spilled Records 4,173,578 Map output bytes
  194,813,958
  Combine input records 0 Map output records 2,086,789
 
  Can anybody determine anything from these figures?
 
  The first task is twice as quick as the second yet the input and output
  are comparable (certainly not double). In all of the tasks (in this and
  other jobs) the spilled records are always double the output records,
 this
  can't be 'normal'?
 
  Am I clutching at straws (it feels like I am).
 
  Thanks in advance, Dan.
 
 




Re: Hadoop and Hibernate

2012-02-28 Thread Owen O'Malley
On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts
geoffry.robe...@gmail.com wrote:

 If I create an executable jar file that contains all dependencies required
 by the MR job do all said dependencies get distributed to all nodes?

You can make a single jar and that will be distributed to all of the
machines that run the task, but it is better in most cases to use the
distributed cache.

See 
http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
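
A minimal sketch of the distributed-cache route mentioned above, for the Hibernate case in this thread; the jars must already exist in HDFS at the given paths, and every path and file name here is a placeholder.

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void addDependencies(JobConf conf) throws Exception {
    // Jars previously uploaded to HDFS; added to every task's classpath.
    DistributedCache.addFileToClassPath(new Path("/libs/hibernate3.jar"), conf);
    DistributedCache.addFileToClassPath(new Path("/libs/my-entities.jar"), conf);
    // Plain side files (not on the classpath) can be shipped the same way.
    DistributedCache.addCacheFile(new Path("/config/hibernate.cfg.xml").toUri(), conf);
  }
}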

 If I specify but one reducer, which node in the cluster will the reducer
 run on?

The scheduling is done by the JobTracker and it isn't possible to
control the location of the reducers.

-- Owen


Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
I commented out both the reducer and the combiner and I still see the same
exception. Could it be because I have 2 jars being added?

On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting invocation exception and I don't see any
 more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = new JobConf(FormMLProcessor.class);
 
  conf.addResource("hdfs-site.xml");
 
  conf.addResource("core-site.xml");
 
  conf.addResource("mapred-site.xml");
 
  conf.set("mapred.reduce.tasks", "0");
 
  conf.setJobName("mlprocessor");
 
  DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"),
  conf);
 
  DistributedCache.addFileToClassPath(new Path("/jars/common.jar"),
  conf);
 
  conf.setOutputKeyClass(Text.class);
 
  conf.setOutputValueClass(Text.class);
 
  conf.setMapperClass(Map.class);
 
  conf.setCombinerClass(Reduce.class);
 
  conf.setReducerClass(IdentityReducer.class);
 

 Why would you set the Reducer when the number of reducers is set to zero.
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.class);
 
  conf.setOutputFormat(TextOutputFormat.class);
 
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
 
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
  JobClient.runJob(conf);
 
  -
  *
 
  java.lang.RuntimeException*: Error in configuring object
 
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
  ReflectionUtils.java:93*)
 
  at
  org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
 
  at org.apache.hadoop.util.ReflectionUtils.newInstance(*
  ReflectionUtils.java:117*)
 
  at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
 
  at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
 
  at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
 
  at java.security.AccessController.doPrivileged(*Native Method*)
 
  at javax.security.auth.Subject.doAs(*Subject.java:396*)
 
  at org.apache.hadoop.security.UserGroupInformation.doAs(*
  UserGroupInformation.java:1157*)
 
  at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
 
  Caused by: *java.lang.reflect.InvocationTargetException
  *
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(*
  NativeMethodAccessorImpl.java:39*)
 
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 



Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
It looks like adding this line causes the invocation exception. I looked in
hdfs and I can see the file at that path:

DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);

I have similar code for another jar,
DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"),
conf); and that one works just fine.


On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I commented reducer and combiner both and still I see the same exception.
 Could it be because I have 2 jars being added?

   On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting invocation exception and I don't see any
 more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = *new* JobConf(FormMLProcessor.*class*);
 
  conf.addResource(hdfs-site.xml);
 
  conf.addResource(core-site.xml);
 
  conf.addResource(mapred-site.xml);
 
  conf.set(mapred.reduce.tasks, 0);
 
  conf.setJobName(mlprocessor);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
  conf);
 
  conf.setOutputKeyClass(Text.*class*);
 
  conf.setOutputValueClass(Text.*class*);
 
  conf.setMapperClass(Map.*class*);
 
  conf.setCombinerClass(Reduce.*class*);
 
  conf.setReducerClass(IdentityReducer.*class*);
 

 Why would you set the Reducer when the number of reducers is set to zero.
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
 
  conf.setOutputFormat(TextOutputFormat.*class*);
 
  FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
 
  FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
 
  JobClient.*runJob*(conf);
 
  -
  *
 
  java.lang.RuntimeException*: Error in configuring object
 
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
  ReflectionUtils.java:93*)
 
  at
 
 org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
 
  at org.apache.hadoop.util.ReflectionUtils.newInstance(*
  ReflectionUtils.java:117*)
 
  at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
 
  at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
 
  at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
 
  at java.security.AccessController.doPrivileged(*Native Method*)
 
  at javax.security.auth.Subject.doAs(*Subject.java:396*)
 
  at org.apache.hadoop.security.UserGroupInformation.doAs(*
  UserGroupInformation.java:1157*)
 
  at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
 
  Caused by: *java.lang.reflect.InvocationTargetException
  *
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(*
  NativeMethodAccessorImpl.java:39*)
 
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 





Re: How to modify hadoop-wordcount example to display File-wise results.

2012-02-28 Thread orayvah

Hi Srilatha,

I know this thread is quite old, but I need your help with this.

I'm also interested in making some modifications to the hadoop Sort example.
Could you please give me pointers on how to rebuild hadoop to reflect the
changes made in the source?

I'm new to hadoop and would really appreciate your assistance.



us latha wrote:
 
 Greetings!
 
 Hi, I am trying to modify the WordCount.java mentioned at Example: WordCount v1.0
 (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0)
 at http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
 I would like to have the output in the following way,
 
 FileOne    word1  itsCount
 FileOne    word2  itsCount
   ..(and so on)
 FileTwo    word1  itsCount
 FileTwo    wordx  itsCount
  ..
 FileThree  word1  itsCount
  ..
 
 Am trying to do following changes to the code of WordCount.java
 
 1)  private Text filename = new Text();  // Added this to Map class .Not
 sure if I would have access to filename here.
 2)  (line 18) OutputCollector<Text, Text, IntWritable> output  // Changed the
 argument in the map() function to have another Text field.
 3)  (line 23) output.collect(filename, word , one); // Trying to change
 the
 output format as 'filename word count'
 
 Am not sure what other changes are to be affected to achieve the required
 output. filename is not available to the map method.
 My requirement is to go through all the data available in hdfs and prepare
 an index file with  filename word count  format.
 Could you please throw light on how I can achieve this.
 
 Thankyou
 Srilatha
 
 

-- 
View this message in context: 
http://old.nabble.com/How-to-modify-hadoop-wordcount-example-to-display-File-wise-results.-tp19826857p33410747.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the same,
but surprisingly the map reduce job that I submit is 100x slower. For pig I use a
udf and for hadoop I use a mapper only, with the same logic as the pig udf. Even the
splits on the admin page are the same. Not sure why it's so slow. I am
submitting the job like:

java -classpath
.:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
com.services.dp.analytics.hadoop.mapred.FormMLProcessor
/examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
/examples/output1/

How should I go about looking the root cause of why it's so slow? Any
suggestions would be really appreciated.



One of the things I noticed is that on the admin page of map task list I
see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but
for pig the status is blank.


Re: 100x slower mapreduce compared to pig

2012-02-28 Thread Prashant Kommireddi
It would be great if we can take a look at what you are doing in the UDF vs
the Mapper.

100x slower does not make sense for the same job/logic; it's either the Mapper
code, or maybe the cluster was busy at the time you scheduled the MapReduce job?

Thanks,
Prashant

On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am comparing runtime of similar logic. The entire logic is exactly same
 but surprisingly map reduce job that I submit is 100x slow. For pig I use
 udf and for hadoop I use mapper only and the logic same as pig. Even the
 splits on the admin page are same. Not sure why it's so slow. I am
 submitting job like:

 java -classpath

 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
 com.services.dp.analytics.hadoop.mapred.FormMLProcessor

 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
 /examples/output1/

 How should I go about looking the root cause of why it's so slow? Any
 suggestions would be really appreciated.



 One of the things I noticed is that on the admin page of map task list I
 see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but
 for pig the status is blank.



[blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Jason Trost
Blog post for anyone who's interested.  I cover a basic howto for
getting Nutch to use Apache Gora to store web crawl data in Accumulo.

Let me know if you have any questions.

Accumulo, Nutch, and GORA
http://www.covert.io/post/18414889381/accumulo-nutch-and-gora

--Jason


toward Rack-Awareness approach

2012-02-28 Thread Patai Sangbutsarakum
Hi Hadoopers,

Currently I am running hadoop version 0.20.203 in production with 600 TB in her.
I am planning to enable rack awareness in my production cluster, but I haven't
seen it through yet.

plan/questions.

1. I have script that can solve datanode/tasktracker IP to rack name.
2. Add topology.script.file.name in hdfs-site.xml and restart cluster.
3. After the cluster comes back, my questions start here:
- do I have to run the balancer or fsck or some other command to have those
600 TB redistributed across racks in one pass?
- currently I run the balancer for 2 hrs. every day; can I keep this
routine and hope that at some point the data will be nicely
redistributed and aware of rack location?
- how can we know that the data in the cluster is now fully rack
aware?
- if I just add the script and run the balancer 2 hrs every day, then before
the whole data set becomes rack aware the data will be kind
  of mixed between the default-rack placement of existing data (not yet
balanced) and the probably rack-aware placement of newly loaded data.
  Is it OK to have a mix of default-rack and rack-specific data together?

4. Thoughts?

Hope this makes sense,

Thanks in advance
Patai


Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Mattmann, Chris A (388J)
UMMM wow!

That's awesome Jason! Thanks so much!

Cheers,
Chris

On Feb 28, 2012, at 5:41 PM, Jason Trost wrote:

 Blog post for anyone who's interested.  I cover a basic howto for
 getting Nutch to use Apache Gora to store web crawl data in Accumulo.
 
 Let me know if you have any questions.
 
 Accumulo, Nutch, and GORA
 http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
 
 --Jason


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Enis Söztutar
Fabulous work!

There are obviously a lot of local modifications to be done for nutch +
gora + accumulo to work. So feel free to propose these to upstream nutch
and gora.

It should feel good to run the web crawl, and store the results on
accumulo.

Cheers,
Enis

On Tue, Feb 28, 2012 at 6:24 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 UMMM wow!

 That's awesome Jason! Thanks so much!

 Cheers,
 Chris

 On Feb 28, 2012, at 5:41 PM, Jason Trost wrote:

  Blog post for anyone who's interested.  I cover a basic howto for
  getting Nutch to use Apache Gora to store web crawl data in Accumulo.
 
  Let me know if you have any questions.
 
  Accumulo, Nutch, and GORA
  http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
 
  --Jason


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




Re: Invocation exception

2012-02-28 Thread Harsh J
Mohit,

If you visit the failed task attempt on the JT Web UI, you can see the
complete, informative stack trace there. It points to the exact line where
the trouble came up and what the real error during the configure phase of
task initialization was.

A simple attempts page goes like the following (replace job ID and
task ID of course):

http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964&tipid=task_201202041249_3964_m_00

Once there, find and open the All logs link to see stdout, stderr,
and syslog of the specific failed task attempt. You'll have more info
sifting through this to debug your issue.

This is also explained in Tom's book under the title Debugging a Job
(p154, Hadoop: The Definitive Guide, 2nd ed.).

On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 It looks like adding this line causes the invocation exception. I looked in
 hdfs and I can see that file at that path:

 DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);

 I have similar code for another jar,
 DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"),
 conf); and that one works just fine.


 On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

  I commented out both the reducer and the combiner and I still see the same
  exception. Could it be because I have 2 jars being added?

  On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.comwrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting an invocation exception and I don't see any
  more details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = new JobConf(FormMLProcessor.class);
  conf.addResource("hdfs-site.xml");
  conf.addResource("core-site.xml");
  conf.addResource("mapred-site.xml");
  conf.set("mapred.reduce.tasks", "0");
  conf.setJobName("mlprocessor");
  DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);
  DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(Text.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(IdentityReducer.class);
 

 Why would you set the Reducer when the number of reducers is set to zero?
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
 
  -
  java.lang.RuntimeException: Error in configuring object
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
  at org.apache.hadoop.mapred.Child.main(Child.java:264)
  Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 






-- 
Harsh J
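
To illustrate Subir's point about the reducer settings, here is a minimal
map-only rewrite of the configuration quoted above: setNumReduceTasks(0) with
the combiner and reducer lines dropped, plus a simple existence check on the
cached jars, since a jar missing from the distributed cache is one plausible
way to end up with an opaque InvocationTargetException while the task is
being configured. FormMLProcessor and Map are the classes from the original
post and are assumed to be on the job classpath; the check is only a sanity
test, not a confirmed diagnosis.

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class MapOnlyJobSketch {
  public static void main(String[] args) throws IOException {
    // FormMLProcessor and its Map mapper come from the original post and are
    // assumed to be in the same package as this sketch.
    JobConf conf = new JobConf(FormMLProcessor.class);
    conf.setJobName("mlprocessor");

    // Fail fast if a cached jar is missing from HDFS; a jar that cannot be
    // localized otherwise only surfaces while the task is being configured.
    FileSystem fs = FileSystem.get(conf);
    for (String jar : new String[] { "/jars/analytics.jar", "/jars/common.jar" }) {
      Path p = new Path(jar);
      if (!fs.exists(p)) {
        throw new IOException("Distributed cache jar not found: " + p);
      }
      DistributedCache.addFileToClassPath(p, conf);
    }

    // Map-only job: no combiner, no reducer, zero reduce tasks.
    conf.setMapperClass(Map.class);
    conf.setNumReduceTasks(0);

    conf.setInputFormat(SequenceFileAsTextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}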


Re: Browse the filesystem weblink broken after upgrade to 1.0.0: HTTP 404 Problem accessing /browseDirectory.jsp

2012-02-28 Thread madhu phatak
Hi,
 Just make sure that the datanode is up by looking into the datanode logs.

On Sun, Feb 19, 2012 at 10:52 PM, W.P. McNeill bill...@gmail.com wrote:

 I am running in pseudo-distributed on my Mac and just upgraded from
 0.20.203.0 to 1.0.0. The web interface for HDFS which was working in
 0.20.203.0 is broken in 1.0.0.

 HDFS itself appears to work: a command line like hadoop fs -ls / returns
 a result, and the namenode web interface at
 http://localhost:50070/dfshealth.jsp comes up. However, when I click on the
 Browse the filesystem link on this page I get a 404 Error. The error
 message displayed in the browser reads:

 Problem accessing /browseDirectory.jsp. Reason:
/browseDirectory.jsp

 The URL in the browser bar at this point is
 http://0.0.0.0:50070/browseDirectory.jsp?namenodeInfoPort=50070&dir=/.
 The HTML source of the link on the main namenode page is
 <a href="/nn_browsedfscontent.jsp">Browse the filesystem</a>. If I change the
 server location from 0.0.0.0 to localhost in my browser bar I get the same
 error.

 I updated my configuration files in the new hadoop 1.0.0 conf directory to
 transfer over my settings from 0.20.203.0. My conf/slaves file consists of
 the line localhost.  I ran hadoop-daemon.sh start namenode -upgrade
 once when prompted by errors in the namenode logs. After that, neither the
 namenode nor the datanode logs contain any errors.

 For what it's worth, I've verified that the bug occurs on Firefox, Chrome,
 and Safari.

 Any ideas on what is wrong or how I should go about further debugging it?




-- 
Join me at http://hadoopworkshop.eventbrite.com/
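
A quick programmatic way to confirm whether any datanodes have actually
registered with the namenode (roughly the same information hadoop dfsadmin
-report prints): the Browse the filesystem link redirects to a live datanode,
so zero live datanodes is one simple thing to rule out. A small sketch
against the 0.20/1.0 API, with a made-up class name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class LiveDatanodeCheck {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      System.err.println("Not talking to HDFS: " + fs.getUri());
      return;
    }

    // List every datanode the namenode currently knows about.
    DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
    System.out.println("Datanodes reported by the namenode: " + nodes.length);
    for (DatanodeInfo node : nodes) {
      System.out.println("  " + node.getName());
    }
  }
}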


Re: Invocation exception

2012-02-28 Thread Subir S
Sorry I missed this email.
Harsh's answer is apt. Please see the error log in the JobTracker web UI for
the failed tasks (mapper/reducer) to find the exact reason.

On Tue, Feb 28, 2012 at 10:23 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Does it matter if the reducer is set even when the number of reducers is 0?
 Is there a way to get a clearer reason?

 On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com
 wrote:

  On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   For some reason I am getting an invocation exception and I don't see any
   more details other than this exception:
  
   My job is configured as:
  
  
   JobConf conf = new JobConf(FormMLProcessor.class);
   conf.addResource("hdfs-site.xml");
   conf.addResource("core-site.xml");
   conf.addResource("mapred-site.xml");
   conf.set("mapred.reduce.tasks", "0");
   conf.setJobName("mlprocessor");
   DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);
   DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);
   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(Text.class);
   conf.setMapperClass(Map.class);
   conf.setCombinerClass(Reduce.class);
   conf.setReducerClass(IdentityReducer.class);
  
 
  Why would you set the Reducer when the number of reducers is set to zero?
  Not sure if this is the real cause.
 
 
  
   conf.setInputFormat(SequenceFileAsTextInputFormat.class);
   conf.setOutputFormat(TextOutputFormat.class);
   FileInputFormat.setInputPaths(conf, new Path(args[0]));
   FileOutputFormat.setOutputPath(conf, new Path(args[1]));
   JobClient.runJob(conf);
  
   -
   java.lang.RuntimeException: Error in configuring object
   at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
   at org.apache.hadoop.mapred.Child.main(Child.java:264)
   Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
  
 



Re: namenode null pointer

2012-02-28 Thread madhu phatak
Hi,
 This may be an issue with the namenode not being correctly formatted.

On Sat, Feb 18, 2012 at 1:50 PM, Ben Cuthbert bencuthb...@ymail.com wrote:

 All sometimes when I startup my hadoop I get the following error

 12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host =iMac.local/192.168.0.191
 STARTUP_MSG: args = []
 STARTUP_MSG: version = 0.20.203.0
 STARTUP_MSG: build =
 http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
  1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
 /
 12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started:
 Cannot locate configuration: tried hadoop-metrics2-namenode.properties,
 hadoop-metrics2.properties
 2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from
 SCDynamicStore
 12/02/17 10:29:57 INFO util.GSet: VM type = 64-bit
 12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
 12/02/17 10:29:57 INFO util.GSet: capacity = 2^21 = 2097152 entries
 12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
 12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
 12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
 12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
 12/02/17 10:29:57 INFO namenode.FSNamesystem:
 dfs.block.invalidate.limit=100
 12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false
 accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
 12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more
 than 10 times
 12/02/17 10:29:57 INFO common.Storage: Number of files = 190
 12/02/17 10:29:57 INFO common.Storage: Number of files under construction
 = 0
 12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in
 0 seconds.
 12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
 at
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)

 12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
 /




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: What determines task attempt list URLs?

2012-02-28 Thread madhu phatak
Hi,
 It's better to use hostnames rather than IP addresses. If you use
hostnames, the task attempt URL will contain the hostname rather than
localhost.

On Fri, Feb 17, 2012 at 10:52 PM, Keith Wiley kwi...@keithwiley.com wrote:

 What property or setup parameter determines the URLs displayed on the task
 attempts webpage of the job/task trackers?  My cluster seems to be
 configured such that all URLs for higher pages (the top cluster admin page,
 the individual job overview page, and the map/reduce task list page) show
 URLs by ip address, but the lowest page (the task attempt list for a single
 task) shows the URLs for the Machine and Task Logs columns by localhost,
 not by ip address (although the Counters column still uses the ip address
 just like URLs on all the higher pages).

 The localhost links obviously don't work (the cluster is not on the
 local machine, it's on Tier 3)...unless I just happen to have a cluster
 also running on my local machine; then the links work but obviously they go
 to my local machine and thus describe a completely unrelated Hadoop
 cluster!!!  It goes without saying, that's ridiculous.

 So to get it to work, I have to manually copy/paste the ip address into
 the URLs every time I want to view those pages...which makes it incredibly
 tedious to view the task logs.

 I've asked this a few times now and have gotten no response.  Does no one
 have any idea how to properly configure Hadoop to get around this?  I've
 experimented with the mapred-site.xml mapred.job.tracker and
 mapred.task.tracker.http.address properties to no avail.

 What's going on here?

 Desperate


 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com

 I used to be with it, but then they changed what it was.  Now, what I'm
 with
 isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson

 




-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: namenode null pointer

2012-02-28 Thread Ben Cuthbert
So the filesystem has become corrupted?

Regards

Ben
On 29 Feb 2012, at 05:51, madhu phatak wrote:

 Hi,
 This may be an issue with the namenode not being correctly formatted.
 
 On Sat, Feb 18, 2012 at 1:50 PM, Ben Cuthbert bencuthb...@ymail.com wrote:
 
 All sometimes when I startup my hadoop I get the following error
 
 12/02/17 10:29:56 INFO namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host =iMac.local/192.168.0.191
 STARTUP_MSG: args = []
 STARTUP_MSG: version = 0.20.203.0
 STARTUP_MSG: build =
 http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
  1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
 /
 12/02/17 10:29:56 WARN impl.MetricsSystemImpl: Metrics system not started:
 Cannot locate configuration: tried hadoop-metrics2-namenode.properties,
 hadoop-metrics2.properties
 2012-02-17 10:29:56.994 java[4065:1903] Unable to load realm info from
 SCDynamicStore
 12/02/17 10:29:57 INFO util.GSet: VM type = 64-bit
 12/02/17 10:29:57 INFO util.GSet: 2% max memory = 17.77875 MB
 12/02/17 10:29:57 INFO util.GSet: capacity = 2^21 = 2097152 entries
 12/02/17 10:29:57 INFO util.GSet: recommended=2097152, actual=2097152
 12/02/17 10:29:57 INFO namenode.FSNamesystem: fsOwner=scottsue
 12/02/17 10:29:57 INFO namenode.FSNamesystem: supergroup=supergroup
 12/02/17 10:29:57 INFO namenode.FSNamesystem: isPermissionEnabled=true
 12/02/17 10:29:57 INFO namenode.FSNamesystem:
 dfs.block.invalidate.limit=100
 12/02/17 10:29:57 INFO namenode.FSNamesystem: isAccessTokenEnabled=false
 accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
 12/02/17 10:29:57 INFO namenode.FSNamesystem: Registered
 FSNamesystemStateMBean and NameNodeMXBean
 12/02/17 10:29:57 INFO namenode.NameNode: Caching file names occuring more
 than 10 times
 12/02/17 10:29:57 INFO common.Storage: Number of files = 190
 12/02/17 10:29:57 INFO common.Storage: Number of files under construction
 = 0
 12/02/17 10:29:57 INFO common.Storage: Image file of size 26377 loaded in
 0 seconds.
 12/02/17 10:29:57 ERROR namenode.NameNode: java.lang.NullPointerException
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1113)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1125)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1028)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:205)
 at
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:613)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
 at
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
 at
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
 
 12/02/17 10:29:57 INFO namenode.NameNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at iMac.local/192.168.0.191
 /
 
 
 
 
 -- 
 Join me at http://hadoopworkshop.eventbrite.com/