Re: grahical tool for hadoop mapreduce

2009-06-26 Thread Kevin Weil
Some people at Sun have done some recent work on this -- see a blog post at
http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_and_performance,
and a subsequent post with more detail at
http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_monitoring_scripts.

Kevin

On Thu, Jun 25, 2009 at 7:28 PM, Manhee Jo j...@nttdocomo.com wrote:

 Hi,

 Do you know any graphical tools to show the progress of MapReduce using
 the job log under logs/history/ ? The web interface (namenode:50030) gives
 me something similar, but what I need is something more specific that shows
 the number of running map tasks and reduce tasks at given points in time,
 which I've seen in some papers. Any help would be appreciated.


 Thanks,
 Manhee




Re: Pregel

2009-06-26 Thread Owen O'Malley


On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote:

my guess, as good as anybody's, is that Pregel is to large graphs what
Hadoop is to large datasets.


I think it is much more likely a language that allows you to easily  
define fixed point algorithms.  I would imagine a distributed version  
of something similar to Michal Young's GenSet. http://portal.acm.org/citation.cfm?doid=586094.586108


I've been trying to figure out how to justify working on a project  
like that for a couple of years, but haven't yet. (I have a background  
in program static analysis, so I've implemented similar stuff.)



In other words, Pregel is the next natural step
for massively scalable computations after Hadoop.


I wonder if it uses map/reduce as a base or not. It would be easier to  
use map/reduce, but a direct implementation would be more performant.  
In either case, it is a new hammer. From what I see, it likely won't  
replace map/reduce, pig, or hive; but rather support a different class  
of applications much more directly than you can under map/reduce.


-- Owen



Re: What is the best way to use the Hadoop output data

2009-06-26 Thread Huy Phan
Can anybody help me with this? :)

On Thu, Jun 25, 2009 at 5:02 PM, Huy Phan dac...@gmail.com wrote:

 Hi everybody, I'm working on a Hadoop project that processes log
 files. In the reduce part, as usual, I store the output to HDFS, but I also
 want to send that output data to a message queue using an HTTP POST request.
 I'm wondering if there's any performance killer in this approach; I posted
 the question to the IRC channel and someone told me that there may be a
 bottleneck.
 Then I thought about running a cron task to get the output data and send it
 to the MQ, but I'm not sure that's the best way because it's not synchronized
 with the MapReduce process.
 I wonder if there is any way to spawn a process directly from Hadoop after
 all the MapReduce tasks finish?




Re: What is the best way to use the Hadoop output data

2009-06-26 Thread Zhong Wang
Hi Huy,

On Thu, Jun 25, 2009 at 6:02 PM, Huy Phandac...@gmail.com wrote:
 I'm wondering if there's any performance killer in this approach, I posted
 the question to IRC channel and someone told me that there may be a
 bottleneck.

Communication errors while posting your output data could block your
MapReduce job, so I think it's better to do this after the job is done.

 I wonder if there is any way to spawn a process directly from Hadoop after
 all the MapReduce tasks finish ?


How do you submit your jobs? You can block until the job finishes by calling
job.waitForCompletion(true) in your main driver class; then the two steps
run synchronously.
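
For illustration, a minimal sketch of that approach with the 0.20 mapreduce API.
This is only a sketch: the MQ endpoint URL, the class names, and the idea of
posting a pointer to the output directory are assumptions, not something from
the original thread.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "log-processing");
    job.setJarByClass(LogJobDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // ... set mapper/reducer and key/value classes here ...

    // Blocks until the job completes, so anything after this line runs
    // strictly after all map and reduce tasks have finished.
    boolean ok = job.waitForCompletion(true);

    if (ok) {
      // Post a pointer to the HDFS output directory (or the data itself,
      // if it is small) to the message queue's HTTP endpoint.
      URL mq = new URL("http://mq.example.com/enqueue");   // hypothetical endpoint
      HttpURLConnection conn = (HttpURLConnection) mq.openConnection();
      conn.setRequestMethod("POST");
      conn.setDoOutput(true);
      OutputStream out = conn.getOutputStream();
      out.write(("output=" + args[1]).getBytes("UTF-8"));
      out.close();
      System.out.println("MQ responded: " + conn.getResponseCode());
    }
  }
}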


-- 
Zhong Wang


Performance hit by not splitting .bz2?

2009-06-26 Thread Erik Forsberg
Hi!

I have a case where we need to analyse logfiles. They are currently
compressed using bzip2, and an example logfile is roughly 105 MB
compressed, 720 MB uncompressed.

I'm considering using a Hadoop version with .bz2 support - probably
Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are
not split. 

I expect that for most jobs, the number of log files will exceed the
number of cores in my hadoop cluster.

Is it possible to estimate if I'll get a performance hit
because of the lack of splitting under these circumstances?

Thanks,
\EF
-- 
Erik Forsberg forsb...@opera.com
Developer, Opera Mini - http://www.opera.com/mini/


Re: Performance hit by not splitting .bz2?

2009-06-26 Thread Zhong Wang
Hi Erik,

On Fri, Jun 26, 2009 at 4:24 PM, Erik Forsbergforsb...@opera.com wrote:

 I'm considering using a Hadoop version with .bz2 support - probably
 Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are
 not split.

Yes, bzip2-compressed files are not splittable in current versions; splitting
support may arrive in a future version. You may be interested in this patch:
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel.

 I expect that for most jobs, the number of log files will exceed the
 number of cores in my hadoop cluster.

 Is it possible to estimate if I'll get a performance hit
 because of the lack of splitting under these circumstances?

Because the bzip2 files are not split, each map task has to process an
entire 720 MB file. Even though the number of your log files may exceed the
number of cores in your cluster, such large inputs will hurt load balancing.


-- 
Zhong Wang


Re: Pregel

2009-06-26 Thread Edward J. Yoon
According to my understanding, Pregel sits in the same layer as MR; it is
not an MR-based language processor.

I think BSP's 'collective communication' is the core of the matter. For
example, this BFS problem
(http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html)
can be solved at once without MR iterations.

On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malleyomal...@apache.org wrote:

 On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote:

 my guess, as good as anybody's, is that Pregel is to large graphs is what
 Hadoop is to large datasets.

 I think it is much more likely a language that allows you to easily define
 fixed point algorithms.  I would imagine a distributed version of something
 similar to Michal Young's GenSet.
 http://portal.acm.org/citation.cfm?doid=586094.586108

 I've been trying to figure out how to justify working on a project like that
 for a couple of years, but haven't yet. (I have a background in program
 static analysis, so I've implemented similar stuff.)

 In other words, Pregel is the next natural step
 for massively scalable computations after Hadoop.

 I wonder if it uses map/reduce as a base or not. It would be easier to use
 map/reduce, but a direct implementation would be more performant. In either
 case, it is a new hammer. From what I see, it likely won't replace
 map/reduce, pig, or hive; but rather support a different class of
 applications much more directly than you can under map/reduce.

 -- Owen





-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardy...@apache.org
http://blog.udanax.org


Re: Doing MapReduce over Har files

2009-06-26 Thread jchernandez

I also need help with this. I need to know how to handle a HAR file when it
is the input to a MapReduce task. How do we read the HAR file so we can work
on the individual logical files? I suppose we need to create our own
InputFormat and RecordReader classes, but I'm not sure how to proceed.

Julian 


Roshan James-3 wrote:
 
 When I run map reduce task over a har file as the input, I see that the
 input splits refer to 64mb byte boundaries inside the part file.
 
 My mappers only know how to process the contents of each logical file
 inside
 the har file. Is there some way by which I can take the offset range
 specified by the input split and determine which logical files lie in that
 offset range? (How else would one do map reduce over a har file?)
 
 Roshan
 
 

-- 
View this message in context: 
http://www.nabble.com/Doing-MapReduce-over-Har-files-tp24171216p24217500.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Permissions needed to run RandomWriter ?

2009-06-26 Thread stephen mulcahy

Hi,

I've just installed a new test cluster and I'm trying to give it a quick 
smoke test with RandomWriter and Sort.


I can run these fine with the superuser account. When I try to run them 
as another user I run into problems even though I've created the output 
directory and given permissions to the other user to write to this 
directory. i.e.


1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo
mkdir: org.apache.hadoop.fs.permission.AccessControlException: 
Permission denied: user=smulcahy, access=WRITE, 
inode=:hadoop:supergroup:rwxr-xr-x


OK - we don't have permissions anyways

2. had...@hadoop01:/$ hadoop fs -mkdir /foo

OK

3. hadoop fs -chown -R smulcahy /foo

OK

4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test

OK

5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar 
randomwriter /foo

java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1793)
at org.apache.hadoop.util.RunJar.main(RunJar.java:115)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Any suggestions on why step 5. is failing even though I have write 
permissions to /foo - do I need permissions on some other directory also 
or ... ?


Thanks,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: PIG and Hadoop

2009-06-26 Thread Thejas Nair
Hi Krishna,
 pig-u...@hadoop.apache.org is the right mailing list for pig questions.

I assume your jar file has an embedded pig script. If your jar file includes
pig.jar, you don't need to specify pig.jar separately. Assuming the class
called YourClass has the main function, the command would look like:

java -cp YourJar.jar:$HADOOPSITEPATH YourClass

See the example in -
http://hadoop.apache.org/pig/docs/r0.2.0/quickstart.html
-Thejas


On 6/25/09 10:13 PM, krishna prasanna svk_prasa...@yahoo.com wrote:

 Hi,
 
 Here is my scenario:
 
 1. I have a cluster of 3 machines.
 2. I have a jar file which includes pig.jar.
 
 How can I run a jar (instead of a Pig script file) in Hadoop mode?
 
 For running a script file in Hadoop mode I use:
 java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main
 script1-hadoop.pig
 
 Any suggestions/pointers please?
 
 Apologies if I posted to the wrong alias.
 
 Thanks,
 Krishna.
 
 



Re: About fuse-dfs and NFS

2009-06-26 Thread Brian Bockelman

Hey Chris,

FUSE in general does not support NFS mounts well because it has a  
tendency to renumber inodes upon NFS restart, which causes clients to  
choke.


FUSE-DFS supports a limited range of write operations; it's possible  
that your application is trying to use write functionality that is not  
supported.


Brian

On Jun 26, 2009, at 2:57 AM, XuChris wrote:




Hi,

I mount HDFS onto a directory on localhost with fuse-dfs, and then
export that directory.

When I access the directory over NFS, I can read data from it,
but I cannot write data to it. Why?
Now I want to know whether fuse-dfs supports write operations over
NFS or not.

Who can help me? Thank you very much.

My system configuration:
OS: Fedora release 8 (kernel 2.6.23.1)
For NFS, the fuse module has been updated to version 2.7.4
Fuse: 2.7.4
Hadoop: 0.19.1

Best regards.

Chris
2009-6-26
_
打工,挣钱,买房子,快来MClub一起”金屋藏娇”!
http://club.msn.cn/?from=10




Re: grahical tool for hadoop mapreduce

2009-06-26 Thread Tom Wheeler
Although it may not support your specific need for log files, I just
happened to run across this link today and thought it was relevant for
a thread about GUI tools for Hadoop:

   http://www.hadoopstudio.org/

It's a plugin for working visually with Hadoop in NetBeans.  The page
describes it as an alpha release, and while I haven't tried it out
yet, the screenshot at least looks very promising.

On Thu, Jun 25, 2009 at 9:28 PM, Manhee Joj...@nttdocomo.com wrote:
 Do you know any graphical tools to show the progress of mapreduce using
 the job log under logs/history/ ? The web interface (namenode:50030) gives
 me similar one. But what I need is more specific ones that show the number
 of total running map tasks and reduce tasks at some points of time,
 which I've seen from some papers. Any help would be appreciated.

-- 
Tom Wheeler
http://www.tomwheeler.com/


Error while trying to run map/reduce job

2009-06-26 Thread Usman Waheed

Hi All,

On one of the test clusters, when I try to launch a map/reduce job it fails
with the following error.

I am getting the following error in my jobtracker.log on the namenode:

2009-06-26 15:20:12,811 INFO org.apache.hadoop.mapred.JobTracker: Adding 
task 'attempt_200906261401_0005_m_01_0' to tip 
task_200906261401_0005_m_01, for tracker 
'tracker_datanode1:localhost/127.0.0.1:33748'
2009-06-26 15:20:14,016 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from attempt_200906261401_0005_m_01_0: java.io.IOException: 
Task process exit with nonzero status of 1.

   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

My tasktracker log on datanode1 is reporting the following for the
attempt noted above:

2009-06-26 15:20:13,449 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction: attempt_200906261401_0005_m_01_0
2009-06-26 15:20:13,700 WARN org.apache.hadoop.mapred.TaskRunner: 
attempt_200906261401_0005_m_01_0 Child Error

java.io.IOException: Task process exit with nonzero status of 1.
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
2009-06-26 15:20:14,656 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction: attempt_200906261401_0005_m_02_0
2009-06-26 15:20:14,811 WARN org.apache.hadoop.mapred.TaskRunner: 
attempt_200906261401_0005_m_02_0 Child Error

java.io.IOException: Task process exit with nonzero status of 1.
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

Seems to be some problem with the job not being able to start on the datanode(s).
I ran hadoop fsck and the system is healthy. I checked the namenode.log
and no errors are being reported there either. These errors happen when I
submit a job to the cluster.


Any clues or comments please?

Thanks,
Usman





Re: grahical tool for hadoop mapreduce

2009-06-26 Thread Mark Kerzner
Tom, this is right on time! Bravo, Karmasphere.
I installed the plugins, and nothing crashed - in fact, I get the same
screens as the manual promises.

It is worth reading this group - they released the plugin two days ago.

Mark

On Fri, Jun 26, 2009 at 10:13 AM, Tom Wheeler tomwh...@gmail.com wrote:

 Although it may not support your specific need for log files, I just
 happened to run across this link today and thought it was relevant for
 a thread about GUI tools for Hadoop:

   http://www.hadoopstudio.org/

 It's a plugin for working visually with Hadoop in NetBeans.  The page
 describes it as an alpha release, and while I haven't tried it out
 yet, the screenshot at least looks very promising.

 On Thu, Jun 25, 2009 at 9:28 PM, Manhee Joj...@nttdocomo.com wrote:
  Do you know any graphical tools to show the progress of mapreduce using
  the job log under logs/history/ ? The web interface (namenode:50030)
 gives
  me similar one. But what I need is more specific ones that show the
 number
  of total running map tasks and reduce tasks at some points of time,
  which I've seen from some papers. Any help would be appreciated.

 --
 Tom Wheeler
 http://www.tomwheeler.com/



RE: hwo to read a text file in Map function until reaching specific line

2009-06-26 Thread Ramakishore Yelamanchilli
I think the map function gets the line number as the key. You can ignore the
other lines after key value 500.

Thanks

-Original Message-
From: Leiz [mailto:lzhan...@gmail.com] 
Sent: Friday, June 26, 2009 8:57 AM
To: core-user@hadoop.apache.org
Subject: hwo to read a text file in Map function until reaching specific
line


For example, I have a text file with 1000 lines.
I only want to read the first 500 lines of the file.
How can I do that in the Map function?

Thanks


-- 
View this message in context:
http://www.nabble.com/hwo-to-read-a-text-file-in-Map-function-until-reaching-specific-line-tp24222783p24222783.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Doing MapReduce over Har files

2009-06-26 Thread Mahadev Konar
Hi Roshan and Julian,
  The har file system can be used as an input filesystem. You can just
provide the input to map reduce as har:///something/some.har, where
some.har is your har archive. This way map reduce will use the har filesystem
as its input. The only limitation is that maps cannot run across logical files
in the har.

You can specify whatever input format these files have/had before you
included them in the har archive. The point is that har:/// can be used as
an input filesystem for map reduce, which gives map reduce a view of the
logical files inside the har.
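
For illustration, a minimal sketch of that setup with the old mapred API. The
archive path and the choice of TextInputFormat below are assumptions for the
example, not something from this thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class HarInputExample {
  public static void configureInput(JobConf conf) {
    // Use whatever input format the files had before they were archived.
    conf.setInputFormat(TextInputFormat.class);
    // Point the job at a logical directory (or file) inside the archive;
    // the har filesystem exposes the archived files as ordinary paths.
    FileInputFormat.setInputPaths(conf,
        new Path("har:///user/roshan/logs.har/2009/06"));
  }
}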

Hope this helps.
mahadev


On 6/26/09 2:37 AM, jchernandez jchernan...@agnitio.es wrote:

 
 I also need help with this. I need to know how to handle a HAR file when it
 is the input to a MapReduce task. How do we read the HAR file so we can work
 on the individual logical files? I suppose we need to create our own
 InputFormat and RecordReader classes, but I'm not sure how to proceed.
 
 Julian 
 
 
 Roshan James-3 wrote:
 
 When I run map reduce task over a har file as the input, I see that the
 input splits refer to 64mb byte boundaries inside the part file.
 
 My mappers only know how to process the contents of each logical file
 inside
 the har file. Is there some way by which I can take the offset range
 specified by the input split and determine which logical files lie in that
 offset range? (How else would one do map reduce over a har file?)
 
 Roshan
 
 



Re: hwo to read a text file in Map function until reaching specific line

2009-06-26 Thread Tarandeep Singh
The TextInputFormat gives the byte offset in the file as the key and the entire
line as the value, so it won't work for you.

You can modify NLineInputFormat to achieve what you want. NLineInputFormat
gives each mapper N lines (in your case N=500).

Since you are interested in only the first 500 lines of each file, the record
reader for your NLineInputFormat variant would be implemented roughly as
follows (a minimal code sketch is included below):

get the input split
check its start position
if the start position == 0
  read the first 500 lines
else
  you have got a file split that is in the middle of the file; don't bother to
  read anything, since the mapper reading from the beginning of the file
  already covers the first 500 lines. Just indicate that there is no more input.
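
A minimal sketch of such a record reader with the old mapred API. The class
name, the configurable limit, and the choice to wrap LineRecordReader are
assumptions for the example.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class FirstNLinesRecordReader implements RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate;
  private final boolean firstSplit;   // only the split starting at offset 0 reads
  private final int maxLines;
  private int linesRead = 0;

  public FirstNLinesRecordReader(JobConf job, FileSplit split, int maxLines)
      throws IOException {
    this.delegate = new LineRecordReader(job, split);
    this.firstSplit = (split.getStart() == 0);
    this.maxLines = maxLines;
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // Splits that do not start at the beginning of the file emit nothing;
    // the first split stops after maxLines lines.
    if (!firstSplit || linesRead >= maxLines) {
      return false;
    }
    linesRead++;
    return delegate.next(key, value);
  }

  public LongWritable createKey() { return delegate.createKey(); }
  public Text createValue() { return delegate.createValue(); }
  public long getPos() throws IOException { return delegate.getPos(); }
  public float getProgress() throws IOException { return delegate.getProgress(); }
  public void close() throws IOException { delegate.close(); }
}

Your NLineInputFormat variant's getRecordReader() would then return this reader
with maxLines set to 500.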

-Tarandeep

On Fri, Jun 26, 2009 at 10:35 AM, Ramakishore Yelamanchilli 
kyela...@cisco.com wrote:

 I think map function gets the line number as key. You can ignore te other
 lines after the key value 500.

 Thanks

 -Original Message-
 From: Leiz [mailto:lzhan...@gmail.com]
 Sent: Friday, June 26, 2009 8:57 AM
 To: core-user@hadoop.apache.org
 Subject: hwo to read a text file in Map function until reaching specific
 line


 For example , I have a text file with 1000 lines.
 I only want to read the first 500 line of the file.
 How can I do in Map function?

 Thanks


 --
 View this message in context:

 http://www.nabble.com/hwo-to-read-a-text-file-in-Map-function-until-reaching-specific-line-tp24222783p24222783.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Using addCacheArchive

2009-06-26 Thread Chris Curtin
Hi,

I've found it much easier to write the file to HDFS using the API and then pass
the path to the file in HDFS as a property. You'll need to remember to
clean up the file after you're done with it.

Example details are in this thread:
http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
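
For illustration, a minimal sketch of that pattern. The HDFS target path and
the property name are assumptions for the example.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ConfigViaHdfs {
  // Run on the client before submitting the job: copy the local config file
  // into HDFS and record its location in the job configuration.
  public static void stage(JobConf job) throws IOException {
    FileSystem fs = FileSystem.get(job);
    Path target = new Path("/tmp/myjob/file1.config");   // hypothetical location
    fs.copyFromLocalFile(new Path("/home/akhil1988/Config/file1.config"), target);
    job.set("myjob.config.path", target.toString());     // hypothetical property
  }

  // Call from the mapper's configure(JobConf) to open the staged file.
  public static FSDataInputStream open(JobConf job) throws IOException {
    Path p = new Path(job.get("myjob.config.path"));
    return FileSystem.get(job).open(p);                  // remember to close it
  }
}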

Hope this helps,

Chris

On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 akhilan...@gmail.com wrote:


 Please ask any questions if I am not clear above about the problem I am
 facing.

 Thanks,
 Akhil

 akhil1988 wrote:
 
  Hi All!
 
  I want a directory to be present in the local working directory of the
  task for which I am using the following statements:
 
  DistributedCache.addCacheArchive(new URI(/home/akhil1988/Config.zip),
  conf);
  DistributedCache.createSymlink(conf);
 
  Here Config is a directory which I have zipped and put at the given
  location in HDFS
 
  I have zipped the directory because the API doc of DistributedCache
  (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
 the
  archive files are unzipped in the local cache directory :
 
  DistributedCache can be used to distribute simple, read-only data/text
  files and/or more complex types such as archives, jars etc. Archives
 (zip,
  tar and tgz/tar.gz files) are un-archived at the slave nodes.
 
  So, from my understanding of the API docs I expect that the Config.zip
  file will be unzipped to Config directory and since I have SymLinked them
  I can access the directory in the following manner from my map function:
 
  FileInputStream fin = new FileInputStream(Config/file1.config);
 
  But I get the FileNotFoundException on the execution of this statement.
  Please let me know where I am going wrong.
 
  Thanks,
  Akhil
 

 --
 View this message in context:
 http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Error in Cluster Startup: NameNode is not formatted

2009-06-26 Thread Boyu Zhang
Hi all,

I am a student and I am trying to install Hadoop on a cluster. I have
one machine running the namenode, one running the jobtracker, and two slaves.

When I run bin/start-dfs.sh, there is something wrong with my
namenode; it won't start. Here is the error message in the log file:

 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
failed.
java.io.IOException: NameNode is not formatted.
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)


I think it is something stupid I did; could somebody help me out? Thanks a
lot!


Sincerely,

Boyu Zhang


RE: Permissions needed to run RandomWriter ?

2009-06-26 Thread Mulcahy, Stephen
[Apologies for the top-post, sending this from a dodgy webmail client]

Hi Alex,

My hadoop-site.xml is as follows,

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop01:9001</value>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop01:9000</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data1/hadoop-tmp/</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/data1/hdfs,/data2/hdfs</value>
  </property>
</configuration>

Any comments welcome,

-stephen



-Original Message-
From: Alex Loddengaard [mailto:a...@cloudera.com]
Sent: Fri 26/06/2009 18:32
To: core-user@hadoop.apache.org
Subject: Re: Permissions needed to run RandomWriter ?
 
Hey Stephen,

What does your hadoop-site.xml look like?  The Exception is in
java.io.UnixFileSystem, which makes me think that you're actually creating
and modifying directories on your local file system instead of HDFS.  Make
sure fs.default.name looks like hdfs://your-namenode.domain.com:PORT.

Alex

On Fri, Jun 26, 2009 at 4:40 AM, stephen mulcahy
stephen.mulc...@deri.orgwrote:

 Hi,

 I've just installed a new test cluster and I'm trying to give it a quick
 smoke test with RandomWriter and Sort.

 I can run these fine with the superuser account. When I try to run them as
 another user I run into problems even though I've created the output
 directory and given permissions to the other user to write to this
 directory. i.e.

 1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo
 mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission
 denied: user=smulcahy, access=WRITE, inode=:hadoop:supergroup:rwxr-xr-x

 OK - we don't have permissions anyways

 2. had...@hadoop01:/$ hadoop fs -mkdir /foo

 OK

 3. hadoop fs -chown -R smulcahy /foo

 OK

 4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test

 OK

 5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar
 randomwriter /foo
 java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1793)
at org.apache.hadoop.util.RunJar.main(RunJar.java:115)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 Any suggestions on why step 5. is failing even though I have write
 permissions to /foo - do I need permissions on some other directory also or
 ... ?

 Thanks,

 -stephen

 --
 Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
 NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
  http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com




Re: Error in Cluster Startup: NameNode is not formatted

2009-06-26 Thread Matt Massie

Boyu-

You didn't do anything stupid.  I've forgotten to format a NameNode  
too myself.


If you check the QuickStart guide at http://hadoop.apache.org/core/docs/current/quickstart.html
you'll see that formatting the NameNode is the first step of the
Execution section (near the bottom of the page).


The command to format the NameNode is:

hadoop namenode -format

A warning, though: you should only format your NameNode once. Just
like formatting any filesystem, you can lose data if you (re)format.


Good luck.

-Matt

On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote:


Hi all,

I am a student and I am trying to install the Hadoop on a cluster, I  
have

one machine running namenode, one running jobtracker, two slaves.

When I run the /bin/start-dfs.sh , there is something wrong with my
namenode, it won't start. Here is the error message in the log file:

ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
failed.
java.io.IOException: NameNode is not formatted.
   at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
   at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
   at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
   at  
org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)

   at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
   at  
org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)

   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)


I think it is something stupid i did, could somebody help me out?  
Thanks a

lot!


Sincerely,

Boyu Zhang




Re: Pregel

2009-06-26 Thread Saptarshi Guha
Hello,
I don't have a  background in CS, but does MS's Dryad (
http://research.microsoft.com/en-us/projects/Dryad/ ) fit in anywhere
here?
Regards
Saptarshi


On Fri, Jun 26, 2009 at 5:19 AM, Edward J. Yoonedwardy...@apache.org wrote:
 According to my understanding, I think the Pregel is in same layer
 with MR, not a MR based language processor.

 I think the 'Collective Communication' of BSP seems the core of the
 problem. For example, this BFS problem
 (http://blog.udanax.org/2009/02/breadth-first-search-mapreduce.html)
 can be solved at once w/o MR iterations.

 On Fri, Jun 26, 2009 at 3:17 PM, Owen O'Malleyomal...@apache.org wrote:

 On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote:

 my guess, as good as anybody's, is that Pregel is to large graphs is what
 Hadoop is to large datasets.

 I think it is much more likely a language that allows you to easily define
 fixed point algorithms.  I would imagine a distributed version of something
 similar to Michal Young's GenSet.
 http://portal.acm.org/citation.cfm?doid=586094.586108

 I've been trying to figure out how to justify working on a project like that
 for a couple of years, but haven't yet. (I have a background in program
 static analysis, so I've implemented similar stuff.)

 In other words, Pregel is the next natural step
 for massively scalable computations after Hadoop.

 I wonder if it uses map/reduce as a base or not. It would be easier to use
 map/reduce, but a direct implementation would be more performant. In either
 case, it is a new hammer. From what I see, it likely won't replace
 map/reduce, pig, or hive; but rather support a different class of
 applications much more directly than you can under map/reduce.

 -- Owen





 --
 Best Regards, Edward J. Yoon @ NHN, corp.
 edwardy...@apache.org
 http://blog.udanax.org



RE: Error in Cluster Startup: NameNode is not formatted

2009-06-26 Thread Boyu Zhang
Matt,

Thanks a lot for your reply! I did format the namenode, but I got the
same error again. Actually, I successfully ran the example jar file once,
but after that one time I couldn't get it to run again. I clean the /tmp dir
every time before I format the namenode again (I am just testing it, so I don't
worry about losing data :). Still, I got the same error when I execute
bin/start-dfs.sh. I checked my conf and I can't figure out why. Here is my
conf file:

I really appreciate it if you could take a look at it. Thanks a lot.
 

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://hostname1:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>hostname2:9001</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/data/zhang/hadoop/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
    should store its blocks.  If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
    </description>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/data/zhang/hadoop/mapred/local</value>
    <description>The local directory where MapReduce stores intermediate
    data files.  May be a comma-separated list of
    directories on different devices in order to spread disk i/o.
    Directories that do not exist are ignored.
    </description>
  </property>
</configuration>


-Original Message-
From: Matt Massie [mailto:m...@cloudera.com] 
Sent: Friday, June 26, 2009 4:31 PM
To: core-user@hadoop.apache.org
Subject: Re: Error in Cluster Startup: NameNode is not formatted

Boyu-

You didn't do anything stupid.  I've forgotten to format a NameNode  
too myself.

If you check the QuickStart guide at
http://hadoop.apache.org/core/docs/current/quickstart.html 
  you'll see that formatting the NameNode is the first of the  
Execution section (near the bottom of the page).

The command to format the NameNode is:

hadoop namenode -format

A warning though, you should only format your NameNode once.  Just  
like formatting any filesystem, you can loss data if you (re)format.

Good luck.

-Matt

On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote:

 Hi all,

 I am a student and I am trying to install the Hadoop on a cluster, I  
 have
 one machine running namenode, one running jobtracker, two slaves.

 When I run the /bin/start-dfs.sh , there is something wrong with my
 namenode, it won't start. Here is the error message in the log file:

 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
 failed.
 java.io.IOException: NameNode is not formatted.
at
 org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
at
 org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
 org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at  
 org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
at  
 org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)


 I think it is something stupid i did, could somebody help me out?  
 Thanks a
 lot!


 Sincerely,

 Boyu Zhang




Re: Error in Cluster Startup: NameNode is not formatted

2009-06-26 Thread Amandeep Khurana
Sometimes the metadata gets corrupted. It's happened to me on multiple
occasions during the initial stages of setting up a cluster. What I did
was simply delete the entire directory where the metadata and the actual
data are stored by HDFS. Since I was playing around with the systems
and didn't care much about the data, I could do so. If it doesn't spoil
anything for you, go ahead and try it. It might work.

Secondly, you've specified the dfs.data.dir parameter but haven't specified
the metadata directory (dfs.name.dir). AFAIK, it will take /tmp as the default.
Since /tmp gets cleaned up, you'll lose your metadata, and that could be
causing the system to not come up. Specify that parameter in the config file.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Fri, Jun 26, 2009 at 2:33 PM, Boyu Zhang boyuzhan...@gmail.com wrote:

 Matt,

 Thanks a lot for your reply! I did formatted the namenode. But I got the
 same error again. And actually I successfully run the example jar file
 once,
 but after that one time, I couldn't get it run again. I clean the /tmp dir
 every time before I format namenode again(I am just testing it, so I don't
 worry about losing data:). Still, I got the same error when I execute the
 bin/start-dfs.sh . I checked my conf, and I can't figure out why. Here is
 my
 conf file:

 I really appreciate if you could take a look at it. Thanks a lot.


 <configuration>

  <property>
   <name>fs.default.name</name>
   <value>hdfs://hostname1:9000</value>
  </property>

  <property>
   <name>mapred.job.tracker</name>
   <value>hostname2:9001</value>
  </property>

  <property>
   <name>dfs.data.dir</name>
   <value>/data/zhang/hadoop/dfs/data</value>
   <description>Determines where on the local filesystem an DFS data node
   should store its blocks.  If this is a comma-delimited
   list of directories, then data will be stored in all named
   directories, typically on different devices.
   Directories that do not exist are ignored.
   </description>
  </property>

  <property>
   <name>mapred.local.dir</name>
   <value>/data/zhang/hadoop/mapred/local</value>
   <description>The local directory where MapReduce stores intermediate
   data files.  May be a comma-separated list of
   directories on different devices in order to spread disk i/o.
   Directories that do not exist are ignored.
   </description>
  </property>
 </configuration>


 -Original Message-
 From: Matt Massie [mailto:m...@cloudera.com]
 Sent: Friday, June 26, 2009 4:31 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Error in Cluster Startup: NameNode is not formatted

 Boyu-

 You didn't do anything stupid.  I've forgotten to format a NameNode
 too myself.

 If you check the QuickStart guide at
 http://hadoop.apache.org/core/docs/current/quickstart.html
  you'll see that formatting the NameNode is the first of the
 Execution section (near the bottom of the page).

 The command to format the NameNode is:

 hadoop namenode -format

 A warning though, you should only format your NameNode once.  Just
 like formatting any filesystem, you can loss data if you (re)format.

 Good luck.

 -Matt

 On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote:

  Hi all,
 
  I am a student and I am trying to install the Hadoop on a cluster, I
  have
  one machine running namenode, one running jobtracker, two slaves.
 
  When I run the /bin/start-dfs.sh , there is something wrong with my
  namenode, it won't start. Here is the error message in the log file:
 
  ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
  failed.
  java.io.IOException: NameNode is not formatted.
 at
  org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
 at
  org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
 at
  org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
 at
  org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
 at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
 at
  org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
 at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
 
 
  I think it is something stupid i did, could somebody help me out?
  Thanks a
  lot!
 
 
  Sincerely,
 
  Boyu Zhang





Scaling out/up or a mix

2009-06-26 Thread Marcus Herou
Hi.

We have a deployment of 10 Hadoop servers and I now need more mapping
capacity (no, not just adding more mappers per instance) since I have so many
jobs running. Now I am wondering what I should aim for...
Memory, CPU or disk... "How long is a piece of string?", perhaps you would say.

A typical server is currently using about 15-20% CPU on a quad-core
2.4 GHz, 8 GB RAM machine with 2 RAID1 SATA 500 GB disks.

Some specs below.
 mpstat 2 5
Linux 2.6.24-19-server (mapreduce2)   06/26/2009

11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30

What I am thinking is... is it wiser to go for many of these cheap boxes
with 8 GB of RAM, or should I instead focus on machines which can give
more I/O throughput?

I know that these things are hard to answer, but perhaps someone has already
drawn some conclusions the pragmatic way.

Kindly

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Error in Cluster Startup: NameNode is not formatted

2009-06-26 Thread Matt Massie
The property dfs.name.dir allows you to control where Hadoop writes  
NameNode metadata.


You should have a property like

<property>
  <name>dfs.name.dir</name>
  <value>/data/zhang/hadoop/name/data</value>
</property>

to make sure the NameNode data isn't being deleted when you delete the  
files in /tmp.


-Matt


On Jun 26, 2009, at 2:33 PM, Boyu Zhang wrote:


Matt,

Thanks a lot for your reply! I did formatted the namenode. But I got  
the
same error again. And actually I successfully run the example jar  
file once,
but after that one time, I couldn't get it run again. I clean the / 
tmp dir
every time before I format namenode again(I am just testing it, so I  
don't
worry about losing data:). Still, I got the same error when I  
execute the
bin/start-dfs.sh . I checked my conf, and I can't figure out why.  
Here is my

conf file:

I really appreciate if you could take a look at it. Thanks a lot.


<configuration>

<property>
  <name>fs.default.name</name>
  <value>hdfs://hostname1:9000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>hostname2:9001</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/data/zhang/hadoop/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/data/zhang/hadoop/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  Directories that do not exist are ignored.
  </description>
</property>
</configuration>


-Original Message-
From: Matt Massie [mailto:m...@cloudera.com]
Sent: Friday, June 26, 2009 4:31 PM
To: core-user@hadoop.apache.org
Subject: Re: Error in Cluster Startup: NameNode is not formatted

Boyu-

You didn't do anything stupid.  I've forgotten to format a NameNode
too myself.

If you check the QuickStart guide at
http://hadoop.apache.org/core/docs/current/quickstart.html
 you'll see that formatting the NameNode is the first of the
Execution section (near the bottom of the page).

The command to format the NameNode is:

hadoop namenode -format

A warning though, you should only format your NameNode once.  Just
like formatting any filesystem, you can loss data if you (re)format.

Good luck.

-Matt

On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote:


Hi all,

I am a student and I am trying to install the Hadoop on a cluster, I
have
one machine running namenode, one running jobtracker, two slaves.

When I run the /bin/start-dfs.sh , there is something wrong with my
namenode, it won't start. Here is the error message in the log file:

ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
failed.
java.io.IOException: NameNode is not formatted.
  at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
  at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
  at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
  at
org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:273)
  at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
  at org.apache.hadoop.dfs.NameNode.init(NameNode.java:193)
  at org.apache.hadoop.dfs.NameNode.init(NameNode.java:179)
  at
org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
  at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)


I think it is something stupid i did, could somebody help me out?
Thanks a
lot!


Sincerely,

Boyu Zhang







difference between 'hadoop.tmp.dir' 'mapred.temp.dir'

2009-06-26 Thread umer arshad

Hi,

Can somebody kindly explain the difference between 'hadoop.tmp.dir' and
'mapred.temp.dir'?
I am trying to figure out where the intermediate temporary files for a
MapReduce job are stored.

Thanks,
--umer

_
Invite your mail contacts to join your friends list with Windows Live Spaces. 
It's easy!
http://spaces.live.com/spacesapi.aspx?wx_action=createwx_url=/friends.aspxmkt=en-us

Re: Permissions needed to run RandomWriter ?

2009-06-26 Thread Alex Loddengaard
Have you tried to run the example job as the superuser?  It seems like this
might be an issue where hadoop.tmp.dir doesn't have the correct
permissions.  hadoop.tmp.dir and dfs.data.dir should be owned by the unix
user running your Hadoop daemons and be owner-writable and readable.

Can you confirm this is the case?  Thanks,

Alex

On Fri, Jun 26, 2009 at 1:29 PM, Mulcahy, Stephen
stephen.mulc...@deri.orgwrote:

 [Apologies for the top-post, sending this from a dodgy webmail client]

 Hi Alex,

 My hadoop-site.xml is as follows,

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>
   <property>
     <name>mapred.job.tracker</name>
     <value>hadoop01:9001</value>
   </property>

   <property>
     <name>fs.default.name</name>
     <value>hdfs://hadoop01:9000</value>
   </property>

   <property>
     <name>hadoop.tmp.dir</name>
     <value>/data1/hadoop-tmp/</value>
   </property>

   <property>
     <name>dfs.data.dir</name>
     <value>/data1/hdfs,/data2/hdfs</value>
   </property>
 </configuration>

 Any comments welcome,

 -stephen



 -Original Message-
 From: Alex Loddengaard [mailto:a...@cloudera.com]
 Sent: Fri 26/06/2009 18:32
 To: core-user@hadoop.apache.org
 Subject: Re: Permissions needed to run RandomWriter ?

 Hey Stephen,

 What does your hadoop-site.xml look like?  The Exception is in
 java.io.UnixFileSystem, which makes me think that you're actually creating
 and modifying directories on your local file system instead of HDFS.  Make
 sure fs.default.name looks like hdfs://your-namenode.domain.com:PORT.

 Alex

 On Fri, Jun 26, 2009 at 4:40 AM, stephen mulcahy
 stephen.mulc...@deri.orgwrote:

  Hi,
 
  I've just installed a new test cluster and I'm trying to give it a quick
  smoke test with RandomWriter and Sort.
 
  I can run these fine with the superuser account. When I try to run them
 as
  another user I run into problems even though I've created the output
  directory and given permissions to the other user to write to this
  directory. i.e.
 
  1. smulc...@hadoop01:~$ hadoop fs -mkdir /foo
  mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission
  denied: user=smulcahy, access=WRITE, inode=:hadoop:supergroup:rwxr-xr-x
 
  OK - we don't have permissions anyways
 
  2. had...@hadoop01:/$ hadoop fs -mkdir /foo
 
  OK
 
  3. hadoop fs -chown -R smulcahy /foo
 
  OK
 
  4. smulc...@hadoop01:~$ hadoop fs -mkdir /foo/test
 
  OK
 
  5. smulc...@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar
  randomwriter /foo
  java.io.IOException: Permission denied
 at java.io.UnixFileSystem.createFileExclusively(Native Method)
 at java.io.File.checkAndCreate(File.java:1704)
 at java.io.File.createTempFile(File.java:1793)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:115)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
  Any suggestions on why step 5. is failing even though I have write
  permissions to /foo - do I need permissions on some other directory also
 or
  ... ?
 
  Thanks,
 
  -stephen
 
  --
  Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
  NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
  http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
 




FileStatus.getLen(): bug in documentation or bug in implememtation?

2009-06-26 Thread Dima Rzhevskiy
Hi all,
I am trying to get the length of a file in Hadoop (raw filesystem or HDFS).
The javadoc for the method org.apache.hadoop.fs.FileStatus.getLen() says that
it returns the length of this file, in blocks,
but the method returns the size in bytes.

Is this a bug in the documentation or in the implementation?
I use hadoop-0.18.3.


Dmitry Rzhevskiy.


Can I post pig questions on this forum?

2009-06-26 Thread pmg


-- 
View this message in context: 
http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Can I post pig questions on this forum?

2009-06-26 Thread Christophe Bisciglia
pig-u...@hadoop.apache.org

On Fri, Jun 26, 2009 at 4:34 PM, pmgparmod.me...@gmail.com wrote:


 --
 View this message in context: 
 http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.





-- 
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera


Hadoop0.20 - Class Not Found exception

2009-06-26 Thread Amandeep Khurana
I'm getting the following error while starting a MR job:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
oracle.jdbc.driver.OracleDriver
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at
org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at
org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

Interestingly, the relevant jar is bundled into the MR job jar and it's also
there in the $HADOOP_HOME/lib directory.

Exactly the same thing worked with 0.19. Not sure what could have changed, or
what I broke, to cause this error...

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: FileStatus.getLen(): bug in documentation or bug in implememtation?

2009-06-26 Thread Konstantin Shvachko

Documentation is wrong. Implementation wins.
Could you please file a bug.

Thanks,
--Konstantin

Dima Rzhevskiy wrote:

Hi all
I try get length of file hadoop(RawFilesysten or hdfs) .
In javadoc method  org.apache.hadoop.fs.FileStatus.getLen()  writtend that
this method return the length of this file, in blocks
But method return size in bytes.

Is this bug in documentation or implememtation?
I use  hadoop-0.18.3.


Dmitry Rzhevskiy.



Re: Using addCacheArchive

2009-06-26 Thread akhil1988

Thanks Chris for your reply!

Well, I could not understand much of what has been discussed on that forum.
I am unaware of Cascading.

My problem is simple - I want a directory to be present in the local working
directory of the tasks so that I can access it from my map task in the following
manner:

FileInputStream fin = new FileInputStream("Config/file1.config");

where,
Config is a directory which contains many files/directories, one of which is
file1.config

It would be helpful if you could tell me what statements to use to
distribute a directory to the tasktrackers.
The API doc (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
that archives are unzipped on the tasktrackers, but I want an example of how
to use this in the case of a directory.

Thanks,
Akhil
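
For reference, a minimal sketch of archive distribution along the lines the API
docs describe, assuming Config.zip has already been uploaded to HDFS. The HDFS
path and the '#Config' fragment naming the symlink are illustrative, not
confirmed by this thread.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheArchiveExample {
  public static void addConfigArchive(JobConf conf) throws Exception {
    // The fragment after '#' names the symlink created in each task's
    // working directory; the archive itself must live in HDFS.
    DistributedCache.addCacheArchive(
        new URI("/user/akhil1988/Config.zip#Config"), conf);
    DistributedCache.createSymlink(conf);
    // A map task could then open "Config/file1.config" relative to its
    // working directory once the archive has been unpacked.
  }
}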



Chris Curtin-2 wrote:
 
 Hi,
 
 I've found it much easier to write the file to HDFS use the API, then pass
 the 'path' to the file in HDFS as a property. You'll need to remember to
 clean up the file after you're done with it.
 
 Example details are in this thread:
 http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
 
 Hope this helps,
 
 Chris
 
 On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 akhilan...@gmail.com wrote:
 

 Please ask any questions if I am not clear above about the problem I am
 facing.

 Thanks,
 Akhil

 akhil1988 wrote:
 
  Hi All!
 
  I want a directory to be present in the local working directory of the
  task for which I am using the following statements:
 
  DistributedCache.addCacheArchive(new URI(/home/akhil1988/Config.zip),
  conf);
  DistributedCache.createSymlink(conf);
 
  Here Config is a directory which I have zipped and put at the given
  location in HDFS
 
  I have zipped the directory because the API doc of DistributedCache
  (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
 the
  archive files are unzipped in the local cache directory :
 
  DistributedCache can be used to distribute simple, read-only data/text
  files and/or more complex types such as archives, jars etc. Archives
 (zip,
  tar and tgz/tar.gz files) are un-archived at the slave nodes.
 
  So, from my understanding of the API docs I expect that the Config.zip
  file will be unzipped to Config directory and since I have SymLinked
 them
  I can access the directory in the following manner from my map
 function:
 
  FileInputStream fin = new FileInputStream(Config/file1.config);
 
  But I get the FileNotFoundException on the execution of this statement.
  Please let me know where I am going wrong.
 
  Thanks,
  Akhil
 

 --
 View this message in context:
 http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Can I post pig questions on this forum?

2009-06-26 Thread Alan Gates

pig-u...@hadoop.apache.org is the right place for pig questions.

Alan.

On Jun 26, 2009, at 4:34 PM, pmg wrote:




--
View this message in context: 
http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.





Re: Hadoop0.20 - Class Not Found exception

2009-06-26 Thread imcaptor
I ran into this problem too, and resolved it by passing a class to the JobConf
constructor.

If you construct the JobConf yourself, you must pass a class from your job jar
to it.
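
For illustration, a minimal sketch of that suggestion (the driver class name is
made up):

import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
  public static void main(String[] args) {
    // Passing a class from the job jar tells Hadoop which jar to ship to the
    // task nodes, so classes bundled inside it (for example a JDBC driver in
    // its lib/ directory) can be found at runtime.
    JobConf conf = new JobConf(MyJobDriver.class);
    conf.setJobName("example");
    // ... configure input/output formats, mapper/reducer, and submit ...
  }
}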


2009/6/27 Amandeep Khurana ama...@gmail.com

 I'm getting the following error while starting a MR job:

 Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
 oracle.jdbc.driver.OracleDriver
at

 org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
 Caused by: java.lang.ClassNotFoundException:
 oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at

 org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at

 org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

 Interestingly, the relevant jar is bundled into the MR job jar and its also
 there in the $HADOOP_HOME/lib directory.

 Exactly same thing worked with 0.19.. Not sure what could have changed or I
 broke to cause this error...

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz



Re: Scaling out/up or a mix

2009-06-26 Thread Brian Bockelman

Hey Marcus,

Are you recording the data rates coming out of HDFS?  Since you have
such low CPU utilization, I'd look at boxes utterly packed with big
hard drives (also, why are you using RAID1 for Hadoop?).


You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive  
bays.  Based on the data rates you see, make the call.


On the other hand, what's the argument against running 3x more mappers
per box?  It seems that your boxes still have plenty of headroom --
there's no I/O wait.


Brian

On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:


Hi.

We have a deployment of 10 hadoop servers and I now need more mapping
capability (no not just add more mappers per instance) since I have  
so many

jobs running. Now I am wondering what I should aim on...
Memory, cpu or disk... How long is a rope perhaps you would say ?

A typical server is currently using about 15-20% cpu today on a quad- 
core

2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.

Some specs below.

mpstat 2 5

Linux 2.6.24-19-server (mapreduce2) 06/26/2009

11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30

What I am thinking is... Is it wiser to go for many of these cheap  
boxes
with 8GB of RAM or should I for instance focus on machines which can  
give

more I|O throughput ?

I know that these things are hard but perhaps someone have draw some
conclusions before the pragmatic way.

Kindly

//Marcus


--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/




Map/Reduce Errors

2009-06-26 Thread Usman Waheed

Hi All,

I had posted a question earlier regarding some not-so-intuitive error
messages that I was getting on one of the clusters when trying to run
map/reduce jobs. After many hours of googling :) I found a post that solved my
problem.


http://www.mail-archive.com/core-user@hadoop.apache.org/msg07202.html.

One of our engineers ran way too many jobs that created enormous subdirs
in $HADOOP_HOME/logs/userlogs. Deleting these subdirs under
$HADOOP_HOME/logs/userlogs/ on the datanodes solved the problem. You can
also configure the cleanup in your Hadoop config by setting the retention time
to x hours instead of the default 24; the specific parameter is
mapred.userlog.retain.hours.


Just wanted to share this with you all.

Thanks,
Usman


--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/


HDFS Random Access

2009-06-26 Thread tsuraan
All the documentation for HDFS says that it's for large streaming
jobs, but I couldn't find an explicit answer to this, so I'll try
asking here.  How is HDFS's random seek performance within an
FSDataInputStream?  I use lucene with a lot of indices (potentially
thousands), so I was thinking of putting them into HDFS and
reimplementing my search as a Hadoop map-reduce.  I've noticed that
lucene tends to do a bit of random seeking when searching though; I
don't believe that it guarantees that all seeks be to increasing file
positions either.

Would HDFS be a bad fit for an access pattern that involves seeks to
random positions within a stream?

Also, is getFileStatus the typical way of getting the length of a file
in HDFS, or is there some method on FSDataInputStream that I'm not
seeing?
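
(For reference, a small hedged sketch of both operations with the FileSystem
API; the path is illustrative. FileStatus.getLen() returns the length in bytes,
and FSDataInputStream supports seek() to arbitrary positions, though random
seeks on HDFS can be comparatively expensive.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/tsuraan/index/segment.cfs");  // illustrative path

    // getFileStatus() is the usual way to get the length (in bytes).
    FileStatus status = fs.getFileStatus(p);
    long length = status.getLen();

    // FSDataInputStream is Seekable, so random positioning works, even if it
    // is slower than streaming reads.
    FSDataInputStream in = fs.open(p);
    in.seek(length / 2);           // jump to the middle of the file
    byte[] buf = new byte[1024];
    int read = in.read(buf);       // read from that position
    in.close();
    System.out.println("Read " + read + " bytes starting at offset " + length / 2);
  }
}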

Please cc: me on any reply; I'm not on the hadoop list.  Thanks!