Re: Basic code organization questions + scheduling

2008-09-08 Thread tarjei

Hi Alex (and others).

 You should take a look at Nutch. It's a search engine built on Lucene,
 and it can be set up on top of Hadoop. Take a look:
This didn't help me much. Although the description I gave of the basic
flow of the app seems to be close to what Nutch is doing (and I've been
looking at the Nutch code), the questions are more general and not
related to indexing as such, but about code organization. If someone has
more input to those, feel free to add it.


  On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:
 
 Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
 tasks. The basic flow is

 input:array of urls
 actions:  |
 1.  get pages
  |
 2.  extract new urls from pages - start new job
extract text  - index / filter (as new jobs)

 What I'm considering is how I should build this application to fit into the
 map/reduce context. I'm thinking that step 1 and 2 should be separate
 map/reduce tasks that then pipe things on to the next step.

 This is where I am a bit at a loss to see how it is smart to organize the
 code in logical units and also how to spawn new tasks when an old one is
 over.

 Is the usual way to control the flow of a set of tasks to have an external
 application running that listens to jobs ending via the endNotificationUri
 and then spawns new tasks or should the job itself contain code to create
 new jobs? Would it be a good idea to use Cascading here?

 I'm also considering how I should do job scheduling (I have a lot of
 recurring tasks). Has anyone found a good framework for job control of
 recurring tasks, or should I plan to build my own using Quartz?

 Any tips/best practices with regard to the issues described above are most
 welcome. Feel free to ask further questions if you find my descriptions of
 the issues lacking.
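 For the recurring-tasks part, the rough shape I have in mind is a
 Quartz-scheduled driver that simply runs the Hadoop jobs in sequence; a
 minimal sketch, where the class names, paths and cron expression are only
 placeholders and the Hadoop calls assume the 0.17-style
 FileInputFormat/FileOutputFormat API:

// Sketch: run a recurring Hadoop driver from Quartz 1.x (all names are placeholders).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.quartz.CronTrigger;
import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class CrawlCycleJob implements Job {
    public void execute(JobExecutionContext ctx) throws JobExecutionException {
        try {
            // Step 1: fetch pages. runJob() blocks until the job finishes,
            // so further steps (extract urls, extract text) can be chained here.
            JobConf fetch = new JobConf(CrawlCycleJob.class);
            fetch.setJobName("fetch-pages");
            FileInputFormat.setInputPaths(fetch, new Path("urls/current"));
            FileOutputFormat.setOutputPath(fetch, new Path("pages/" + System.currentTimeMillis()));
            JobClient.runJob(fetch);
        } catch (Exception e) {
            throw new JobExecutionException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler sched = new StdSchedulerFactory().getScheduler();
        JobDetail detail = new JobDetail("crawl-cycle", "crawler", CrawlCycleJob.class);
        // Fire the whole pipeline every night at 02:00.
        CronTrigger trigger = new CronTrigger("crawl-trigger", "crawler", "0 0 2 * * ?");
        sched.scheduleJob(detail, trigger);
        sched.start();
    }
}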

Kind regards,
Tarjei





RE: task assignment management.

2008-09-08 Thread Dmitry Pushkarev
How about just specifying the machines to run the task on? I haven't seen that
option anywhere.

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Sunday, September 07, 2008 9:55 PM
To: core-user@hadoop.apache.org
Subject: Re: task assignment management.

No that is not possible today. However, you might want to look at the
TaskScheduler to see if you can implement a scheduler to provide this kind
of task scheduling.

One point regarding computationally intensive tasks in the current Hadoop: if a
machine cannot keep up with the rest of the machines (and the task on that
machine runs slower than the others), speculative execution, if enabled, can
help a lot. Also, faster/better machines implicitly get more work than slower
machines.
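
For reference, speculative execution is toggled in hadoop-site.xml; something
along these lines (in recent releases it is split per phase; values here are
only illustrative):

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>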


On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

 Dear Hadoop users,
 
  
 
 Is it possible, without using Java, to manage task assignment and implement
 some simple rules? For example: do not launch more than one instance of a
 crawling task on a machine, do not run data-intensive tasks on remote
 machines, do not run computationally intensive tasks on single-core
 machines, etc.
 
  
 
 Now it's done by failing tasks that decide to run on the wrong machine, but I
 hope to find a solution on the jobtracker side.
 
  
 
 ---
 
 Dmitry
 




Re: HBase, Hive, Pig and other Hadoop based technologies

2008-09-08 Thread Naama Kraus
Both Pig and Hive are written on top of Hadoop, is that correct? Does this
mean, for instance, that they would work with any implementation of the Hadoop
File System interface, whether it is HDFS, KFS, LocalFS, or any future
implementation? Is that true for HBase as well?

I am assuming Pig and Hive use the standard Hadoop M/R framework and API, is
that right?

Thanks, Naama

On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus [EMAIL PROTECTED] wrote:

 Hi,

 There are various technologies on top of Hadoop such as HBase, Hive, Pig
 and more. I was wondering what are the differences between them. What are
 the usage scenarios that fit each one of them.

 For instance, is it true to say that Pig and Hive belong to the same family?
 Or is Hive closer to HBase?
 My understanding is that HBase allows direct lookup and low-latency
 queries, while Pig and Hive provide batch processing operations which are
 M/R based. Both define a data model and an SQL-like query language. Is this
 true?

 Could anyone shed light on when to use each technology ? Main differences ?
 Pros and Cons ?
 Information on other technologies such as Jaql is also welcome.

 Thanks, Naama

 --
 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
 00 oo 00 oo
 If you want your children to be intelligent, read them fairy tales. If you
 want them to be more intelligent, read them more fairy tales. (Albert
 Einstein)




-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales. (Albert
Einstein)


Re: Getting started questions

2008-09-08 Thread Sagar Naik

Dennis Kubes wrote:



John Howland wrote:

I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?


AFAIK Hadoop can handle a large number of small files, but it doesn't do well
with them. So it would be better to read in the documents and store them in
SequenceFile or MapFile format. This would be similar to the way the Fetcher
works in Nutch. 10M documents in a sequence/map file on DFS is comparatively
small and can be handled efficiently.
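
As a rough illustration of the SequenceFile approach (the output path and the
choice of Text for both key and value are only an example, not a prescription):

// Sketch: pack many small documents into one SequenceFile on DFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocPacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("docs/part-00000.seq");   // placeholder path

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            // key = document id, value = document body
            writer.append(new Text("doc-0001"), new Text("body of document 1 ..."));
            writer.append(new Text("doc-0002"), new Text("body of document 2 ..."));
        } finally {
            writer.close();
        }
    }
}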




 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.


I would say store your document bodies as Text objects; I'm not sure whether
Text has a max size -- I think it does, but I don't know what it is. If it
does, you can always store them as a BytesWritable, which is just an array of
bytes. But you are going to have memory issues reading in and writing out
records that large.
 - I will also need to store some metadata for each document, which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.


It is possible to create an OutputFormat that writes out multiple 
files.  You could also use a MapWritable as the value to store the 
document and associated metadata.




Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM

A writable is read into RAM so even treating it like a stream doesn't 
get around that.


One thing you might want to consider is to tar up, say, X documents at a time
and store that as a file in DFS. You would have many of these files. Then have
an index that maps keys (document ids) to files and offsets. That index can be
passed as input into an MR job, which can then go to DFS and stream out the
files as needed. The job will be slower this way, but it is one way to handle
such large documents as streams.


all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.


If you don't specify a reducer, the IdentityReducer is run, which simply
passes the map output through.
   You can also set the number of reducers to zero, in which case the reduce
phase will not take place at all.
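
A minimal sketch of such a map-only driver (IdentityMapper just stands in for
a real mapper; input and output paths come from the command line):

// Sketch: a job with zero reducers; map output goes straight to the OutputFormat.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyDriver.class);
        conf.setJobName("add-metadata");
        conf.setMapperClass(IdentityMapper.class);  // swap in the real metadata mapper
        conf.setNumReduceTasks(0);                  // no reduce phase at all
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}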




3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from a previous
stage's reduce task before that reducer has fully finished?


Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
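
In other words, a driver along these lines (class name and paths are
placeholders); note that with the current API the second job only starts once
the first job, including its reducers, has completed:

// Sketch: chain two jobs by feeding job 1's output directory into job 2.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class PipelineDriver {
    public static void main(String[] args) throws Exception {
        JobConf stage1 = new JobConf(PipelineDriver.class);
        stage1.setJobName("stage1");
        FileInputFormat.setInputPaths(stage1, new Path(args[0]));
        FileOutputFormat.setOutputPath(stage1, new Path("stage1-out"));
        JobClient.runJob(stage1);   // blocks until stage 1 (map and reduce) finishes

        JobConf stage2 = new JobConf(PipelineDriver.class);
        stage2.setJobName("stage2");
        FileInputFormat.setInputPaths(stage2, FileOutputFormat.getOutputPath(stage1));
        FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
        JobClient.runJob(stage2);
    }
}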

Dennis


Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more real.

A whole heap of thanks in advance,

John





Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-08 Thread Espen Amble Kolstad
There's a JIRA on this already:
https://issues.apache.org/jira/browse/HADOOP-3831
Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
to do the trick for now.
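
For reference, the workaround goes into hadoop-site.xml roughly like this
(0 disables the write timeout):

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>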

Espen

On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:
 Hi,

 Thanks for the tip!

 I tried revision 692572 of the 0.18 branch, but I still get the same errors.

 On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
 The DFS errors might have been caused by

 http://issues.apache.org/jira/browse/HADOOP-4040

 thanks,
 dhruba

 On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das [EMAIL PROTECTED] wrote:
  These exceptions are apparently coming from the dfs side of things. Could
  someone from the dfs side please look at these?
 
  On 9/5/08 3:04 PM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:
  Hi,
 
  Thanks!
  The patch applies without change to hadoop-0.18.0, and should be
  included in a 0.18.1.
 
  However, I'm still seeing:
  in hadoop.log:
  2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
  from blk_3428404120239503595_2664 of
  /user/trank/segments/20080905102650/crawl_generate/part-00010 from
  somehost:50010: java.io.IOException: Premeture EOF from inputStream
 
  in datanode.log:
  2008-09-05 11:15:09,554 WARN  dfs.DataNode -
  DatanodeRegistration(somehost:50010,
  storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
  ipcPort=50020):Got exception while serving
  blk_-4682098638573619471_2662 to
  /somehost:
  java.net.SocketTimeoutException: 48 millis timeout while waiting
  for channel to be ready for write. ch :
  java.nio.channels.SocketChannel[connected local=/somehost:50010
  remote=/somehost:45244]
 
  These entries in datanode.log happens a few minutes apart repeatedly.
  I've reduced # map-tasks so load on this node is below 1.0 with 5GB of
  free memory (so it's not resource starvation).
 
  Espen
 
  On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das [EMAIL PROTECTED] wrote:
  I started a profile of the reduce-task. I've attached the profiling
  output. It seems from the samples that ramManager.waitForDataToMerge()
  doesn't actually wait.
  Has anybody seen this behavior.
 
  This has been fixed in HADOOP-3940
 
  On 9/4/08 6:36 PM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:
  I have the same problem on our cluster.
 
  It seems the reducer-tasks are using all cpu, long before there's
  anything to
  shuffle.
 
  I started a profile of the reduce-task. I've attached the profiling
  output. It seems from the samples that ramManager.waitForDataToMerge()
  doesn't actually wait.
  Has anybody seen this behavior.
 
  Espen
 
  On Thursday 28 August 2008 06:11:42 wangxu wrote:
  Hi,all
  I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
  and running hadoop on one namenode and 4 slaves.
  attached is my hadoop-site.xml, and I didn't change the file
  hadoop-default.xml
 
  when the data in the segments is large, this kind of error occurs:
 
  java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129
  file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
    at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
    at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
    at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
    at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
    at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
    at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
    at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 
 
  how can I correct this?
  thanks.
  Xu




Re: critical name node problem

2008-09-08 Thread Steve Loughran

Allen Wittenauer wrote:



On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:


Another idea would be a tool or namenode startup mode that would make it
ignore EOFExceptions to recover as much of the edits as possible.


We clearly need to change the "how to configure" docs to make sure
people put at least two directories on two different storage systems for
dfs.name.dir. This problem seems to happen quite often, and having two+
dirs helps protect against it.
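
For example, something along these lines in hadoop-site.xml (the paths are
placeholders; ideally the second directory sits on NFS or another physical disk):

<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hadoop/name,/remote/nfs/hadoop/name</value>
</property>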

We recently had one of the disks on one of our copies go bad.  The
system kept going just fine until we had a chance to reconfig the name node.

That said, I've just filed HADOOP-4080 to help alert admins in these
situations.




that and HADOOP-4081.

Apache Axis has this production/development switch; in development mode it
sends stack traces over the wire and is generally more forgiving. By
default it assumes you are in production rather than development, so you
have to explicitly flip the switch to get slightly reduced security.


Hadoop could have something similar, where if the  production flag is 
set, the cluster would simply refuse to come up if it felt the 
configuration wasn't robust enough.



Re: DFS temporary files?

2008-09-08 Thread Steve Loughran

Owen O'Malley wrote:

Currently there isn't a way to do that. In Hadoop 0.19, there will be a way
to have a clean up method that runs at the end of the job. See
HADOOP-3150https://issues.apache.org/jira/browse/HADOOP-3150
.



Another bit of feature creep would be an expires: attribute on files,
and something to purge expired files every so often, which ensures that
even if a job dies or the entire cluster is reset, stuff gets cleaned up.


Before someone rushes to implement this, I've been burned in the past by
differences in a cluster's machines and clocks. Even if everything really
is in sync with NTP, and not configured to talk to an NTP server that the
production site can't see, you still need to be 100% sure that all your boxes
are in the same time zone.


-steve


race condition in SequenceFileOutputFormat.getReaders() ?

2008-09-08 Thread Barry Haddow
Hi 

Using the Nutch tools I see fairly frequent crashes in
SequenceFileOutputFormat.getReaders() with stack traces like the one below.
What appears to be happening is that there's a temporary file inside the
generate-temp-1220879127849 directory which exists when getReaders() lists the
contents of the directory, but has been deleted by the time it goes to examine
the contents.

Since I'm using nutch, this is in Hadoop version 0.15, but the code for 
getReaders() doesn't seem to have changed in 0.18. Is this a known problem?
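
Until the underlying race is fixed, the workaround I'm considering is simply
retrying the call when a temp file vanishes between the directory listing and
the open; a sketch (the retry count and sleep are arbitrary):

// Sketch: retry SequenceFileOutputFormat.getReaders() around the listing/open race.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ReaderUtil {
    public static SequenceFile.Reader[] getReadersWithRetry(Configuration conf, Path dir)
            throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                return SequenceFileOutputFormat.getReaders(conf, dir);
            } catch (IOException e) {
                last = e;   // most likely the "Cannot open filename ..." race; list again
                try { Thread.sleep(1000); } catch (InterruptedException ie) { /* ignore */ }
            }
        }
        throw last;
    }
}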

regards
Barry

2008-09-08 14:07:04,429 FATAL crawl.Generator - Generator: 
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open 
filename 
/tmp/hadoop-bhaddow/mapred/temp/generate-temp-1220879127849/_task_200809081337_0019_r_04_1
at org.apache.hadoop.dfs.NameNode.open(NameNode.java:238)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

at org.apache.hadoop.ipc.Client.call(Client.java:482)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:848)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:840)
at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:285)
at org.apache.hadoop.dfs.DistributedFileSystem.open(DistributedFileSystem.java:114)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:87)
at org.apache.nutch.crawl.Generator.generate(Generator.java:443)
at org.apache.nutch.crawl.Generator.run(Generator.java:580)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
at org.apache.nutch.crawl.Generator.main(Generator.java:543)

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



About the name of second-namenode!!

2008-09-08 Thread 叶双明
Because the name of the second-namenode is causing so much confusion, would
the Hadoop team consider changing it?

-- 
Sorry for my english!!
明


Re: Basic code organization questions + scheduling

2008-09-08 Thread Chris K Wensel


If you wrote a simple URL fetcher function for Cascading, you would  
have a very powerful web crawler that would dwarf Nutch in flexibility.


That said, Nutch is optimized for storage, has supporting tools and
ranking algorithms, and has been up against some nasty HTML and other
document types. Building a really robust crawler is non-trivial.


If I were just starting out and needed to implement a proprietary
process, I would use Nutch for fetching raw content and refreshing
it, then use Cascading for parsing, indexing, etc.


cheers,
chris

On Sep 8, 2008, at 12:42 AM, tarjei wrote:



Hi Alex (and others).

You should take a look at Nutch. It's a search engine built on Lucene,
and it can be set up on top of Hadoop. Take a look:

This didn't help me much. Although the description I gave of the basic
flow of the app seems to be close to what Nutch is doing (and I've been
looking at the Nutch code), the questions are more general and not
related to indexing as such, but about code organization. If someone has
more input to those, feel free to add it.


On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:

Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is

input:array of urls
actions:  |
1.  get pages
 |
2.  extract new urls from pages - start new job
   extract text  - index / filter (as new jobs)

What I'm considering is how I should build this application to fit into the
map/reduce context. I'm thinking that step 1 and 2 should be separate
map/reduce tasks that then pipe things on to the next step.

This is where I am a bit at a loss to see how it is smart to organize the
code in logical units and also how to spawn new tasks when an old one is
over.

Is the usual way to control the flow of a set of tasks to have an external
application running that listens to jobs ending via the endNotificationUri
and then spawns new tasks, or should the job itself contain code to create
new jobs? Would it be a good idea to use Cascading here?

I'm also considering how I should do job scheduling (I have a lot of
recurring tasks). Has anyone found a good framework for job control of
recurring tasks, or should I plan to build my own using Quartz?

Any tips/best practices with regard to the issues described above are most
welcome. Feel free to ask further questions if you find my descriptions of
the issues lacking.


Kind regards,
Tarjei












--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Thinking about retrieving DFS metadata from datanodes!!!

2008-09-08 Thread 叶双明
Hi all.

We all know the importance of the NameNode for a cluster: when the NameNode,
which holds all the metadata of the DFS, breaks down, the whole cluster's
data is gone.

So, if we could retrieve DFS metadata from the datanodes, it would add
great robustness to Hadoop.

Does anyone think it's valuable?

-- 
Sorry for my english!!
明


Re: public IP for datanode on EC2

2008-09-08 Thread Julien Nioche
I have tried using *slave.host.name* and giving it the public address of my
datanode. I can now see the node listed with its public address on
dfshealth.jsp; however, when I try to send a file to HDFS from my
external server I still get:

*08/09/08 15:58:41 INFO dfs.DFSClient: Waiting to find target node:
10.251.75.177:50010
08/09/08 15:59:50 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException
08/09/08 15:59:50 INFO dfs.DFSClient: Abandoning block
blk_-8257572465338588575
08/09/08 15:59:50 INFO dfs.DFSClient: Waiting to find target node:
10.251.75.177:50010
08/09/08 15:59:56 WARN dfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2246)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)

08/09/08 15:59:56 WARN dfs.DFSClient: Error Recovery for block
blk_-8257572465338588575 bad datanode[0]*

Is there another parameter I could specify to force the address of my
datanode? I have been searching on the EC2 forums and documentation and
apparently there is no way I can use *dfs.datanode.dns.interface* or *
dfs.datanode.dns.nameserver* to specify the public IP of my instance.

Has anyone else managed to send/retrieve stuff from HDFS on an EC2 cluster
from an external machine?

Thanks

Julien


2008/9/5 Julien Nioche [EMAIL PROTECTED]

 Hi guys,

 I am using Hadoop on a EC2 cluster and am trying to send files onto the
 HDFS from an external machine. It works up to the point where I get this
 error message :
 *Waiting to find target node: 10.250.7.148:50010*

 I've seen a discussion about a similar issue on *
 http://thread.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/2446/focus=2449
 * but there are no details on how to fix the problem.

 Any idea about how I can set up my EC2 instances so that they return their
 public IPs and not the internal Amazon ones? Anything I can specify for the
 parameters *dfs.datanode.dns.interface* and *dfs.datanode.dns.nameserver*?


 What I am trying to do is to put my input to be processed onto the HDFS and
 retrieve the output from there. What I am not entirely sure of is whether I
 can launch my job from the external machine. Most people seem to SSH to the
 master to do that.

 Thanks

 Julien
 --
 DigitalPebble Ltd
 http://www.digitalpebble.com




-- 
DigitalPebble Ltd
http://www.digitalpebble.com


RE: HBase, Hive, Pig and other Hadoop based technologies

2008-09-08 Thread Jim Kellerman
Comments inline below:
 -Original Message-
 From: Naama Kraus [mailto:[EMAIL PROTECTED]
 Sent: Monday, September 08, 2008 1:00 AM
 To: hadoop-core; [EMAIL PROTECTED]
 Subject: Re: HBase, Hive, Pig and other Hadoop based technologies

 Both Pig and Hive are written on top of Hadoop, is that correct ?

Yes.

 Does this mean for instance that they would work with any implementation
 of the Hadoop File System interface ? Whether it is HDFS, KFS, LocalFS,
 or any future implementation that may be implemented ?

Yes.

 Is that true for HBase as well ?

Yes.

 I am assuming Pig and Hive use the Hadoop standard M/R framework and API, is
 that right ?

Yes.

 Thanks, Naama

 On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus [EMAIL PROTECTED] wrote:

  Hi,
 
  There are various technologies on top of Hadoop such as HBase, Hive, Pig
  and more. I was wondering what are the differences between them. What are
  the usage scenarios that fit each one of them.
 
  For instance, is it true to say that Pig and Hive belong to the same family
  ? Or is Hive more close to HBase ?
  My understanding is that HBase allows direct lookup and low latency
  queries, while Pig and Hive provide batch processing operations which are
  M/R based. Both define a data model and an SQL-like query language. Is this
  true ?
 
  Could anyone shed light on when to use each technology ? Main differences ?
  Pros and Cons ?
  Information on other technologies such as Jaql is also welcome.
 
  Thanks, Naama
 
  --
  oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
  00 oo 00 oo
  If you want your children to be intelligent, read them fairy tales. If you
  want them to be more intelligent, read them more fairy tales. (Albert
  Einstein)
 



 --
 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
 00 oo 00 oo
 If you want your children to be intelligent, read them fairy tales. If you
 want them to be more intelligent, read them more fairy tales. (Albert
 Einstein)


Re: public IP for datanode on EC2

2008-09-08 Thread Steve Loughran

Julien Nioche wrote:

I have tried using *slave.host.name* and give it the public address of my
data node. I can now see the node listed with its public address on the
dfshealth.jsp, however when I try to send a file to the HDFS from my
external server I still get :

*08/09/08 15:58:41 INFO dfs.DFSClient: Waiting to find target node:
10.251.75.177:50010
08/09/08 15:59:50 INFO dfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException
08/09/08 15:59:50 INFO dfs.DFSClient: Abandoning block
blk_-8257572465338588575
08/09/08 15:59:50 INFO dfs.DFSClient: Waiting to find target node:
10.251.75.177:50010
08/09/08 15:59:56 WARN dfs.DFSClient: DataStreamer Exception:
java.io.IOException: Unable to create new block.
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2246)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)

08/09/08 15:59:56 WARN dfs.DFSClient: Error Recovery for block
blk_-8257572465338588575 bad datanode[0]*

Is there another parameter I could specify to force the address of my
datanode? I have been searching on the EC2 forums and documentation and
apparently there is no way I can use *dfs.datanode.dns.interface* or *
dfs.datanode.dns.nameserver* to specify the public IP of my instance.

Has anyone else managed to send/retrieve stuff from HDFS on an EC2 cluster
from an external machine?


I think most people try to avoid allowing remote access for security 
reasons. If you can add a file, I can mount your filesystem too, maybe 
even delete things. Whereas with EC2-only filesystems, your files are 
*only* exposed to everyone else that knows or can scan for your IPAddr 
and ports.


Re: Getting started questions

2008-09-08 Thread John Howland
Dennis,

Thanks for the detailed response. I need to play with the SequenceFile
format a bit -- I found the documentation for it on the wiki. I think
I could build on top of the format to handle storage of very large
documents. The vast majority of documents will fit into RAM and in a
standard HDFS block (64MB, maybe up it to 128MB). For very large
documents, I can split them into consecutive records in the
SequenceFile. I can overload the key to be a combination of a real
key and a record number... Shouldn't be too hard to extend
SequenceFile to do this.
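
Something like the following composite key is what I have in mind (purely a
sketch; the field names and the old-style raw WritableComparable are my own
assumptions):

// Sketch: composite key (document id, chunk number), so a huge document can span
// several consecutive SequenceFile records while sorting keeps its chunks together.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class DocChunkKey implements WritableComparable {
    private Text docId = new Text();
    private int chunk;             // 0-based chunk index within the document

    public DocChunkKey() {}

    public DocChunkKey(String id, int chunk) {
        this.docId.set(id);
        this.chunk = chunk;
    }

    public void write(DataOutput out) throws IOException {
        docId.write(out);
        out.writeInt(chunk);
    }

    public void readFields(DataInput in) throws IOException {
        docId.readFields(in);
        chunk = in.readInt();
    }

    public int compareTo(Object o) {
        DocChunkKey other = (DocChunkKey) o;
        int cmp = docId.compareTo(other.docId);
        return cmp != 0 ? cmp : (chunk - other.chunk);
    }
}

I'd probably also add equals()/hashCode() and a partitioner that hashes only on
the document id, so all chunks of one document end up in the same partition.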

Much obliged,

John


Re: can i run multiple datanode in one pc?

2008-09-08 Thread Raghu Angadi

 2008-09-05 10:11:20,106 ERROR org.apache.hadoop.dfs.DataNode:
 java.net.BindException: Problem binding to /0.0.0.0:50010

Could you check if something is already listening on 50010? I strongly
encourage users to go through these logs when something does not work.

Regd running multiple datanodes :

Along with config directory, you should change the following env
variables : HADOOP_LOG_DIR and HADOOP_PID_DIR (I set both to the same
value).

The following variables should be different in hadoop-site.xml :
1. dfs.data.dir or hadoop.tmp.dir
2. dfs.datanode.address
3. dfs.datanode.http.address
4. dfs.datanode.ipc.address
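
Putting that together, starting the second datanode might look roughly like
this (directory names are placeholders):

export HADOOP_LOG_DIR=/var/log/hadoop-dn2
export HADOOP_PID_DIR=/var/run/hadoop-dn2
bin/hadoop-daemon.sh --config conf2 start datanode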

Raghu.

叶双明 wrote:
 In addition, I can't start a datanode on another computer using the command
 bin/hadoop-daemons.sh --config conf start datanode with the default config;
 the log message is:
 
 2008-09-05 10:11:19,208 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = testcenter/192.168.100.120
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.17.1
 STARTUP_MSG:   build =
 http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344;
 compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
 /
 2008-09-05 10:11:20,097 INFO org.apache.hadoop.dfs.DataNode: Registered
 FSDatasetStatusMBean
 2008-09-05 10:11:20,106 ERROR org.apache.hadoop.dfs.DataNode:
 java.net.BindException: Problem binding to /0.0.0.0:50010
 at org.apache.hadoop.ipc.Server.bind(Server.java:175)
 at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:264)
 at org.apache.hadoop.dfs.DataNode.init(DataNode.java:171)
 at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2765)
 at
 org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2720)
 at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2728)
 at org.apache.hadoop.dfs.DataNode.main(DataNode.java:2850)
 
 2008-09-05 10:11:20,107 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at testcenter/192.168.100.120
 /
 
 Maybe there is some wrong network config???
 
 2008/9/5 叶双明 [EMAIL PROTECTED]
 
 I can start first datanode by
 bin/hadoop-daemons.sh --config conf1 start datanode

 and copy conf1 to conf2, and modify the values of the following properties
 (port = port + 1):

 <property>
   <name>dfs.datanode.address</name>
   <value>0.0.0.0:50011</value>
   <description>
     The address where the datanode server will listen to.
     If the port is 0 then the server will start on a free port.
   </description>
 </property>

 <property>
   <name>dfs.datanode.http.address</name>
   <value>0.0.0.0:50076</value>
   <description>
     The datanode http server address and port.
     If the port is 0 then the server will start on a free port.
   </description>
 </property>

 <property>
   <name>dfs.http.address</name>
   <value>0.0.0.0:50071</value>
   <description>
     The address and the base port where the dfs namenode web ui will listen on.
     If the port is 0 then the server will start on a free port.
   </description>
 </property>

 <property>
   <name>dfs.datanode.https.address</name>
   <value>0.0.0.0:50476</value>
 </property>

 And run bin/hadoop-daemons.sh --config conf2 start datanode
 but get message: datanode running as process 4137. Stop it first.

 what other properties could I modify between the two config file.

 Thanks.

 2008/9/5 lohit [EMAIL PROTECTED]

 Good way is to have different 'conf' dirs
 So, you would end up with dir conf1 and conf2
 and startup of datanode would be
 ./bin/hadoop-daemons.sh --config conf1 start datanode
 ./bin/hadoop-daemons.sh --config conf2 start datanode


 make sure you have different hadoop-site.xml in conf1 and conf2 dirs.

 -Lohit

 - Original Message 
 From: 叶双明 [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Thursday, September 4, 2008 12:01:48 AM
 Subject: Re: can i run multiple datanode in one pc?

 Thanks lohit.

 I tried to start the datanode with the command bin/hadoop datanode -conf
 conf/hadoop-site.xml, but it doesn't work; the command bin/hadoop datanode
 does work.

 Something wrong have I done?



 bin/hadoop datanode -conf conf/hadoop-site.xml


 2008/9/4 lohit [EMAIL PROTECTED]

 Yes, each datanode should point to different config.
 So, if you have conf/hadoop-site.xml make another conf2/hadoop-site.xml
 with ports for datanode specific stuff and you should be able to start
 multiple datanodes on same node.
 -Lohit



 - Original Message 
 From: 叶双明 [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Wednesday, September 3, 2008 8:19:59 PM
 Subject: can i run multiple datanode in one pc?

 I know that normally one datanode runs on one computer.
 I am wondering: can I run multiple datanodes on one PC?






Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-08 Thread Raghu Angadi


There is certainly value in it. But it cannot be done with the data the
datanodes currently have.


The usual way to protect NameNode metadata in Hadoop is to write the 
metadata to two different locations.


Raghu.

叶双明 wrote:

Hi all.

We all know the importance of the NameNode for a cluster: when the NameNode,
which holds all the metadata of the DFS, breaks down, the whole cluster's
data is gone.

So, if we could retrieve DFS metadata from the datanodes, it would add
great robustness to Hadoop.

Does anyone think it's valuable?





Re: Hadoop + Elastic Block Stores

2008-09-08 Thread Doug Cutting

Ryan LeCompte wrote:
I'd really love to one day 
see some scripts under src/contrib/ec2/bin that can setup/mount the EBS 
volumes automatically. :-)


The fastest way might be to write & contribute such scripts!

Doug


Simple Survey

2008-09-08 Thread Chris K Wensel

Hey all

Scale Unlimited is putting together some case studies for an upcoming  
class and wants to get a snapshot of what the Hadoop user community  
looks like.


If you have 2 minutes, please feel free to take the short anonymous  
survey below:


http://www.scaleunlimited.com/survey.html

All results will be public.

cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: About the name of second-namenode!!

2008-09-08 Thread Doug Cutting
Changing it will unfortunately cause confusion too.  Sigh.  This is why 
we should take time to name things well the first time.


Doug

叶双明 wrote:

Because the name of the second-namenode is causing so much confusion, would
the Hadoop team consider changing it?



Re: is SecondaryNameNode in support for the NameNode?

2008-09-08 Thread Konstantin Shvachko

NameNodeFailover http://wiki.apache.org/hadoop/NameNodeFailover, with a
SecondaryNameNode http://wiki.apache.org/hadoop/SecondaryNameNode hosted

I think it is wrong, please correct it.


You probably look at some cached results. Both pages do not exist.
The first one was a cause of confusion and was removed.

Regards,
--Konstantin


2008/9/6, Jean-Daniel Cryans [EMAIL PROTECTED]:

Hi,

See http://wiki.apache.org/hadoop/FAQ#7 and


http://hadoop.apache.org/core/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode

Regards,

J-D

On Sat, Sep 6, 2008 at 5:26 AM, ??? [EMAIL PROTECTED] wrote:


Hi all!

The NameNode is a Single Point of Failure for the HDFS Cluster. There
is support for NameNodeFailover, with a SecondaryNameNode hosted on a
separate machine being able to stand in for the original NameNode if
it goes down.

Is it right? is SecondaryNameNode in support  for the NameNode?

Sorry for my englist!!
?






Re: specifying number of nodes for job

2008-09-08 Thread hmarti2
yeah. snickerdoodle. really.


 I see.. so if I have a cluster with n nodes, there is no way for me to
 have
 it spawn on just 2 of those nodes, or just one of those nodes? And
 furthermore, there is no way for me to have it spawn on just a subset of
 the
 processors? Or am I misunderstanding?

 Also, when you say specify the number of tasks for each node are you
 referring to specifying the number of mappers and reducers I can spawn on
 each node?

 -SM

 On Sun, Sep 7, 2008 at 8:29 PM, Mafish Liu [EMAIL PROTECTED] wrote:

 On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote:

  Hi,
 
  This may be a silly question, but I'm strangely having trouble finding
 an
  answer for it (perhaps I'm looking in the wrong places?).
 
  Suppose I have a cluster with n nodes each with m processors.
 
  I wish to test the performance of, say,  the wordcount program on k
  processors, where k is varied from k = 1 ... nm.


 You can specify the number of tasks for each node in your hadoop-site.xml
 file.
 So you can get k varied as k = n, 2*n, ..., m*n instead of k = 1 ... n*m.
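
 Concretely, the per-node limits are set with something like the following
 (property names as in recent hadoop-default.xml, values illustrative; note
 these cap tasks per node rather than selecting which nodes run):

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>2</value>
 </property>
 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>2</value>
 </property>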


  How would I do this? I'm having trouble finding the proper command
 line
  option in the commands manual (
  http://hadoop.apache.org/core/docs/current/commands_manual.html)
 
 
 
  Thank you very much for you time.
 
  -SM
 



 --
 [EMAIL PROTECTED]
 Institute of Computing Technology, Chinese Academy of Sciences, Beijing.






Re: distcp failing

2008-09-08 Thread Aaron Kimball
It is likely that your mapred.system.dir and/or fs.default.name settings are
incorrect on the non-datanode machine that you are launching the task from.
These two settings (in your conf/hadoop-site.xml file) must match the
settings on the cluster itself.
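
That is, the client's conf/hadoop-site.xml should carry the cluster's real
values; roughly (the host names and the system directory below are placeholders
that must be replaced with your cluster's actual settings):

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:9001</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
</property>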

- Aaron

On Sun, Sep 7, 2008 at 8:58 PM, Michael Di Domenico
[EMAIL PROTECTED]wrote:

 I'm attempting to load data into Hadoop (version 0.17.1) from a
 non-datanode machine in the cluster.  I can run jobs and copyFromLocal works
 fine, but when I try to use distcp I get the error below.  I don't understand
 the error; can anyone help?
 Thanks

 blue:hadoop-0.17.1 mdidomenico$ time bin/hadoop distcp -overwrite
 file:///Users/mdidomenico/hadoop/1gTestfile /user/mdidomenico/1gTestfile
 08/09/07 23:56:06 INFO util.CopyFiles:
 srcPaths=[file:/Users/mdidomenico/hadoop/1gTestfile]
 08/09/07 23:56:06 INFO util.CopyFiles:
 destPath=/user/mdidomenico/1gTestfile1
 08/09/07 23:56:07 INFO util.CopyFiles: srcCount=1
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
 /tmp/hadoop-hadoop/mapred/system/job_200809072254_0005/job.xml: No such
 file
 or directory
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
at
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:175)
at
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at $Proxy1.submitJob(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:604)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)



Re: specifying number of nodes for job

2008-09-08 Thread Sandy
-smiles- It's not nice to poke fun at people's e-mail aliases... and
snickerdoodles are delicious cookies.

In all seriousness though, why is this not possible? Is there something
about the MapReduce model of parallel computation that I am not
understanding? Or is this more of an arbitrary implementation choice made by
the Hadoop framework? If so, I am curious why this is the case. What are the
benefits?

What I'm talking about is not uncommon for scalability studies. Is being
able to specify the number of processors considered a desirable feature by
developers?

Just curious, of course.

Regards,

-SM

On Mon, Sep 8, 2008 at 3:36 PM, [EMAIL PROTECTED] wrote:

 yeah. snickerdoodle. really.


  I see.. so if I have a cluster with n nodes, there is no way for me to
  have
  it spawn on just 2 of those nodes, or just one of those nodes? And
  furthermore, there is no way for me to have it spawn on just a subset of
  the
  processors? Or am I misunderstanding?
 
  Also, when you say specify the number of tasks for each node are you
  referring to specifying the number of mappers and reducers I can spawn on
  each node?
 
  -SM
 
  On Sun, Sep 7, 2008 at 8:29 PM, Mafish Liu [EMAIL PROTECTED] wrote:
 
  On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED]
 wrote:
 
   Hi,
  
   This may be a silly question, but I'm strangely having trouble finding
  an
   answer for it (perhaps I'm looking in the wrong places?).
  
   Suppose I have a cluster with n nodes each with m processors.
  
   I wish to test the performance of, say,  the wordcount program on k
   processors, where k is varied from k = 1 ... nm.
 
 
  You can specify the number of tasks for each node in your
  hadoop-site.xml file.
  So you can get k varied as k = n, 2*n, ..., m*n instead of k = 1 ... n*m.
 
 
   How would I do this? I'm having trouble finding the proper command
  line
   option in the commands manual (
   http://hadoop.apache.org/core/docs/current/commands_manual.html)
  
  
  
   Thank you very much for you time.
  
   -SM
  
 
 
 
  --
  [EMAIL PROTECTED]
  Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
 
 





Monthly Hadoop User Group Meeting (Bay Area)

2008-09-08 Thread Ajay Anand
The next Hadoop User Group (Bay Area) meeting is scheduled for
Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa
Clara, CA, Building 2, Training Rooms 34.

 

Agenda:

Cloud Computing Testbed - Thomas Sandholm, HP
Katta on Hadoop - Stefan Groschupf

 

Registration and directions: http://upcoming.yahoo.com/event/1075456/

 

Look forward to seeing you there!

Ajay

 



Re: About the name of second-namenode!!

2008-09-08 Thread 叶双明
Hoping for a good way to solve it!

2008/9/9 Doug Cutting [EMAIL PROTECTED]

 Changing it will unfortunately cause confusion too.  Sigh.  This is why we
 should take time to name things well the first time.

 Doug


 叶双明 wrote:

 Because the name of the second-namenode is causing so much confusion, would
 the Hadoop team consider changing it?




-- 
Sorry for my english!! 明


Re: is SecondaryNameNode in support for the NameNode?

2008-09-08 Thread 叶双明
Thanks for all !

2008/9/9 Konstantin Shvachko [EMAIL PROTECTED]

 NameNodeFailover http://wiki.apache.org/hadoop/NameNodeFailover, with a
 SecondaryNameNode http://wiki.apache.org/hadoop/SecondaryNameNode
 hosted

 I think it is wrong, please correct it.


 You probably look at some cached results. Both pages do not exist.
 The first one was a cause of confusion and was removed.

 Regards,
 --Konstantin

  2008/9/6, Jean-Daniel Cryans [EMAIL PROTECTED]:

 Hi,

 See http://wiki.apache.org/hadoop/FAQ#7 and


 http://hadoop.apache.org/core/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode

 Regards,

 J-D

 On Sat, Sep 6, 2008 at 5:26 AM, ??? [EMAIL PROTECTED] wrote:

  Hi all!

 The NameNode is a Single Point of Failure for the HDFS Cluster. There
 is support for NameNodeFailover, with a SecondaryNameNode hosted on a
 separate machine being able to stand in for the original NameNode if
 it goes down.

 Is it right? is SecondaryNameNode in support  for the NameNode?

 Sorry for my englist!!
 ?






-- 
Sorry for my english!! 明


Re: Monthly Hadoop User Group Meeting (Bay Area)

2008-09-08 Thread Chris K Wensel


doh, conveniently collides with the GridGain and GridDynamics  
presentations:


http://web.meetup.com/66/calendar/8561664/

On Sep 8, 2008, at 4:55 PM, Ajay Anand wrote:


The next Hadoop User Group (Bay Area) meeting is scheduled for
Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa
Clara, CA, Building 2, Training Rooms 34.



Agenda:

Cloud Computing Testbed - Thomas Sandholm, HP
Katta on Hadoop - Stefan Groschupf



Registration and directions: http://upcoming.yahoo.com/event/1075456/



Look forward to seeing you there!

Ajay





--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-08 Thread 叶双明
I am thinking that datanodes could carry additional information about
which file each block belongs to, the block's sequence number within the
file, and how many blocks the file has.

Does that make sense?

2008/9/9 Raghu Angadi [EMAIL PROTECTED]


 There is certainly value in it. But it can not be done with the data
 datanodes currently have.

 The usual way to protect NameNode metadata in Hadoop is to write the
 metadata to two different locations.

 Raghu.


 叶双明 wrote:

 Hi all.

 We all know the importance of the NameNode for a cluster: when the NameNode,
 which holds all the metadata of the DFS, breaks down, the whole cluster's
 data is gone.

 So, if we could retrieve DFS metadata from the datanodes, it would add
 great robustness to Hadoop.

 Does anyone think it's valuable?





-- 
Sorry for my english!! 明


Re: Logging best practices?

2008-09-08 Thread Amareshwari Sriramadasu

Per Jacobsson wrote:

Hi all.
I've got a beginner question: Are there any best practices for how to do
logging from a task? Essentially I want to log warning messages under
certain conditions in my map and reduce tasks, and be able to review them
later.

  
stdout, stderr and the commons-logging logs from the task are stored in the
userlogs directory, i.e. ${hadoop.log.dir}/userlogs/<taskid>.
They are also available on the web UI.

Is good old commons-logging using the TaskLogAppender the best way to solve
this? 

I think using commons-logging is good.
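
For instance, a mapper can simply grab a commons-logging Log; whatever it warns
ends up in that task's logs under userlogs and in the web UI. A sketch (the
class and the warning condition are just an example):

// Sketch: commons-logging inside a map task; output lands in the task's userlogs.
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WarnOnEmptyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private static final Log LOG = LogFactory.getLog(WarnOnEmptyMapper.class);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        if (value.getLength() == 0) {
            LOG.warn("Empty record at offset " + key);  // visible in the task logs / web UI
        }
        output.collect(new Text("offset-" + key), value);
    }
}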

I assume I'd have to configure it to log to StdErr to be able to see
the log messages in the jobtracker webapp. The Reporter would be useful to
track statistics but not for something like this. And the JobHistory class
and history logs are intended for internal use only?

  
JobHistory class is for internal use only. But the history logs can be 
viewed from the web UI and HistoryViewer.

Thanks a lot,
Per

  


Thanks
Amareshwari


Re: Failing MR jobs!

2008-09-08 Thread Shengkai Zhu
Do you have some more detailed information? Logs are helpful.

On Mon, Sep 8, 2008 at 3:26 AM, Erik Holstad [EMAIL PROTECTED] wrote:

 Hi!
 I'm trying to run an MR job, but it keeps failing and I can't understand why.
 Sometimes it shows output at 66% and sometimes at 98% or so.
 I had a couple of exceptions earlier that I didn't catch, which made the job
 fail.


 The log file from the task can be found at:
 http://pastebin.com/m4414d369


 and the code looks like:
 //Java
 import java.io.*;
 import java.util.*;
 import java.net.*;

 //Hadoop
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapred.*;
 import org.apache.hadoop.util.*;

 //HBase
 import org.apache.hadoop.hbase.*;
 import org.apache.hadoop.hbase.mapred.*;
 import org.apache.hadoop.hbase.io.*;
 import org.apache.hadoop.hbase.client.*;
 // org.apache.hadoop.hbase.client.HTable

 //Extra
 import org.apache.commons.cli.ParseException;

 import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
 import org.apache.commons.httpclient.*;
 import org.apache.commons.httpclient.methods.*;
 import org.apache.commons.httpclient.params.HttpMethodParams;


 public class SerpentMR1 extends TableMap implements Mapper, Tool {

//Setting DebugLevel
private static final int DL = 0;

//Setting up the variables for the MR job
    private static final String NAME = "SerpentMR1";
    private static final String INPUTTABLE = "sources";
    private final String[] COLS = {"content:feedurl", "content:ttl",
  "content:updated"};


private Configuration conf;

public JobConf createSubmittableJob(String[] args) throws IOException{
JobConf c = new JobConf(getConf(), SerpentMR1.class);
        String jar = "/home/hbase/SerpentMR/" + NAME + ".jar";
c.setJar(jar);
c.setJobName(NAME);

int mapTasks = 4;
int reduceTasks = 20;

c.setNumMapTasks(mapTasks);
c.setNumReduceTasks(reduceTasks);

        String inputCols = "";
        for (int i = 0; i < COLS.length; i++) { inputCols += COLS[i] + " "; }

TableMap.initJob(INPUTTABLE, inputCols, this.getClass(), Text.class,
 BytesWritable.class, c);
//Classes between:

c.setOutputFormat(TextOutputFormat.class);
        Path path = new Path("users"); // inserting into a temp table
FileOutputFormat.setOutputPath(c, path);

c.setReducerClass(MyReducer.class);
return c;
}

public void map(ImmutableBytesWritable key, RowResult res,
 OutputCollector output, Reporter reporter)
throws IOException {
Cell cellLast= res.get(COLS[2].getBytes());//lastupdate

long oldTime = cellLast.getTimestamp();

Cell cell_ttl= res.get(COLS[1].getBytes());//ttl
long ttl = StreamyUtil.BytesToLong(cell_ttl.getValue() );
byte[] url = null;

long currTime = time.GetTimeInMillis();

        if (currTime - oldTime > ttl) {
url = res.get(COLS[0].getBytes()).getValue();//url
output.collect(new Text(Base64.encode_strip(res.getRow())), new
 BytesWritable(url) );/
}
}



public static class MyReducer implements Reducer{
 //org.apache.hadoop.mapred.Reducer{


private int timeout = 1000; //Sets the connection timeout time ms;

public void reduce(Object key, Iterator values, OutputCollector
 output, Reporter rep)
throws IOException {
HttpClient client = new HttpClient();//new
 MultiThreadedHttpConnectionManager());
client.getHttpConnectionManager().
getParams().setConnectionTimeout(timeout);

GetMethod method = null;

int stat = 0;
String content = ;
byte[] colFam = select.getBytes();
byte[] column = lastupdate.getBytes();
byte[] currTime = null;

HBaseRef hbref = new HBaseRef();
JerlType sendjerl = null; //new JerlType();
ArrayList jd = new ArrayList();

InputStream is = null;

while(values.hasNext()){
BytesWritable bw = (BytesWritable)values.next();

String address = new String(bw.get());
try{
System.out.println(address);

method = new GetMethod(address);
method.setFollowRedirects(true);

} catch (Exception e){
System.err.println(Invalid Address);
e.printStackTrace();
}

if (method != null){
try {
// Execute the method.
stat = client.executeMethod(method);

if(stat == 200){
content = ;
is =
 (InputStream)(method.getResponseBodyAsStream());

//Write to HBase new stamp 
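
 One thing that jumps out from the code, independent of whatever the pastebin
 log says: the reducer does blocking HTTP fetches without a read timeout,
 never releases the connections back to the manager, and only reports
 progress when it collects output, which in the worst case can leave a task
 silent long enough to be killed. Below is a sketch of how the fetch loop
 might be hardened; it reuses the names from the code above (client, values,
 rep, timeout), and the releaseConnection()/progress() handling is only a
 guess at what may help, not a confirmed fix for this job:

 while (values.hasNext()) {
     BytesWritable bw = (BytesWritable) values.next();
     // Note: bw.get() returns the whole backing buffer, which can be
     // longer than the actual data, so the address may need trimming.
     String address = new String(bw.get());

     GetMethod method = null;
     try {
         method = new GetMethod(address);
         method.setFollowRedirects(true);
         method.getParams().setSoTimeout(timeout); // read timeout, not just connect

         int status = client.executeMethod(method);
         if (status == 200) {
             InputStream is = method.getResponseBodyAsStream();
             // ... consume the stream and write to HBase as before ...
         }
     } catch (Exception e) {
         // A single bad URL or slow server should not fail the whole task.
         System.err.println("Fetch failed for " + address + ": " + e);
     } finally {
         if (method != null) {
             method.releaseConnection(); // hand the connection back
         }
         rep.progress(); // tell the framework the task is still alive
     }
 }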

Re: Failing MR jobs!

2008-09-08 Thread Shengkai Zhu
Sorry, I didn't see the log link.

On Tue, Sep 9, 2008 at 12:01 PM, Shengkai Zhu [EMAIL PROTECTED] wrote:


 Do you have some more detailed information? Logs are helpful.


Issue in reduce phase with SortedMapWritable and custom Writables as values

2008-09-08 Thread Ryan LeCompte
Hello,

I'm attempting to use a SortedMapWritable with a LongWritable as the
key and a custom implementation of org.apache.hadoop.io.Writable as
the value. I notice that my program works fine when I use one of the
predefined Writable wrappers (e.g. Text) as the value, but fails with the
following exception when I use my custom Writable:

2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask:
Initiating in-memory merge with 1 segments...
2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging
1 sorted segments
2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 1 segments left of total size: 5492 bytes
2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw
an exception: java.io.IOException: Intermedate merge failed
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
        at org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
...

I noticed that the AbstractMapWritable class has a protected
addToMap(Class clazz) method. Do I somehow need to let my
SortedMapWritable instance know about my custom Writable value? I've
properly implemented the custom Writable object (it just contains a
few primitives, like longs and ints).
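
For reference, here is a simplified sketch of the shape of the value class
(names are placeholders; the real class just holds a few longs and ints,
with a public no-arg constructor and symmetric write()/readFields()):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyValue implements Writable {

    private long count;
    private int code;

    // Public no-arg constructor: the framework re-creates values
    // reflectively (ReflectionUtils.newInstance) during readFields.
    public MyValue() {
    }

    public MyValue(long count, int code) {
        this.count = count;
        this.code = code;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeInt(code);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        code = in.readInt();
    }
}

I'm populating the SortedMapWritable with map.put(new LongWritable(...),
new MyValue(...)), which as far as I can see is what triggers the protected
addToMap() internally.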

Any insight is appreciated.

Thanks,
Ryan