Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Otis Gospodnetic

This is how things get into the Apache Incubator: http://incubator.apache.org/
But the rules are, I believe, that you can skip the Incubator and go straight 
under an existing project's wing (e.g. Hadoop) if that project's PMC approves.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Tarandeep Singh 
> To: core-user@hadoop.apache.org
> Sent: Wednesday, April 15, 2009 1:08:38 PM
> Subject: How to submit a project to Hadoop/Apache
> 
> Hi,
> 
> Can anyone point me to documentation which explains how to submit a
> project to Hadoop as a subproject? Also, I would appreciate it if someone points
> me to the documentation on how to submit a project as an Apache project.
> 
> We have a project that is built on Hadoop. It is released to the open source
> community under the GPL license, but we are thinking of submitting it as a Hadoop
> or Apache project. Any help on how to do this is appreciated.
> 
> Thanks,
> Tarandeep



Re: Job Tracker/Name Node redundancy

2009-01-09 Thread Otis Gospodnetic
Yes, there is a JIRA issue for a redundant JobTracker already.
The NN redundancy scenario is mentioned on the Wiki (look for 
SecondaryNameNode).

Otis

 --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Ryan LeCompte 
> To: "core-user@hadoop.apache.org" 
> Sent: Friday, January 9, 2009 3:09:13 PM
> Subject: Job Tracker/Name Node redundancy
> 
> Are there any plans to build redundancy/failover support for the Job
> Tracker and Name Node components in Hadoop? Let's take the current
> scenario:
> 
> 1) A data/cpu intensive job is submitted to a Hadoop cluster of 10 machines.
> 2) Half-way through the job execution, the Job Tracker or Name Node fails.
> 3) We bring up a new Job Tracker or Name Node manually.
> 
> -- Will the individual task trackers / data nodes "reconnect" to the
> new masters? Or will the job have to be resubmitted? If we had
> failover support, we could set up essentially 3 Job Tracker masters and
> 3 NameNode masters so that if one dies another would gracefully take
> over and start handling results from the child nodes.
> 
> Thanks!
> 
> Ryan



Re: Hadoop Development Status

2008-11-20 Thread Otis Gospodnetic
Question for Alex:

Are you going to be releasing this tool?  I'm sure my friends over at 
Lucene/Solr/Nutch/etc. would love to see their project's info presented in the 
same fashion. :)


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: Konstantin Shvachko <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, November 20, 2008 1:41:20 PM
Subject: Re: Hadoop Development Status

This is very nice.
A suggestion, if it is related to the development status:
do you think you guys could analyze which questions are
discussed most often on the mailing lists, so that we could
update our FAQs based on that?
Thanks,
--Konstantin


Alex Loddengaard wrote:
> Some engineers here at Cloudera have been working on a website to report on
> Hadoop development status, and we're happy to announce that the website is
> now available!  We've written a blog post describing its usefulness, goals,
> and future, so take a look if you're interested:
> 
> <http://www.cloudera.com/blog/2008/11/18/introducing-hadoop-development-status/>
> The tool is hosted here:
> 
> 
> 
> Please give us any feedback or suggestions off-list, to avoid polluting the
> list.
> 
> Enjoy!
> 
> Alex, Jeff, and Tom
> 


Re: SecondaryNameNode on separate machine

2008-10-30 Thread Otis Gospodnetic
Konstantin & Co, please correct me if I'm wrong, but looking at 
hadoop-default.xml makes me think that dfs.http.address is only the URL for the 
NN *Web UI*.  In other words, this is where people go to look at the NN.

The secondary NN must then be using only the Primary NN URL specified in 
fs.default.name.  This URL looks like hdfs://name-node-hostname-here/.  
Something in Hadoop then knows the exact port for the Primary NN based on the 
URI scheme (e.g. "hdfs://") in this URL.

Is this correct?
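
For readers hitting the same question later: a minimal sketch of how the two 
settings usually look in hadoop-site.xml (the hostname and ports below are made 
up; 50070 is just the common default for the NN web/HTTP interface):

<property>
  <name>fs.default.name</name>
  <!-- filesystem URI; clients (and the secondary NN) reach the NN's RPC port through this -->
  <value>hdfs://name-node-hostname-here:9000</value>
</property>
<property>
  <name>dfs.http.address</name>
  <!-- NN web UI / HTTP address; per Konstantin below, the secondary NN pulls fsimage+edits over http -->
  <value>name-node-hostname-here:50070</value>
</property>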


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Tomislav Poljak <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, October 30, 2008 1:52:18 PM
> Subject: Re: SecondaryNameNode on separate machine
> 
> Hi,
> can you please explain the difference between fs.default.name and
> dfs.http.address (like how and when the SecondaryNameNode uses
> fs.default.name, and how/when dfs.http.address)? I have set them both to the
> same (namenode's) hostname:port. Is this correct (or does dfs.http.address
> need some other port)? 
> 
> Thanks,
> 
> Tomislav
> 
> On Wed, 2008-10-29 at 16:10 -0700, Konstantin Shvachko wrote:
> > SecondaryNameNode uses the http protocol to transfer the image and the edits
> > from the primary name-node and vice versa.
> > So the secondary does not access local files on the primary directly.
> > The primary NN should know the secondary's http address.
> > And the secondary NN needs to know both fs.default.name and dfs.http.address
> > of the primary.
> > 
> > In general we usually create one configuration file hadoop-site.xml
> > and copy it to all other machines. So you don't need to set up different
> > values for all servers.
> > 
> > Regards,
> > --Konstantin
> > 
> > Tomislav Poljak wrote:
> > > Hi,
> > > I'm not clear on how the SecondaryNameNode communicates with the NameNode
> > > (if deployed on a separate machine). Does the SecondaryNameNode use a direct
> > > connection (over some port and protocol), or is it enough for
> > > SecondaryNameNode to have access to data which NameNode writes locally
> > > on disk?
> > > 
> > > Tomislav
> > > 
> > > On Wed, 2008-10-29 at 09:08 -0400, Jean-Daniel Cryans wrote:
> > >> I think a lot of the confusion comes from this thread :
> > >> http://www.nabble.com/NameNode-failover-procedure-td11711842.html
> > >>
> > >> Particularly because the wiki was updated with wrong information, not
> > >> maliciously I'm sure. This information is now gone for good.
> > >>
> > >> Otis, your solution is pretty much like the one given by Dhruba Borthakur
> > >> and augmented by Konstantin Shvachko later in the thread but I never did 
> > >> it
> > >> myself.
> > >>
> > >> One thing should be clear though, the NN is and will remain a SPOF (just
> > >> like HBase's Master) as long as a distributed manager service (like
> > >> Zookeeper) is not plugged into Hadoop to help with failover.
> > >>
> > >> J-D
> > >>
> > >> On Wed, Oct 29, 2008 at 2:12 AM, Otis Gospodnetic <
> > >> [EMAIL PROTECTED]> wrote:
> > >>
> > >>> Hi,
> > >>> So what is the "recipe" for avoiding NN SPOF using only what comes with
> > >>> Hadoop?
> > >>>
> > >>> From what I can tell, I think one has to do the following two things:
> > >>>
> > >>> 1) configure primary NN to save namespace and xa logs to multiple dirs, one
> > >>> of which is actually on a remotely mounted disk, so that the data 
> > >>> actually
> > >>> lives on a separate disk on a separate box.  This saves namespace and xa
> > >>> logs on multiple boxes in case of primary NN hardware failure.
> > >>>
> > >>> 2) configure secondary NN to periodically merge fsimage+edits and create
> > >>> the fsimage checkpoint.  This really is a second NN process running on
> > >>> another box.  It sounds like this secondary NN has to somehow have 
> > >>> access to
> > >>> fsimage & edits files from the primary NN server.
> > >>> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> > >>> does not describe the best practice around that - the recommended way to
> > >>> give secondary NN access to primary NN

Re: Integration with compute cluster

2008-10-29 Thread Otis Gospodnetic
Hi,

You want to store your logs in HDFS (by copying them from your production 
machines, presumably) and then write custom MapReduce jobs that know how to 
process and correlate the data in the logs, and output data in some format that suits 
you.  What you do with that output is then up to you.
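
As a rough sketch of the mechanics (the jar, class and path names below are 
made up for illustration):

bin/hadoop dfs -mkdir /logs/2008-10-29                            # target dir in HDFS
bin/hadoop dfs -put /var/log/httpd/access.log /logs/2008-10-29/   # copy a log off the prod box
bin/hadoop jar log-correlate.jar com.example.LogCorrelator \      # your custom MapReduce job
    /logs/2008-10-29 /reports/2008-10-29
bin/hadoop dfs -get /reports/2008-10-29 /tmp/reports/             # pull the output back for analysis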


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: shahab mehmandoust <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, October 29, 2008 7:29:35 PM
> Subject: Integration with compute cluster
> 
> Hi,
> 
> We have one prod server with web logs and a db server.  We want to correlate
> the data in the logs and the db.  With a hadoop implementation (for scaling
> up later), do we need to transfer the data to a machine (designated as the
> compute cluster: http://hadoop.apache.org/core/images/architecture.gif), run
> map/reduce there, and then transfer the output elsewhere for our analysis?
> 
> I'm confused about the compute cluster; does it encompass the data sources
> (here the prod server and the db)?
> 
> Thanks,
> Shahab



Re: SecondaryNameNode on separate machine

2008-10-28 Thread Otis Gospodnetic
Hi,
So what is the "recipe" for avoiding NN SPOF using only what comes with Hadoop?

From what I can tell, I think one has to do the following two things:

1) configure primary NN to save namespace and xa logs to multiple dirs, one of 
which is actually on a remotely mounted disk, so that the data actually lives 
on a separate disk on a separate box.  This saves namespace and xa logs on 
multiple boxes in case of primary NN hardware failure.

2) configure secondary NN to periodically merge fsimage+edits and create the 
fsimage checkpoint.  This really is a second NN process running on another box. 
 It sounds like this secondary NN has to somehow have access to fsimage & edits 
files from the primary NN server.  
http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
 does not describe the best practice around that - the recommended way to give 
secondary NN access to primary NN's fsimage and edits files.  Should one mount 
a disk from the primary NN box to the secondary NN box to get access to those 
files?  Or is there a simpler way?
In any case, this checkpoint is just a merge of fsimage+edits files and again 
is there in case the box with the primary NN dies.  That's what's described on 
http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
 more or less.

Is this sufficient, or are there other things one has to do to eliminate NN 
SPOF?
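
For concreteness, a sketch of the hadoop-site.xml entries this recipe implies 
(directory paths are made up; the nfs path stands in for the remotely mounted disk):

On the primary NN:
<property>
  <name>dfs.name.dir</name>
  <!-- namespace image + edits get written to every listed dir -->
  <value>/mnt/hadoop/name,/mnt/nfs-backup/hadoop/name</value>
</property>

On the secondary NN:
<property>
  <name>fs.checkpoint.dir</name>
  <!-- where the merged fsimage checkpoint is kept -->
  <value>/mnt/hadoop/namesecondary</value>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <!-- merge fsimage+edits roughly every hour -->
  <value>3600</value>
</property>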


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Jean-Daniel Cryans <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Tuesday, October 28, 2008 8:14:44 PM
> Subject: Re: SecondaryNameNode on separate machine
> 
> Tomislav.
> 
> Contrary to popular belief the secondary namenode does not provide failover,
> it's only used to do what is described here :
> http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode
> 
> So the term "secondary" does not mean "a second one" but is more like "a
> second part of".
> 
> J-D
> 
> On Tue, Oct 28, 2008 at 9:44 AM, Tomislav Poljak wrote:
> 
> > Hi,
> > I'm trying to implement NameNode failover (or at least NameNode local
> > data backup), but it is hard since there is no official documentation.
> > Pages on this subject are created, but still empty:
> >
> > http://wiki.apache.org/hadoop/NameNodeFailover
> > http://wiki.apache.org/hadoop/SecondaryNameNode
> >
> > I have been browsing the web and hadoop mailing list to see how this
> > should be implemented, but I got even more confused. People are asking
> > do we even need SecondaryNameNode etc. (since NameNode can write local
> > data to multiple locations, so one of those locations can be a mounted
> > disk from other machine). I think I understand the motivation for
> > SecondaryNameNode (to create a snapshot of NameNode data every n
> > seconds/hours), but setting up (deploying and running) the SecondaryNameNode on a
> > different machine than the NameNode is not as trivial as I expected. First I
> > found that if I need to run the SecondaryNameNode on a machine other than the
> > NameNode, I should change the masters file on the NameNode (change localhost to
> > SecondaryNameNode host) and set some properties in hadoop-site.xml on
> > SecondaryNameNode (fs.default.name, fs.checkpoint.dir,
> > fs.checkpoint.period etc.)
> >
> > This was enough to start SecondaryNameNode when starting NameNode with
> > bin/start-dfs.sh , but it didn't create image on SecondaryNameNode. Then
> > I found that I need to set dfs.http.address on NameNode address (so now
> > I have NameNode address in both fs.default.name and dfs.http.address).
> >
> > Now I get following exception:
> >
> > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary - Exception in
> > doCheckpoint:
> > 2008-10-28 09:18:00,098 ERROR NameNode.Secondary -
> > java.net.SocketException: Unexpected end of file from server
> >
> > My questions are following:
> > How to resolve this problem (this exception)?
> > Do I need additional property in SecondaryNameNode's hadoop-site.xml or
> > NameNode's hadoop-site.xml?
> >
> > How should NameNode failover work ideally? Is it like this:
> >
> > SecondaryNameNode runs on separate machine than NameNode and stores
> > NameNode's data (fsimage and edits) locally in fs.checkpoint.dir.
> > When NameNode machine crashes, we start NameNode on machine where
> > SecondaryNameNode was running and we set dfs.name.dir to
> > fs.checkpoint.dir. Also we need to change how DNS resolves NameNode
> > hostname (change from the primary to the secondary).
> >
> > Is this correct ?
> >
> > Tomislav
> >
> >
> >



Re: HDFS Vs KFS

2008-08-21 Thread Otis Gospodnetic
Isn't there FUSE for HDFS, as well as the WebDAV option?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Tim Wintle <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, August 21, 2008 1:42:51 PM
> Subject: Re: HDFS Vs KFS
> 
> I haven't used KFS, but I believe a major difference is that you can
> (apparently) mount KFS as a standard device under Linux, allowing you to
> read and write directly to it without having to re-compile the
> application (as far as I know that's not possible with HDFS, although
> the last time I installed HDFS was 0.16)
> 
> ... It is definitely much newer.
> 
> 
> On Fri, 2008-08-22 at 01:35 +0800, rae l wrote:
> > On Fri, Aug 22, 2008 at 12:34 AM, Wasim Bari wrote:
> > >
> > > KFS is also another Distributed file system implemented in C++. Here you 
> > > can
> > > get details:
> > >
> > > http://kosmosfs.sourceforge.net/
> > 
> > Just from the basic information:
> > 
> > http://sourceforge.net/projects/kosmosfs
> > 
> > # Developers : 2
> > # Development Status : 3 - Alpha
> > # Intended Audience : Developers
> > # Registered : 2007-08-30 21:05
> > 
> > and from the history of subversion repository:
> > 
> > http://kosmosfs.svn.sourceforge.net/viewvc/kosmosfs/trunk/
> > 
> > I think it's just not as stable and not as widely used as HDFS:
> > 
> > * HDFS is stable and production level available.
> > 
> > This may not be totally right, and I'm waiting for someone more familiar with
> > KFS to talk about this.



Re: lucene/nutch question...

2008-08-14 Thread Otis Gospodnetic
Bruce, you may want to ask on [EMAIL PROTECTED] or [EMAIL PROTECTED] lists, or 
even [EMAIL PROTECTED]


Yes, it sounds like either Lucene or Solr might be the right tools to use.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: bruce <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, August 14, 2008 4:16:28 PM
> Subject: lucene/nutch question...
> 
> Hi.
> 
> Got a very basic lucene/nutch question.
> 
> Assume I have a page that has a form. Within the form are a number of
> select/drop-down boxes/etc... In this case, each object would comprise a
> variable which would form part of the query string as defined in the form
> action. Is there a way for lucene/nutch to go through the process of
> building up the actions based on the querystring vars, so that lucene/nutch
> can actually search through each possible combination of URLs?
> 
> Also, is nutch/lucene the right/correct app to use in this scenario? Is
> there a better app to handle this kind of potential application/process?
> 
> Thanks
> 
> -bruce




Re: Configuration: I need help.

2008-08-06 Thread Otis Gospodnetic
Hi James,

You can put the same hadoop-site.xml on all machines.  Yes, you do want a 
secondary NN - a single NN is a SPOF.  Browse the archives a few days back to 
find an email from Paul about DRBD (disk replication) to avoid this SPOF.
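
For what it's worth, a sketch of the piece that usually trips people up 
(hostnames reused from James' list, everything else assumed): the primary NN is 
simply the host that fs.default.name points at (and where you run start-dfs.sh), 
conf/masters lists the host(s) where start-dfs.sh launches a *secondary* NN, and 
conf/slaves lists the worker nodes.

# hadoop-site.xml, identical on every box (ports made up):
#   fs.default.name    = hdfs://idx2-namenode:9000
#   mapred.job.tracker = idx1-tracker:9001

# conf/masters -- host(s) to run the *secondary* NN on, not the primary:
idx3-slave

# conf/slaves -- DataNode/TaskTracker hosts:
idx3-slave
...
idx20-slave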


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: James Graham (Greywolf) <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, August 6, 2008 1:37:20 PM
> Subject: Configuration: I need help.
> 
> Seeing as there is no search function on the archives, I'm relegated
> to asking a possibly redundant question or four:
> 
> I have, as a sample setup:
> 
> idx1-trackerJobTracker
> idx2-namenode   NameNode
> idx3-slave  DataTracker
> ...
> idx20-slaveDataTracker
> 
> Q1: Can I put the same hadoop-site.xml file on all machines or do I need
>  to configure each machine separately?
> 
> Q2: My current setup does not seem to find a primary namenode, but instead
>  wants to put idx1 and idx2 as secondary namenodes; as a result, I am
>  not getting anything usable on any of the web addresses (50030, 
> 50050,
>  50070, 50090).
> 
> Q3: Possibly connected to Q1:  The current setup seems to go out and start
>  on all machines (masters/slaves); when I say "bin/start-mapred.sh" on
>  the JobTracker, I get the answer "jobtracker running...kill it 
> first".
> 
> Q4: Do I even *need* a secondary namenode?
> 
> IWBN if I did not have to maintain three separate configuration files
> (jobtracker/namenode/datatracker).
> -- 
> James Graham (Greywolf)  |
> 650.930.1138|925.768.4053  *
> [EMAIL PROTECTED]  |
> Check out what people are saying about SearchMe! -- click below
> http://www.searchme.com/stack/109aa



Re: corrupted fsimage and edits

2008-08-01 Thread Otis Gospodnetic
I had the same thing happen to me a few weeks ago.  The solution was to modify 
one of the classes a bit (FSEdits.java or some such) and simply catch + swallow 
one of the exceptions.  This let the NN come up again (at the expense of some 
data loss).  Lohit helped me out and filed a bug.  Don't have the issue number 
handy, but it is in JIRA and still open as of a few days ago.  NN HA seems to 
be a requirement for a lot of people... I suppose because it's (the only?) 
SPOF. :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Torsten Curdt <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, July 30, 2008 2:09:15 PM
> Subject: corrupted fsimage and edits
> 
> Just a bit of a feedback here.
> 
> One of our hadoop 0.16.4 namenodes had gotten a disk full incident  
> today. No second backup namenode was in place. Both files fsimage and  
> edits seem to have gotten corrupted. After quite a bit of debugging  
> and fiddling with a hex editor we managed to resurrect the files and  
> continue with just minor loss.
> 
> Thankfully this only happened on a development cluster - not on  
> production. But shouldn't that be something that should NEVER happen?
> 
> cheers
> --
> Torsten



Re: Multiple master nodes

2008-08-01 Thread Otis Gospodnetic
I've been wondering about DRBD.  Many (5+?) years ago when I looked at DRBD it 
required too much low-level tinkering and required hardware I did not have.  I 
wonder what it takes to set it up now and if there are any Hadoop-specific 
things you needed to do?  Overall, are you happy with DRBD? (you are limited to 
2 nodes, right?)


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: paul <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Tuesday, July 29, 2008 2:56:44 PM
> Subject: Re: Multiple master nodes
> 
> I'm currently running with your option B setup and it seems to be reliable
> for me (so far).  I use a combination of drbd and various heartbeat/LinuxHA
> scripts that handle the failover process, including a virtual IP for the
> namenode.  I haven't had any real-world unexpected failures to deal with,
> yet, but all manual testing has had consistent and reliable results.
> 
> 
> 
> -paul
> 
> 
> On Tue, Jul 29, 2008 at 1:54 PM, Ryan Shih wrote:
> 
> > Dear Hadoop Community --
> >
> > I am wondering if it is already possible or in the plans to add capability
> > for multiple master nodes. I'm in a situation where I have a master node
> > that may potentially be in a less than ideal execution and networking
> > environment. For this reason, it's possible that the master node could die
> > at any time. On the other hand, the application must always be available. I
> > have accessible to me other machines but I'm still unclear on the best
> > method to add reliability.
> >
> > Here are a few options that I'm exploring:
> > a) To create a completely secondary Hadoop cluster that we can flip to when
> > we detect that the master node has died. This will double hardware costs,
> > so
> > if we originally have a 5 node cluster, then we would need to pull 5 more
> > machines out of somewhere for this decision. This is not the preferable
> > choice.
> > b) Just mirror the master node via other always available software, such as
> > DRBD for real time synchronization. Upon detection we could swap to the
> > alternate node.
> > c) Or if Hadoop had some functionality already in place, it would be
> > fantastic to be able to take advantage of that. I don't know if anything
> > like this is available but I could not find anything as of yet. It seems to
> > me, however, that having multiple master nodes would be the direction
> > Hadoop
> > needs to go if it is to be useful in high availability applications. I was
> > told there are some papers on Amazon's Elastic Computing that I'm about to
> > look for that follow this approach.
> >
> > In any case, could someone with experience in solving this type of problem
> > share how they approached this issue?
> >
> > Thanks!
> >



OK to remove NN's edits file?

2008-07-07 Thread Otis Gospodnetic
Hello,

I have Hadoop 0.16.2 running in a cluster whose Namenode seems to have a 
corrupt "edits" file.  This causes an EOFException during NN init, which causes 
NN to exit immediately (exception below).

What is the recommended thing to do in such a case?

I don't mind losing any of the data that is referenced in "edits" file.
Should I just remove the edits file, start NN, and assume the NN will create a 
new, empty "edits" file and all will be well?
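
For anyone in the same spot: back up the whole name directory before touching 
anything in it, along these lines (the path is made up - use whatever 
dfs.name.dir points to):

bin/stop-dfs.sh                                          # make sure nothing touches fsimage/edits
cp -a /mnt/hadoop/name /mnt/hadoop/name.bak-2008-07-07   # snapshot fsimage, edits, VERSION, ...
bin/hadoop fsck /                                        # after the NN is back up, check what survived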

This is what I see when NN tries to start:

2008-07-07 10:58:43,255 ERROR dfs.NameNode - java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-06 Thread Otis Gospodnetic
)
task_200806101759_0370_m_76_0:  at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.run(MapTask.java:439)
task_200806101759_0370_m_76_0: Caused by: java.io.IOException: No space 
left on device
task_200806101759_0370_m_76_0:  at 
java.io.FileOutputStream.writeBytes(Native Method)
task_200806101759_0370_m_76_0:  at 
java.io.FileOutputStream.write(FileOutputStream.java:260)
task_200806101759_0370_m_76_0:  at 
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:169)
task_200806101759_0370_m_76_0:  ... 17 more

The weird thing is that none of the nodes in the cluster are out of disk space!
(but
maybe that's because one of the nodes really ran out of disk
temporarily, and then some files got cleaned up, so now disks are no
longer full)

$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h df -h | grep mnt; done;
localhost  this is the NN
/dev/sdb  414G  384G  9.1G  98% /mnt
/dev/sdc  414G   91G  302G  24% /mnt2
10.252.93.155
/dev/sdb  414G  389G  3.7G 100% /mnt  // but it has 3.7GB free!
/dev/sdc  414G   93G  300G  24% /mnt2
10.252.239.63
/dev/sdb  414G  388G  4.5G  99% /mnt
/dev/sdc  414G   90G  303G  23% /mnt2
10.251.235.224
/dev/sdb  414G  362G   32G  93% /mnt
/dev/sdc  414G   92G  301G  24% /mnt2
10.252.230.32
/dev/sdb  414G  189G  205G  48% /mnt
/dev/sdc  414G  183G  210G  47% /mnt2


The error when I try starting NN now is an EOFException.  Is there
something that tells Hadoop how many records to read from edits file? 
If so, then if that number is greater than the number of records in the
edits file, then I don't think I'll be able to fix the problem by
removing lines from the edits file.  No?

I don't mind losing *some* data in HDFS.  Can I just remove the edits file completely and
assume that NN will start (even though it won't know about some portion
of data in HDFS which I assume I would then have to find and
remove/clean up somehow myself)?

I've been running Hadoop for several months now.  The first records in the "edits" file seem to be
from 2008-06-10 and most of the records seem to be from June 10, while
I started seeing errors in the logs on June 23.  Here are some details:

## 74K lines
[EMAIL PROTECTED] logs]$ wc -l /mnt/nutch/filesystem/name/current/edits
74403 /mnt/nutch/filesystem/name/current/edits

## 454K lines of "strings"
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | wc 
-l
454271

## Nothing from before June 10
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 2008060 | wc -l
0

## 139K lines from June (nothing from before June 10)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | wc -l
139524

## most of the records are from June 10, seems related to those problematic 
tasks from 20080610
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | grep -c 20080610
130519

## not much from June 11 (12th, 13th, and so on)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 20080611 | wc -l
1834

## the last few non June 10th lines with the string "200806" in them.  I think 
27th is when Hadoop completely died.
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | grep -v 20080610 | tail
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00012
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00013
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00014
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00011
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-5
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-4
F/user/otis/crawl/segments/20080627171831/crawl_generate/_temporary
;/user/otis/crawl/segments/20080627171831/crawl_generate
;/user/otis/crawl/segments/20080627171831/crawl_generate
7/user/otis/crawl/segments/20080627171831/_temporary

## the last few lines
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | tail
67108864
otis
supergroup
f/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch
1214687074224
otis
supergroup
q/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch/part-4
1214687074224
otis

Sorry for the long email.  Maybe you will see something in all this.  Any help 
would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:36:37 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> Hm

Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-05 Thread Otis Gospodnetic
Hm, tried it (simply edited with vi, removed the last line).  I did it with 
both edits file and fsimage file (I see references to FSEditLog.java and 
FSImage.java in the stack trace below), but that didn't seem to help.  Namenode 
just doesn't start at all.  I can't see any errors in logs due to a failed 
startup, I just see that exception below when I actually try using Hadoop.

What's the damage if I simply remove all of their content? (a la > edits and > 
fsimage)


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:08:57 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> I remember dhruba telling me about this once.
> Yes, Take a backup of the whole current directory.
> As you have seen, remove the last line from edits and try to start the 
> NameNode. 
> 
> If it starts, then run fsck to find out which file had the problem. 
> Thanks,
> Lohit
> 
> - Original Message 
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 4:46:57 PM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> Hi,
> 
> If it helps with the problem below -- I don't mind losing some data.
> For instance, I see my "edits" file has about 74K lines.
> Can I just nuke the edits file or remove the last N lines?
> 
> I am looking at the edits file with vi and I see the very last line is very 
> short - it looks like it was cut off, incomplete, and some of the logs do 
> mention running out of disk space (even though the NN machine has some more 
> free 
> space).
> 
> Could I simply remove this last incomplete line?
> 
> Any help would be greatly appreciated.
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, July 4, 2008 2:00:58 AM
> > Subject: ERROR dfs.NameNode - java.io.EOFException
> > 
> > Hi,
> > 
> > Using Hadoop 0.16.2, I am seeing the following in the NN log:
> > 
> > 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> > at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> > at 
> org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> > at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> > at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> > at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> > at 
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> > at 
> > org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> > at 
> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> > at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> > at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> > at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> > at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> > at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> > at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> > 
> > The exception doesn't include the name and location of the file whose 
> > reading 
> is 
> > failing and causing EOFException :(
> > But it looks like it's the fsedit log (the "edits" file, I think).
> > 
> > There is no secondary NN in the cluster.
> > 
> > Is there any way I can revive this NN?  Any way to "fix" the corrupt 
> > "edits" 
> > file?
> > 
> > Thanks,
> > Otis



Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-04 Thread Otis Gospodnetic
Hi,

If it helps with the problem below -- I don't mind losing some data.
For instance, I see my "edits" file has about 74K lines.
Can I just nuke the edits file or remove the last N lines?

I am looking at the edits file with vi and I see the very last line is very 
short - it looks like it was cut off, incomplete, and some of the logs do 
mention running out of disk space (even though the NN machine has some more 
free space).

Could I simply remove this last incomplete line?

Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message ----
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 2:00:58 AM
> Subject: ERROR dfs.NameNode - java.io.EOFException
> 
> Hi,
> 
> Using Hadoop 0.16.2, I am seeing the following in the NN log:
> 
> 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> at 
> org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> at 
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> at 
> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> 
> The exception doesn't include the name and location of the file whose reading 
> is 
> failing and causing EOFException :(
> But it looks like it's the fsedit log (the "edits" file, I think).
> 
> There is no secondary NN in the cluster.
> 
> Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" 
> file?
> 
> Thanks,
> Otis



ERROR dfs.NameNode - java.io.EOFException

2008-07-03 Thread Otis Gospodnetic
Hi,

Using Hadoop 0.16.2, I am seeing the following in the NN log:

2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

The exception doesn't include the name and location of the file whose reading 
is failing and causing EOFException :(
But it looks like it's the fsedit log (the "edits" file, I think).

There is no secondary NN in the cluster.

Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" 
file?

Thanks,
Otis


Re: Monthly user group meeting

2008-06-05 Thread Otis Gospodnetic
Aha, I see, I see, the videos were added to http://research.yahoo.com/node/2104 
.  When I checked that page last time there were only slides there.  Thanks.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Chris Doherty <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 5, 2008 10:17:45 AM
> Subject: Re: Monthly user group meeting
> 
> On Wed, Jun 04, 2008 at 07:56:45PM -0700, Otis Gospodnetic said: 
> > The videos from the Hadoop summit are still not available:
> > 
> http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html
> > 
> > And at this point it looks like they never will be available :(
> 
> I followed the link on that page to http://research.yahoo.com/node/2104
> and was able to start watching the videos for Hive and Zookeeper, and the
> Hadoop overview, on OS X in Firefox 2 and Webkit/Safari.
> 
> Does it produce errors for other people?
> 
> Chris
> 
> 
> 
> > - Original Message 
> > > From: Ajay Anand 
> > > To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]
> > > Cc: Jeff Hammerbacher ; Chad Walters 
> ; Owen O'Malley 
> > > Sent: Wednesday, June 4, 2008 6:00:01 PM
> > > Subject: Monthly user group meeting
> > > 
> > > The next user group meeting is scheduled for June 18th from 6-7:30 pm at
> > > the Yahoo! Mission College campus (2821 Mission College, Santa Clara).
> > > Registration, driving directions etc are at
> > > http://upcoming.yahoo.com/event/760573/
> > > 
> > > 
> > > 
> > > Agenda:
> > > 
> > > 1)   Hadoop at Facebook, Hive - Jeff Hammerbacher
> > > 
> > > 2)   Using Zookeeper - Ted Dunning, Ben Reed
> > > 
> > > ... And plenty of networking, as always.
> > > 
> > > 
> > > 
> > > Ajay
> > 
> 
> 
> 
> ---
> Chris Doherty
> chris [at] randomcamel.net
> 
> "I think," said Christopher Robin, "that we ought to eat
> all our provisions now, so we won't have so much to carry."
>-- A. A. Milne
> ---



Re: Gigablast.com search engine, 10billion pages!!!

2008-06-05 Thread Otis Gospodnetic
Dan,

You may want to ask on Solr, Lucene, or Nutch lists.  However, I can tell you 
already that these numbers look a little...overly optimistic :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Dan Segel <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 5, 2008 9:12:31 AM
> Subject: Gigablast.com search engine, 10billion pages!!!
> 
> Our ultimate goal is to basically replicate the gigablast.com search engine.
> They claim to have less than 500 servers that contain 10 billion pages
> indexed, spidered and updated on a routine basis...  I am looking at
> featuring 500 million pages indexed per node, and have a total of 20 nodes.
> Each node will feature 2 quad-core processors, 4TB (at RAID 5) and 32 GB of
> RAM.  I believe this can be done, however how many searches per second do you
> think would be realistic in this instance?  We are looking at achieving
> 25 +/- searches per second ultimately spread out over the 20 nodes... I could
> really use some advice with this one.
> 
> Thanks,
> D. Segel



Re: Monthly user group meeting

2008-06-04 Thread Otis Gospodnetic
Hi,

 
Any chance the videos will be taken *and* made available outside Yahoo?

The videos from the Hadoop summit are still not available:
http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html

And at this point it looks like they never will be available :(

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Ajay Anand <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Cc: Jeff Hammerbacher <[EMAIL PROTECTED]>; Chad Walters <[EMAIL PROTECTED]>; 
> Owen O'Malley <[EMAIL PROTECTED]>
> Sent: Wednesday, June 4, 2008 6:00:01 PM
> Subject: Monthly user group meeting
> 
> The next user group meeting is scheduled for June 18th from 6-7:30 pm at
> the Yahoo! Mission College campus (2821 Mission College, Santa Clara).
> Registration, driving directions etc are at
> http://upcoming.yahoo.com/event/760573/
> 
> 
> 
> Agenda:
> 
> 1)   Hadoop at Facebook, Hive - Jeff Hammerbacher
> 
> 2)   Using Zookeeper - Ted Dunning, Ben Reed
> 
> ... And plenty of networking, as always.
> 
> 
> 
> Ajay



Adding new disk to DNs - FAQ #15 clarification

2008-06-03 Thread Otis Gospodnetic
Hi,

I'm about to add a new disk (under a new partition) to some existing DataNodes 
that are nearly full.  I see FAQ #15:

15. HDFS. How do I set up a hadoop node to use multiple volumes? 
Data-nodes can store blocks in multiple directories typically allocated on 
different local disk drives. In order to setup multiple directories one needs 
to specify a comma separated list of pathnames as a value of the configuration 
parameter  dfs.data.dir. Data-nodes will attempt to place equal amount of data 
in each of the directories. 

I think some clarification around "will attempt to place equal amount of data 
in each of the directories" is needed:

* Does that apply only if you have multiple disks in a DN from the beginning, 
and thus Hadoop just tries to write to all of them equally?
* Or does that apply to situations like mine, where one disk is nearly 
completely full, and then a new, empty disk is added?

Put another way, if I add the new disk via dfs.data.dir, will Hadoop:
1) try to write the same amount of data to both disks from now on, or
2) try to write exclusively to the new/empty disk first, in order to get it to 
roughly 95% full?

In my case I'd like to add the new mount point to dfs.data.dir and rely on 
Hadoop realizing that it now has one disk partition that is nearly full, and 
one that is completely empty, and just start writing to the new partition until 
it reaches the equilibrium.  If that's not possible, is there a mechanism by 
which I can tell Hadoop to move some of the data from the old partition to the 
new partition?  Something like a balancer tool, but applicable to a single DN 
with multiple volumes...
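
For reference, the change being discussed is just the comma-separated list from 
FAQ #15 (mount points below are made up), followed by a DataNode restart so it 
picks up the new volume:

<property>
  <name>dfs.data.dir</name>
  <!-- old, nearly-full volume first, then the new empty one -->
  <value>/mnt/hadoop/dfs/data,/mnt2/hadoop/dfs/data</value>
</property>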

Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Hadoop 0.17 AMI?

2008-05-21 Thread Otis Gospodnetic
Hi Jeff,

0.17.0 was released yesterday, from what I can tell.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Jeff Eastman <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, May 21, 2008 11:18:56 AM
> Subject: Re: Hadoop 0.17 AMI?
> 
> Any word on 0.17? I was able to build an AMI from a trunk checkout and 
> deploy a single node cluster but the create-hadoop-image-remote script 
> really wants a tarball in the archive. I'd rather not waste time munging 
> the scripts if a release is near.
> 
> Jeff
> 
> Nigel Daley wrote:
> > Hadoop 0.17 hasn't been released yet.  I (or Mukund) is hoping to call 
> > a vote this afternoon or tomorrow.
> >
> > Nige
> >
> > On May 14, 2008, at 12:36 PM, Jeff Eastman wrote:
> >> I'm trying to bring up a cluster on EC2 using
> >> (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
> >> version to use because of the DNS improvements, etc. Unfortunately, I
> >> cannot find a public AMI with this build. Is there one that I'm not
> >> finding or do I need to create one?
> >>
> >> Jeff
> >>
> >
> >
> >



dfs.block.size vs avg block size

2008-05-16 Thread Otis Gospodnetic
Hello,

I checked the ML archives and the Wiki, as well as the HDFS user guide, but 
could not find information about how to change block size of an existing HDFS.

After running fsck I can see that my avg. block size is 12706144 B (cca 12MB), 
and that's a lot smaller than what I have configured: dfs.block.size=67108864 B

Does the difference between the configured block size and the actual (avg) block size 
result in effectively wasted space?
If so, is there a way to change the DFS block size and have Hadoop shrink all 
the existing blocks?
I am OK with not running any jobs on the cluster for a day or two if I can do 
something to free up the wasted disk space.


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Hadoop summit video capture?

2008-05-14 Thread Otis Gospodnetic
I tried finding those Hadoop videos on Veoh, but got 0 hits:

http://www.veoh.com/search.html?type=v&search=hadoop


Got URL, Ted?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, May 14, 2008 1:50:02 PM
> Subject: Re: Hadoop summit video capture?
> 
> 
> Use Veoh instead.  Higher resolution.  Higher uptime.  Nicer embeds.
> 
> And the views get chewed up by hadoop instead of google's implementation!
> 
> (conflict of interest on my part should be noted)
> 
> 
> On 5/14/08 10:43 AM, "Cole Flournoy" wrote:
> 
> > Man, yahoo needs to get their act together with their video service (the
> > videos are still down)!  Is there any way someone can upload these videos to
> > YouTube and provide a link?
> > 
> > Thanks,
> > Cole
> > 
> > On Wed, Apr 23, 2008 at 11:36 AM, Chris Mattmann <
> > [EMAIL PROTECTED]> wrote:
> > 
> >> Thanks, Jeremy. Appreciate it.
> >> 
> >> Cheers,
> >>  Chris
> >> 
> >> 
> >> 
> >> On 4/23/08 8:25 AM, "Jeremy Zawodny" wrote:
> >> 
> >>> Certainly...
> >>> 
> >>> Stay tuned.
> >>> 
> >>> Jeremy
> >>> 
> >>> On 4/22/08, Chris Mattmann wrote:
>  
>  Hi Jeremy,
>  
>  Any chance that these videos could be made in a downloadable format
> >> rather
>  than thru Y!'s player?
>  
>  For example I'm traveling right now and would love to watch the rest of
>  the
>  presentations but the next few hours I won't have an internet
> >> connection.
>  
>  So, my request won't help me, but may help folks in similar situations.
>  
>  Just a thought, thanks!
>  
>  Cheers,
>    Chris
>  
>  
>  
>  On 4/22/08 1:27 PM, "Jeremy Zawodny" wrote:
>  
> > Okay, things appear to be fixed now.
> > 
> > Jeremy
> > 
> > On 4/20/08, Jeremy Zawodny wrote:
> >> 
> >> Not yet... there seem to be a lot of cooks in the kitchen on this one,
>  but
> >> we'll get it fixed.
> >> 
> >> Jeremy
> >> 
> >> On 4/19/08, Cole Flournoy wrote:
> >>> 
> >>> Any news on when the videos are going to work?  I am dying to watch
> >>> them!
> >>> 
> >>> Cole
> >>> 
> >>> On Fri, Apr 18, 2008 at 8:10 PM, Jeremy Zawodny 
> >>> wrote:
> >>> 
>  Almost... The videos and slides are up (as of yesterday) but there
>  appears
>  to be an ACL problem with the videos.
>  
>  
>  
>  
> >> http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_vi
>  deo.html
>  
>  Jeremy
>  
>  On 4/17/08, wuqi wrote:
> > 
> > Are the videos and slides available now?
> > 
> > 
> > - Original Message -
> > From: "Jeremy Zawodny" 
> > To: 
> > 
> > Cc: 
> > Sent: Thursday, March 27, 2008 11:01 AM
> > Subject: Re: Hadoop summit video capture?
> > 
> > 
> >> Slides and video go up next week.  It just takes a few days to
>  assemble.
> >> 
> >> We're glad everyone enjoyed it and was okay with a last minute
>  venue
> > change.
> >> 
> >> Thanks also to Amazon.com and the NSF (not NFS as I typo'd on the
> > printed
> >> agenda!)
> >> 
> >> Jeremy
> >> 
> >> On 3/26/08, Cam Bazz wrote:
> >>> 
> >>> Yes, are there any materials for those who could not come to
>  summit? I
> > am
> >>> really curious about this summit.
> >>> 
> >>> Is the material posted on the hadoop page?
> >>> 
> >>> Best Regards,
> >>> -C.A.
> >>> 
> >>> On Wed, Mar 26, 2008 at 8:48 AM, Isabel Drost <
> >>> [EMAIL PROTECTED]>
> >>> wrote:
> >>> 
> >>> 
>  On Wednesday 26 March 2008, Jeff Eastman wrote:
>  I personally got a lot of positive feedback and interest in
>  Mahout,
> > so
>  expect your inbox to explode in the next couple of days.
>  
>  Sounds great. I was already happy we received quite some
>  traffic
> > after
> >>> we
>  published that we would take part in the GSoC.
>  
>  Isabel
>  
>  --
>  kernel, n.: A part of an operating system that preserves
>  the
> >>> medieval
>  traditions of sorcery and black art.
>   |\  _,,,---,,_   Web:  
>   /,`.-'`'-.  ;-;;,_
>   |,4-  ) )-,_..;\ (  `'-'
>  '---''(_/--'  `-'\_) (fL)  IM:  
>  
> >>> 
> >> 
>  
> >>> 
> >>> 
> >> 
>  
>  
>  ___

Re: HDFS corrupt...how to proceed?

2008-05-13 Thread Otis Gospodnetic
Hi,

I'd love to see the DRBD+Hadoop write up!  Not only would this be useful for 
Hadoop, I can see this being useful for Solr (master replication).


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: C G <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Monday, May 12, 2008 2:40:57 PM
> Subject: Re: HDFS corrupt...how to proceed?
> 
> Thanks to everyone who responded.   Things are back on the air now - all the 
> replication issues seem to have gone away.  I am wading through a detailed 
> fsck 
> output now looking for specific problems on a file-by-file basis.
>   
>   Just in case anybody is interested, we mirror our master nodes using DRBD.  
> It 
> performed very well in this first "real world" test.  If there is interest I 
> can 
> write up how we protect our master nodes in more detail and share w/the 
> community.
>   
>   Thanks,
>   C G
> 
> Ted Dunning wrote:
>   
> 
> You don't need to correct over-replicated files.
> 
> The under-replicated files should cure themselves, but there is a problem on
> old versions where that doesn't happen quite right.
> 
> You can use hadoop fsck / to get a list of the files that are broken and
> there are options to copy what remains of them to lost+found or to delete
> them.
> 
> Other than that, things should correct themselves fairly quickly.
> 
> 
> On 5/11/08 8:23 PM, "C G" 
> wrote:
> 
> > Hi All:
> > 
> > We had a primary node failure over the weekend. When we brought the node
> > back up and I ran Hadoop fsck, I see the file system is corrupt. I'm unsure
> > how best to proceed. Any advice is greatly appreciated. If I've missed a
> > Wiki page or documentation somewhere please feel free to tell me to RTFM and
> > let me know where to look.
> > 
> > Specific question: how to clear under and over replicated files? Is the
> > correct procedure to copy the file locally, delete from HDFS, and then copy
> > back to HDFS?
> > 
> > The fsck output is long, but the final summary is:
> > 
> > Total size: 4899680097382 B
> > Total blocks: 994252 (avg. block size 4928006 B)
> > Total dirs: 47404
> > Total files: 952070
> > 
> > CORRUPT FILES: 2
> > MISSING BLOCKS: 24
> > MISSING SIZE: 1501009630 B
> > 
> > Over-replicated blocks: 1 (1.0057812E-4 %)
> > Under-replicated blocks: 14958 (1.5044476 %)
> > Target replication factor: 3
> > Real replication factor: 2.9849212
> > 
> > The filesystem under path '/' is CORRUPT
> > 
> > 
> > -
> > Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it
> > now.
> 
> 
> 
>   
> -
> Be a better friend, newshound, and know-it-all with Yahoo! Mobile.  Try it 
> now.



Re: why it stopped at Reduce phase?

2008-05-13 Thread Otis Gospodnetic
It appears that your hard disk is full on one of your 2 slaves, that is all.  
If you are on UNIX/linux, type this at the prompt:
df

You should see 100% for the partition where you put HDFS.
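
A quick way to check every slave at once, in the spirit of the loop used 
elsewhere in this archive (adjust the conf/slaves path to your install):

for h in `cat conf/slaves`; do echo $h; ssh $h df -h; done   # any partition at 100% is the culprit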


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: wangxiaowei <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Tuesday, May 13, 2008 11:12:36 AM
> Subject: why it stopped at Reduce phase? 
> 
> hi all:
>I use two computers, A and B, as a hadoop cluster; A is the JobTracker and 
> NameNode, and both A and B are slaves.
> The input data size is about 80MB, including 100,000 records. The job is to 
> read one record at a time, find some useful content in it, and transmit it to 
> reduce.
>But when I submit it, it just runs map tasks, and the reduce task did not run 
> at all! This is the JobTracker's log file:
> 2008-05-13 21:02:00,007 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
> from 
> task_200805132055_0001_m_00_0: FSError: java.io.IOException: No space 
> left 
> on device
> 2008-05-13 21:02:11,952 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
> from 
> task_200805132055_0001_m_01_0: FSError: java.io.IOException: No space 
> left 
> on device
> 2008-05-13 21:02:11,953 INFO org.apache.hadoop.mapred.TaskInProgress: Error 
> from 
> task_200805132055_0001_m_06_0: FSError: java.io.IOException: No space 
> left 
> on device
>I think something is wrong with the configuration file. Can you give me 
> some suggestions?
>Have you run into the same problem?



Re: Balancer not balancing 100%?

2008-05-11 Thread Otis Gospodnetic
Oh, and on top of the above, I just observed that even though bin/hadoop 
balancer exits immediately and reports the cluster is fully balanced, I do see 
*very* few blocks (1-2 blocks per node) getting moved every time I run 
balancer.  It feels as if the balancer does actually find some blocks that it 
could move around, moves them, but then quickly gets lazy and just exits 
claiming the cluster is/was already balanced.  I just ran balancer about 10 
times and each time it moved a couple of blocks and then exited.

Makes me want to do ugly stuff like:
for ((i=1; i <= ; i++)); do echo $i; bin/hadoop balancer; done


...just to get to the point where all 4 nodes have the same number of blocks 
and thus the same percentage of disk used...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Sunday, May 11, 2008 2:36:24 PM
> Subject: Balancer not balancing 100%?
> 
> Hi,
> 
> I have 4 identical nodes in a Hadoop cluster (all functioning as DNs).  One 
> of 
> the 4 nodes is a new node that I recently added.  I ran the balancer a few 
> times 
> and it did move some of the blocks from the other 3 nodes to the new node.  
> However, the 4 nodes are still not 100% balanced (according to the GUI), even 
> though running bin/hadoop balancer says the cluster is balanced:
> 
> Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved
> The cluster is balanced. Exiting...
> Balancing took 666.0 milliseconds
> 
> 
> The 3 old DNs are about 60% full (around 24K blocks), while the 1 new DN is 
> only about 50% full (around 21K blocks).  I restarted the NN and re-ran the 
> balancer, but got the same output: "The cluster is balanced. Exiting..."
> 
> Is this a bug or is it somehow possible for a cluster to be balanced, yet 
> have 
> nodes with different number of blocks?
> 
> Thanks,
> Otis



Balancer not balancing 100%?

2008-05-11 Thread Otis Gospodnetic
Hi,

I have 4 identical nodes in a Hadoop cluster (all functioning as DNs).  One of 
the 4 nodes is a new node that I recently added.  I ran the balancer a few 
times and it did move some of the blocks from the other 3 nodes to the new 
node.  However, the 4 nodes are still not 100% balanced (according to the GUI), 
even though running bin/hadoop balancer says the cluster is balanced:

Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved
The cluster is balanced. Exiting...
Balancing took 666.0 milliseconds


The 3 old DNs are about 60% full (around 24K blocks), while the 1 new DN is 
only about 50% full (around 21K blocks).  I restarted the NN and re-ran the 
balancer, but got the same output: "The cluster is balanced. Exiting..."

Is this a bug or is it somehow possible for a cluster to be balanced, yet have 
nodes with different number of blocks?

Thanks,
Otis


Block re-balancing speed/slowness

2008-05-09 Thread Otis Gospodnetic
Hi,

First off, big thanks to Lohit and Hairong for help with HDFS "corruption", DN 
decommissioning and block re-balancing!

I'm now re-balancing, but like Ted Dunning noted in 
http://markmail.org/message/fzd33k7a3isijto5 , this seems to be a veeery slow 
process.  Here are some concrete numbers:

$ bin/hadoop balancer
Time Stamp                 Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
May 9, 2008 7:14:28 PM     0           0 KB                 100.96 GB           10 GB
May 9, 2008 7:37:28 PM     1           409.66 MB            99.64 GB            10 GB
May 9, 2008 8:00:59 PM     2           840.89 MB            98.3 GB             10 GB
May 9, 2008 8:22:29 PM     3           1.18 GB              97 GB               10 GB
May 9, 2008 8:44:59 PM     4           1.58 GB              95.7 GB             10 GB
May 9, 2008 9:07:30 PM     5           2.1 GB               94.42 GB            10 GB
May 9, 2008 9:29:31 PM     6           2.42 GB              93.09 GB            10 GB
May 9, 2008 9:52:02 PM     7           2.82 GB              91.91 GB            10 GB
May 9, 2008 10:14:02 PM    8           3.47 GB              90.57 GB            10 GB

10 GB in 3 hours... doesn't that seem slow?  I can rsync 1GB of data between 2 
(EC2) boxes in this cluster in about a minute.

$ bc
3*60*60 = 10800 seconds
10 GB = 10 * 2^30 bytes * 8 = 85899345920 bits
85899345920/10800 = 7953643 bits/sec
7953643/1024 = ~7767 Kbit/sec
7953643/1024/1024 = ~7.6 Mbit/sec

Is the block balancer purposely not making 100% use of the available bandwidth 
for some reason?

Or, wait, the "already moved" and "left to move" numbers don't match up, I just 
noticed.  Should one be looking at "bytes being moved" column instead?  In 
other words 8x10GB in 3h above?
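
Either way, I suspect part of the answer is the balancer's own throttle: if this 
version has the dfs.balance.bandwidthPerSec property (it went in with the 
balancer, if I recall correctly), it caps each datanode at 1048576 bytes/sec, 
i.e. about 1 MB/sec, by default, which is in the same ballpark as the numbers 
above.  Assuming that property is available here, raising it in hadoop-site.xml 
on the datanodes (and restarting them) should help, e.g.:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- 10 MB/sec per datanode instead of the default 1 MB/sec; the value is in bytes -->
  <value>10485760</value>
</property>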

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Corrupt HDFS and salvaging data

2008-05-09 Thread Otis Gospodnetic
Hi,

 > A default replication factor of 3 does not mean that every block's

> replication factor in the file system is 3.

Hm, and I thought that is exactly what it meant.  What does it mean then?  Or 
are you saying:
The number of block replicas matches the r.f. that was in place when the block 
was *created* ?

> In case (1), some blocks have a replication factor which is less than 3. So
> the average replication factor is less than 3. But no missing replicas. 

Makes sense.  Most likely due to the repl. fact. being = 1 at some point.  But 
then why does bin/hadoop balancer tell me that the cluster is balanced?  Does 
it not take into consideration the *current* replication factor?

> In case 2, some blocks have zero replicas, so only 92.72564% are minimally
> replicated. Those missing blocks must have a replication factor of 1 and
> were placed on the removed DN.

Makes sense.  So there are two things that need to be done:
- get the blocks on the about to be removed DN off of that DN, so copies exist 
elsewhere (decommissioning)
- get the cluster to re-balance, factoring in the *current* replication factor. 
(re-balancing)

Is this correct?

I think that's what your other email said (FAQ #17).  I'm doing that now and it 
seems to be progressing, although I started the balancer immediately after 
running dfsadmin -refreshNodes (it didn't block, so I thought it didn't 
work...).  I hope the fact that decommission and balancer are running 
simultaneously doesn't cause problems...
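
For the record, the decommission recipe from that FAQ entry boils down to 
pointing the NameNode at an exclude file and asking it to re-read its node 
lists; a minimal sketch, assuming the property names of this era (the file path 
is just a placeholder):

<property>
  <name>dfs.hosts.exclude</name>
  <!-- file listing the datanodes to decommission, one hostname per line -->
  <value>/path/to/hadoop/conf/exclude</value>
</property>

# after adding the datanode's name to the exclude file:
bin/hadoop dfsadmin -refreshNodes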

Thanks!
Otis


> On 5/9/08 7:16 AM, "Otis Gospodnetic" wrote:
> 
> > Hi,
> > 
> > Here are 2 "bin/hadoop fsck / -files -blocks locations" reports:
> > 
> > 1) For the old HDFS cluster, reportedly HEALTHY, but with this 
> > inconsistency:
> > 
> > http://www.krumpir.com/fsck-old.txt.zip   ( < 1MB)
> > 
> > Total blocks:  32264 (avg. block size 11591245 B)
> > Minimally replicated blocks:   32264 (100.0 %) <== looks GOOD, 
> > matches
> > "Total blocks"
> > Over-replicated blocks:0 (0.0 %)
> > Under-replicated blocks:   0 (0.0 %)
> > Mis-replicated blocks: 0 (0.0 %)
> > Default replication factor:3   <== 
> > should
> > have 3 copies of each block
> > Average block replication: 2.418051<== ???  
> > shouldn't
> > this be 3??
> > Missing replicas:  0 (0.0 %)<==
> > if the above is 2.41... how can I have 0 missing replicas?
> > 
> > 2) For the cluster with 1 old DN replaced with 1 new DN:
> > 
> > http://www.krumpir.com/fsck-1newDN.txt.zip ( < 800KB)
> > 
> >  Minimally replicated blocks:   29917 (92.72564 %)
> >  Over-replicated blocks:0 (0.0 %)
> >  Under-replicated blocks:   17124 (53.074635 %)
> >  Mis-replicated blocks: 0 (0.0 %)
> >  Default replication factor:3
> >  Average block replication: 1.8145611
> >  Missing replicas:  17124 (29.249296 %)
> > 
> > 
> > 
> > Any help would be appreciated.
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > - Original Message 
> >> From: lohit 
> >> To: core-user@hadoop.apache.org
> >> Sent: Friday, May 9, 2008 2:47:39 AM
> >> Subject: Re: Corrupt HDFS and salvaging data
> >> 
> >> When you say all daemons, do you mean the entire cluster, including the
> >> namenode?
> >>> According to your explanation, this means that after I removed 1 DN I
> >>> started 
> >> missing about 30% of the blocks, right?
> >> No, You would only miss the replica. If all of your blocks have replication
> >> factor of 3, then you would miss only one replica which was on this DN.
> >> 
> >> It would be good to see full report
> >> could you run hadoop fsck / -files -blocks -location?
> >> 
> >> That would give you much more detailed information.
> >> 
> >> 
> >> - Original Message 
> >> From: Otis Gospodnetic
> >> To: core-user@hadoop.apache.org
> >> Sent: Thursday, May 8, 2008 10:54:53 PM
> >> Subject: Re: Corrupt HDFS and salvaging data
> >> 
> >> Lohit,
> >> 
> >> 
> >> I run fsck after I replaced 1 DN (with data on it) with 1 blank DN and
> >> started 
> >> all daemons.
> >> I see the fsck report does include this:
> >> Missing replicas:  17025 (29.727087 %)
> >&

Re: Corrupt HDFS and salvaging data

2008-05-09 Thread Otis Gospodnetic
Hi,

Here are 2 "bin/hadoop fsck / -files -blocks locations" reports:

1) For the old HDFS cluster, reportedly HEALTHY, but with this inconsistency:

http://www.krumpir.com/fsck-old.txt.zip   ( < 1MB)

Total blocks:  32264 (avg. block size 11591245 B)
Minimally replicated blocks:   32264 (100.0 %) <== looks GOOD, matches 
"Total blocks"
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:3   <== should 
have 3 copies of each block
Average block replication: 2.418051<== ???  shouldn't 
this be 3??
Missing replicas:  0 (0.0 %)<==
if the above is 2.41... how can I have 0 missing replicas?

2) For the cluster with 1 old DN replaced with 1 new DN:

http://www.krumpir.com/fsck-1newDN.txt.zip ( < 800KB)

 Minimally replicated blocks:   29917 (92.72564 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   17124 (53.074635 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:3
 Average block replication: 1.8145611
 Missing replicas:  17124 (29.249296 %)



Any help would be appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 2:47:39 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> When you say all daemons, do you mean the entire cluster, including the 
> namenode?
> >According to your explanation, this means that after I removed 1 DN I 
> >started 
> missing about 30% of the blocks, right?
> No, You would only miss the replica. If all of your blocks have replication 
> factor of 3, then you would miss only one replica which was on this DN.
> 
> It would be good to see full report
> could you run hadoop fsck / -files -blocks -location?
> 
> That would give you much more detailed information. 
> 
> 
> - Original Message 
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 10:54:53 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Lohit,
> 
> 
> I run fsck after I replaced 1 DN (with data on it) with 1 blank DN and 
> started 
> all daemons.
> I see the fsck report does include this:
> Missing replicas:  17025 (29.727087 %)
> 
> According to your explanation, this means that after I removed 1 DN I started 
> missing about 30% of the blocks, right?
> Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I 
> removed?  But how could that be when I have replication factor of 3?
> 
> If I run bin/hadoop balancer with my old DN back in the cluster (and new DN 
> removed), I do get the happy "The cluster is balanced" response.  So wouldn't 
> that mean that everything is peachy and that if my replication factor is 3 
> then 
> when I remove 1 DN, I should have only some portion of blocks 
> under-replicated, 
> but not *completely* missing from HDFS?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: lohit 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, May 9, 2008 1:33:56 AM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi Otis,
> > 
> > Namenode has location information about all replicas of a block. When you 
> > run 
> > fsck, namenode checks for those replicas. If all replicas are missing, then 
> fsck 
> > reports the block as missing. Otherwise they are added to under replicated 
> > blocks. If you specify -move or -delete option along with fsck, files with 
> such 
> > missing blocks are moved to /lost+found or deleted depending on the option. 
> > At what point did you run the fsck command, was it after the datanodes were 
> > stopped? When you run namenode -format it would delete directories 
> > specified 
> in 
> > dfs.name.dir. If directory exists it would ask for confirmation. 
> > 
> > Thanks,
> > Lohit
> > 
> > - Original Message 
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 9:00:34 PM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > Update:
> > It seems fsck reports HDFS is corrupt when a significant-enough number of 
> block 
> > replicas is missing (or something like that).
> > fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN.  After 
> > I 
> > resta

Re: Corrupt HDFS and salvaging data

2008-05-09 Thread Otis Gospodnetic
Hi,

Yes, when I say "all daemons" I mean all 4 types - NN, DNs, JT, and TTs.
The cluster did initially use a replication factor of 1.  This was later changed 
to 3.  About 90% of the cluster's runtime was spent with a repl. factor of 3.
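
Side note: since the replication factor is recorded per file when the file is 
created, anything written during the factor-1 period presumably stays at 1 
unless it is bumped explicitly.  Assuming the dfs shell here has -setrep, 
something like the following should raise it across the board:

bin/hadoop dfs -setrep -R 3 /    # recursively set replication to 3; add -w to wait for it to complete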

If I run bin/hadoop balancer with all 4 old DNs it tells me that the cluster is 
balanced which, if I understand this correctly, means that each and every block 
has at least 3 replicas.  But this then doesn't match the fsck report (all old 
DNs, no new DNs):

Status: HEALTHY
 Total size:373979942715 B
 Total dirs:14106
 Total files:   29548
 Total blocks:  32264 (avg. block size 11591245 B)
 Minimally replicated blocks:   32264 (100.0 %) <== looks GOOD, matches 
"Total blocks"
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:3<== should 
have 3 copies of each block
 Average block replication: 2.418051  <== ???  
shouldn't this be 3??
 Missing replicas:  0 (0.0 %) <== if the 
above is 2.41... how can I have 0 missing replicas?
 Number of data-nodes:  4
 Number of racks:   1

I'll send links to full fsck reports in a separate email.

Thanks,

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 2:47:39 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> When you say all daemons, do you mean the entire cluster, including the 
> namenode?
> >According to your explanation, this means that after I removed 1 DN I 
> >started 
> missing about 30% of the blocks, right?
> No, You would only miss the replica. If all of your blocks have replication 
> factor of 3, then you would miss only one replica which was on this DN.
> 
> It would be good to see full report
> could you run hadoop fsck / -files -blocks -location?
> 
> That would give you much more detailed information. 
> 
> 
> - Original Message 
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 10:54:53 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Lohit,
> 
> 
> I run fsck after I replaced 1 DN (with data on it) with 1 blank DN and 
> started 
> all daemons.
> I see the fsck report does include this:
> Missing replicas:  17025 (29.727087 %)
> 
> According to your explanation, this means that after I removed 1 DN I started 
> missing about 30% of the blocks, right?
> Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I 
> removed?  But how could that be when I have replication factor of 3?
> 
> If I run bin/hadoop balancer with my old DN back in the cluster (and new DN 
> removed), I do get the happy "The cluster is balanced" response.  So wouldn't 
> that mean that everything is peachy and that if my replication factor is 3 
> then 
> when I remove 1 DN, I should have only some portion of blocks 
> under-replicated, 
> but not *completely* missing from HDFS?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: lohit 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, May 9, 2008 1:33:56 AM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi Otis,
> > 
> > Namenode has location information about all replicas of a block. When you 
> > run 
> > fsck, namenode checks for those replicas. If all replicas are missing, then 
> fsck 
> > reports the block as missing. Otherwise they are added to under replicated 
> > blocks. If you specify -move or -delete option along with fsck, files with 
> such 
> > missing blocks are moved to /lost+found or deleted depending on the option. 
> > At what point did you run the fsck command, was it after the datanodes were 
> > stopped? When you run namenode -format it would delete directories 
> > specified 
> in 
> > dfs.name.dir. If directory exists it would ask for confirmation. 
> > 
> > Thanks,
> > Lohit
> > 
> > - Original Message 
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 9:00:34 PM
> > Subject: Re: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > Update:
> > It seems fsck reports HDFS is corrupt when a significant-enough number of 
> block 
> > replicas is missing (or something like that).
> > fsck reported corrupt HDFS

Re: Corrupt HDFS and salvaging data

2008-05-08 Thread Otis Gospodnetic
Lohit,


I ran fsck after I replaced 1 DN (with data on it) with 1 blank DN and started 
all daemons.
I see the fsck report does include this:
Missing replicas:  17025 (29.727087 %)

According to your explanation, this means that after I removed 1 DN I started 
missing about 30% of the blocks, right?
Wouldn't that mean that 30% of all blocks were *only* on the 1 DN that I 
removed?  But how could that be when I have replication factor of 3?

If I run bin/hadoop balancer with my old DN back in the cluster (and new DN 
removed), I do get the happy "The cluster is balanced" response.  So wouldn't 
that mean that everything is peachy and that if my replication factor is 3 then 
when I remove 1 DN, I should have only some portion of blocks under-replicated, 
but not *completely* missing from HDFS?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, May 9, 2008 1:33:56 AM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi Otis,
> 
> Namenode has location information about all replicas of a block. When you run 
> fsck, namenode checks for those replicas. If all replicas are missing, then 
> fsck 
> reports the block as missing. Otherwise they are added to under replicated 
> blocks. If you specify -move or -delete option along with fsck, files with 
> such 
> missing blocks are moved to /lost+found or deleted depending on the option. 
> At what point did you run the fsck command, was it after the datanodes were 
> stopped? When you run namenode -format it would delete directories specified 
> in 
> dfs.name.dir. If directory exists it would ask for confirmation. 
> 
> Thanks,
> Lohit
> 
> - Original Message 
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 9:00:34 PM
> Subject: Re: Corrupt HDFS and salvaging data
> 
> Hi,
> 
> Update:
> It seems fsck reports HDFS is corrupt when a significant-enough number of 
> block 
> replicas is missing (or something like that).
> fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN.  After I 
> restarted Hadoop with the old set of DNs, fsck stopped reporting corrupt HDFS 
> and started reporting *healthy* HDFS.
> 
> 
> I'll follow-up with re-balancing question in a separate email.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, May 8, 2008 11:35:01 PM
> > Subject: Corrupt HDFS and salvaging data
> > 
> > Hi,
> > 
> > I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm 
> > trying 
> > not to lose the precious data in it.  I accidentally ran bin/hadoop 
> > namenode 
> > -format on a *new DN* that I just added to the cluster.  Is it possible for 
> that 
> > to corrupt HDFS?  I also had to explicitly kill DN daemons before that, 
> because 
> > bin/stop-all.sh didn't stop them for some reason (it always did so before).
> > 
> > Is there any way to salvage the data?  I have a 4-node cluster with 
> replication 
> > factor of 3, though fsck reports lots of under-replicated blocks:
> > 
> >   
> >   CORRUPT FILES:3355
> >   MISSING BLOCKS:   3462
> >   MISSING SIZE: 17708821225 B
> >   
> > Minimally replicated blocks:   28802 (89.269775 %)
> > Over-replicated blocks:0 (0.0 %)
> > Under-replicated blocks:   17025 (52.76779 %)
> > Mis-replicated blocks: 0 (0.0 %)
> > Default replication factor:3
> > Average block replication: 1.7750744
> > Missing replicas:  17025 (29.727087 %)
> > Number of data-nodes:  4
> > Number of racks:   1
> > 
> > 
> > The filesystem under path '/' is CORRUPT
> > 
> > 
> > What can one do at this point to save the data?  If I run bin/hadoop fsck 
> -move 
> > or -delete will I lose some of the data?  Or will I simply end up with 
> > fewer 
> > block replicas and will thus have to force re-balancing in order to get 
> > back 
> to 
> > a "safe" number of replicas?
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



How to re-balance, NN safe mode

2008-05-08 Thread Otis Gospodnetic
Hi,

(I should prefix this by saying that bin/hadoop fsck reported corrupt HDFS 
after I replaced one of the DNs with a new/empty DN)

I've removed 1 old DN and added 1 new DN .  The cluster has 4 nodes total (all 
4 act as DNs) and replication factor of 3.  I'm trying to re-balance the data 
by following http://wiki.apache.org/hadoop/FAQ#6:
- I stopped all daemons
- I removed the old DN and added the new DN to conf/slaves
- I started all daemons

The new DN shows in the JT and NN GUIs and bin/hadoop dfsadmin -report shows 
it.  At this point I expected NN to figure out that it needs to re-balance 
under-replicated blocks and start pushing data to the new DN.  However, no data 
got copied to the new DN.  I pumped the replication factor to 6 and restarted 
all daemons, but still nothing.  I noticed the NN GUI says the NN is in safe 
mode, but it has been stuck there for 10+ minutes now - too long, it seems.

I then tried running bin/hadoop balancer, but got this:

 
$ bin/hadoop balancer
Received an IO exception: org.apache.hadoop.dfs.SafeModeException: Cannot 
create file/system/balancer.id. Name node is in safe mode.
Safe mode will be turned off automatically.
at 
org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:947)
at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:931)
...
...

So now I'm wondering what steps one needs to follow when replacing a DN?  Just 
pulling it out and listing a new one in conf/slaves leads to NN getting into 
the permanent(?) safe mode, it seems.

I know I can run bin/hadoop dfsadmin -safemode leave  but is that safe? ;)
If I do that, will I then be able to run bin/hadoop balancer and get some 
replicas of the old HDFS data on the newly added DN?
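
For reference, the dfsadmin sub-commands I'd be leaning on (assuming they all 
exist in this version, as they appear to) are:

bin/hadoop dfsadmin -safemode get      # check whether the NN is still in safe mode
bin/hadoop dfsadmin -safemode wait     # block until it leaves safe mode on its own
bin/hadoop dfsadmin -safemode leave    # force it out, once the block reports look trustworthy
bin/hadoop balancer                    # then try re-balancing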

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Corrupt HDFS and salvaging data

2008-05-08 Thread Otis Gospodnetic
Hi,

Update:
It seems fsck reports HDFS is corrupt when a significant-enough number of block 
replicas is missing (or something like that).
fsck reported corrupt HDFS after I replaced 1 old DN with 1 new DN.  After I 
restarted Hadoop with the old set of DNs, fsck stopped reporting corrupt HDFS 
and started reporting *healthy* HDFS.


I'll follow-up with re-balancing question in a separate email.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 11:35:01 PM
> Subject: Corrupt HDFS and salvaging data
> 
> Hi,
> 
> I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm trying 
> not to lose the precious data in it.  I accidentally ran bin/hadoop namenode 
> -format on a *new DN* that I just added to the cluster.  Is it possible for 
> that 
> to corrupt HDFS?  I also had to explicitly kill DN daemons before that, 
> because 
> bin/stop-all.sh didn't stop them for some reason (it always did so before).
> 
> Is there any way to salvage the data?  I have a 4-node cluster with 
> replication 
> factor of 3, though fsck reports lots of under-replicated blocks:
> 
>   
>   CORRUPT FILES:3355
>   MISSING BLOCKS:   3462
>   MISSING SIZE: 17708821225 B
>   
> Minimally replicated blocks:   28802 (89.269775 %)
> Over-replicated blocks:0 (0.0 %)
> Under-replicated blocks:   17025 (52.76779 %)
> Mis-replicated blocks: 0 (0.0 %)
> Default replication factor:3
> Average block replication: 1.7750744
> Missing replicas:  17025 (29.727087 %)
> Number of data-nodes:  4
> Number of racks:   1
> 
> 
> The filesystem under path '/' is CORRUPT
> 
> 
> What can one do at this point to save the data?  If I run bin/hadoop fsck 
> -move 
> or -delete will I lose some of the data?  Or will I simply end up with fewer 
> block replicas and will thus have to force re-balancing in order to get back 
> to 
> a "safe" number of replicas?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Corrupt HDFS and salvaging data

2008-05-08 Thread Otis Gospodnetic
Hi,

I have a case of a corrupt HDFS (according to bin/hadoop fsck) and I'm trying 
not to lose the precious data in it.  I accidentally ran bin/hadoop namenode 
-format on a *new DN* that I just added to the cluster.  Is it possible for 
that to corrupt HDFS?  I also had to explicitly kill DN daemons before that, 
because bin/stop-all.sh didn't stop them for some reason (it always did so 
before).

Is there any way to salvage the data?  I have a 4-node cluster with replication 
factor of 3, though fsck reports lots of under-replicated blocks:

  
  CORRUPT FILES:3355
  MISSING BLOCKS:   3462
  MISSING SIZE: 17708821225 B
  
 Minimally replicated blocks:   28802 (89.269775 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   17025 (52.76779 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:3
 Average block replication: 1.7750744
 Missing replicas:  17025 (29.727087 %)
 Number of data-nodes:  4
 Number of racks:   1


The filesystem under path '/' is CORRUPT


What can one do at this point to save the data?  If I run bin/hadoop fsck -move 
or -delete will I lose some of the data?  Or will I simply end up with fewer 
block replicas and will thus have to force re-balancing in order to get back to 
a "safe" number of replicas?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Changing DN hostnames->IPs

2008-05-08 Thread Otis Gospodnetic
Thank you, Raghu.  I'm using 0.16.3, so I should be safe. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Raghu Angadi <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, May 8, 2008 9:10:33 PM
> Subject: Re: Changing DN hostnames->IPs
> 
> Short answer : renaming is not a problem.
> 
> If you are running fairly recent Hadoop, NN does not store information 
> about the DataNode persistently. So you should be ok in the sense 
> NameNode does not depend on datanodes before the restart.
> 
> If you are running fairly old Hadoop, functionally it will be ok. Only 
> annoying thing would be that you might see all the previous datanodes 
> listed as dead.
> 
> The recent trunk has some issues related using Datanode ID for some 
> unclosed files.. but you may not be affected by that.
> 
> Raghu.
> 
> Otis Gospodnetic wrote:
> > Hi,
> > 
> > Will NN get confused if I change the names of slaves from hostnames to IPs?
> > That is, if I've been running Hadoop for a while, and then decide to shut 
> > down 
> all its daemons, switch to IPs, and start everything back up, will the 
> master/NN 
> still see all the DN slaves as before and will it know they are the same old 
> set 
> of DN slaves?
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 



Changing DN hostnames->IPs

2008-05-08 Thread Otis Gospodnetic
Hi,

Will NN get confused if I change the names of slaves from hostnames to IPs?
That is, if I've been running Hadoop for a while, and then decide to shut down 
all its daemons, switch to IPs, and start everything back up, will the 
master/NN still see all the DN slaves as before and will it know they are the 
same old set of DN slaves?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Hadoop Resiliency

2008-05-06 Thread Otis Gospodnetic
Hi Arv,

1) look for info on "secondary NameNode" on the Hadoop wiki and ML archives.

2) I don't think a NN is supposed to restart a killed DN.  I haven't tried it, 
haven't seen it, but haven't read that anywhere either.

3) I think bin/start-dfs.sh is what you are after, no?  Or at least one of 
the last couple of lines from there.
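
Concretely, the per-node piece of start-dfs.sh is the hadoop-daemon.sh helper, 
so starting (or stopping) a single DataNode on a slave is roughly:

bin/hadoop-daemon.sh start datanode    # run on the slave itself; "stop datanode" is the counterpart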

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Arv Mistry <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Tuesday, May 6, 2008 3:04:56 PM
> Subject: Hadoop Resiliency
> 
>  
> Hi folks,
> 
> I'm new to hadoop and just had a few questions regarding resiliency
> 
> i) Does hadoop support redundant NameNodes? I didn't see any mention of
> it.
> 
> ii) In a distributed setup, when you kill a DataNode, should the
> NameNode restart it automatically? I see the NameNode detects
> (eventually) that its down but it never seems to restart it. Is the
> expectation that some kind of wrapper (e.g. Java Service Wrapper) will
> do this.
> 
> iii) Maybe an obvious one, but I couldn't see how you just start a
> DataNode from the scripts. Should I create my own script to do that,
> based on start_dfs?
> 
> Cheers Arv
> 




Re: DistributedCache on Java Objects?

2008-04-30 Thread Otis Gospodnetic
How about using an out of process cache, such as memcached?
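
To make that concrete, here is a rough sketch with one of the Java memcached 
clients (spymemcached); the client choice, key scheme, host and expiry are all 
assumptions on my part, not anything from this thread:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class PatternCache {
  public static void main(String[] args) throws Exception {
    // One memcached daemon shared by all map tasks; host/port are placeholders.
    MemcachedClient client = new MemcachedClient(new InetSocketAddress("cache-host", 11211));
    String pattern = "some-pattern";
    Object cached = client.get(pattern);          // null on a cache miss
    if (cached == null) {
      String answer = lookupInDatabase(pattern);  // the expensive query you want to avoid repeating
      client.set(pattern, 3600, answer);          // cache the answer for an hour
      cached = answer;
    }
    System.out.println(cached);
    client.shutdown();
  }

  private static String lookupInDatabase(String pattern) {
    return "answer-for-" + pattern;               // stand-in for the real DB/index lookup
  }
}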

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: ncardoso <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, April 30, 2008 11:14:54 AM
> Subject: DistributedCache on Java Objects?
> 
> 
> Hello, 
> 
> I'm using Hadoop for distributed text mining of large collection of
> documents, and in my optimizing process, I want to speed things up a bit,
> and I want to know how I can do this step with Hadoop...
> 
> Each Map process takes a group of documents, analyses each sentence, and for
> certain patterns it queries a database and some indexes to provide a proper
> answer. This step can take a while, so I'm caching the results in a
> LinkedHashMap, which works pretty well for standalone jobs, and avoids
> repeated queries for the same patterns in a document. 
> 
> I think it would be great to share this LinkedHashMap cache object for all
> Map instances, so that if the #2 Map object finds the same pattern as the #1
> Map object previously noticed in another document, it can use the cached
> result that #1 Map placed there for all Map objects, saving some time. 
> 
> Right now, the DistributedCache just shares files, archives and jar files.
> Is there any way to share such a Java object such as a LinkedHashMap,
> synchronized or not?  
>  
> -- 
> View this message in context: 
> http://www.nabble.com/DistributedCache-on-Java-Objects--tp16985074p16985074.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
> 
> 




Simultaneous jobs and map/reduce sharing

2008-04-24 Thread Otis Gospodnetic
Hi,

I'm trying to run multiple jobs on the same cluster and get them to run 
simultaneously.  I have them running simultaneously "somewhat", but have some 
questions (didn't find answers in the FAQ or Wiki).

Problem:
I start 2 jobs with a short (10 sec) pause between them. Job 1 quickly grabs 
all available map tasks and "hogs" them.  Consequently, Job 2 has all its map 
tasks in pending mode until Job 1 gets closer to the end and starts freeing up 
map tasks.

Example:
Cluster size = 4 nodes
Cluster Map Task Capacity = 16
Cluster Reduce Task Capacity = 8
mapred.tasktracker.map.tasks.maximum = 4
mapred.tasktracker.reduce.tasks.maximum = 2
mapred.map.tasks = 23
mapred.reduce.tasks = 11
mapred.speculative.execution = false

Job 1:
Map Total = 21
Reduce Total = 11

Job 2:
Map Total = 63
Reduce Total = 23

When Job 1 starts, it quickly grabs all 16 map tasks (the Cluster Map Task 
Capacity) and only several hours later, when it completes 6 of its 21 tasks 
(21-6=15, which is < 16), it starts freeing up map slots for Job 2.  The same 
thing happens in the reduce phase.

What I'd like is to find a way to control how much each job gets and thus 
schedule them better.  I believe I could change the number of "Map Total" for 
Job 1, so that it is < Cluster Map Task Capacity, so that Job 2 can get at 
least one map slot right away, but then Job 1 will take longer.  


If it matters, Job 1 and Job 2 are very different - Job 1 is network intensive 
(Nutch fetcher) and Job 2 is CPU and disk IO intensive (Nutch generate job).  
If I start them separately with whole cluster dedicated to a single running 
job, then Job 1 finishes in about 10 hours, and Job 2 finished is about 1.5 
hours.  I was hoping to start the slow Job 1 and, while it's running, maximize 
the use of the CPU by running and completing several Job 2 instances.

Are there other, better options?
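
One thing I'm considering, sketched below under the assumption that the old 
JobConf/JobClient API is what's in play here: cap Job 1's map count below the 
cluster capacity and submit it asynchronously, so Job 2's maps can be scheduled 
alongside it.  The numbers and class names are illustrative only, and 
setNumMapTasks() is just a hint to the framework, so the actual split count can 
differ.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class TwoJobs {
  public static void main(String[] args) throws Exception {
    JobConf fetchJob = new JobConf(TwoJobs.class);      // the slow, network-bound job
    fetchJob.setJobName("job-1-fetch");
    fetchJob.setNumMapTasks(12);                        // hint: stay below the 16-slot map capacity
    fetchJob.setNumReduceTasks(6);                      // likewise leave some reduce slots free
    // (input/output paths, mapper and reducer classes omitted for brevity)

    JobConf generateJob = new JobConf(TwoJobs.class);   // the CPU/disk-bound job
    generateJob.setJobName("job-2-generate");
    generateJob.setNumReduceTasks(2);

    RunningJob j1 = new JobClient(fetchJob).submitJob(fetchJob);          // returns immediately
    RunningJob j2 = new JobClient(generateJob).submitJob(generateJob);    // runs alongside job 1

    j2.waitForCompletion();    // block on whichever job you care about first
    j1.waitForCompletion();
  }
}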


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Where can I purchase blog data

2008-04-24 Thread Otis Gospodnetic
I'd go to Technorati or BuzzLogic or http://spinn3r.com/


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Alan Ho <[EMAIL PROTECTED]>
> To: Hadoop 
> Sent: Thursday, April 24, 2008 10:06:46 AM
> Subject: Where can I purchase blog data
> 
> I'd like to do some historical analysis on blogs. Does anyone know where I 
> can 
> buy blog data ?
> 
> Regards,
> Alan Ho


Re: Hadoop "remembering" old mapred.map.tasks

2008-04-21 Thread Otis Gospodnetic
It turns out Hadoop was not remembering anything and the answer is in the FAQ:

http://wiki.apache.org/hadoop/FAQ#13
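
In short, the capacity numbers come from per-TaskTracker settings that are read 
when the TaskTracker daemons start, so the change has to go into hadoop-site.xml 
on every node and the daemons have to be restarted.  The values below are just 
an example for dual-core boxes:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>   <!-- concurrent map tasks per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>   <!-- concurrent reduce tasks per TaskTracker -->
</property>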

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Sunday, April 20, 2008 8:14:43 PM
> Subject: Hadoop "remembering" old mapred.map.tasks
> 
> Hi,
> 
> Does Hadoop cache settings set in hadoop-*xml between runs?
> I'm using Hadoop 0.16.2 and have initially set the number of map and reduce 
> tasks to 8 of each.  After running a number of jobs I wanted to increase that 
> number (to 23 maps and 11 reduces), so I changed the mapred.map.tasks and 
> mapred.reduce.tasks properties in hadoop-site.xml.  I then stopped everything 
> (stop-all.sh) and copied my modified hadoop-site.xml to all nodes in the 
> cluster.  I also rebuilt the .job file and pushed that out to all nodes, too.
> 
> However, when I start everything up again I *still* see Map Task Capacity is 
> equal to 8, and the same for Reduce Task Capacity.
> Am I supposed to do something in addition to the above to make Hadoop 
> "forget" 
> my old settings?  I can't find *any* references to mapred.map.tasks in any of 
> the Hadoop files except for my hadoop-site.xml, so I can't figure out why 
> Hadoop 
> is still stuck on 8.
> 
> Although the max capacity is set to 8, when I run my jobs now I *do* see that 
> they get broken up into 23 maps and 11 reduces (it was 8 before), but only 8 
> of 
> them run in parallel.  There are 4 dual-core machines in the cluster for a 
> total 
> of 8 cores.  Is Hadoop able to figure this out and that is why it runs only 8 
> tasks in parallel, despite my higher settings?
> 
> Thanks,
> Otis



Hadoop "remembering" old mapred.map.tasks

2008-04-20 Thread Otis Gospodnetic
Hi,

Does Hadoop cache settings set in hadoop-*xml between runs?
I'm using Hadoop 0.16.2 and have initially set the number of map and reduce 
tasks to 8 of each.  After running a number of jobs I wanted to increase that 
number (to 23 maps and 11 reduces), so I changed the mapred.map.tasks and 
mapred.reduce.tasks properties in hadoop-site.xml.  I then stopped everything 
(stop-all.sh) and copied my modified hadoop-site.xml to all nodes in the 
cluster.  I also rebuilt the .job file and pushed that out to all nodes, too.

However, when I start everything up again I *still* see Map Task Capacity is 
equal to 8, and the same for Reduce Task Capacity.
Am I supposed to do something in addition to the above to make Hadoop "forget" 
my old settings?  I can't find *any* references to mapred.map.tasks in any of 
the Hadoop files except for my hadoop-site.xml, so I can't figure out why 
Hadoop is still stuck on 8.

Although the max capacity is set to 8, when I run my jobs now I *do* see that 
they get broken up into 23 maps and 11 reduces (it was 8 before), but only 8 of 
them run in parallel.  There are 4 dual-core machines in the cluster for a 
total of 8 cores.  Is Hadoop able to figure this out and that is why it runs 
only 8 tasks in parallel, despite my higher settings?

Thanks,
Otis


Re: hadoop and talend integration

2008-03-30 Thread Otis Gospodnetic
Hello,

I don't have an answer, but I just noticed that there is a proposal for 
Hadoop+Talend for Google Summer of Code, so maybe we'll have this integration 
working by Fall.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: dito subandono <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Sunday, March 30, 2008 7:01:07 PM
Subject: hadoop and talend integration

Hi there everyone

I'm still new in Hadoop and would like to ask some question. Here is the
situation.

Talend is a data integration tool that can extract data from many sources,
manipulate them and send the result to the target system (in my example
from a CSV file to a MySql database). With Talend's GUI editor I can generate
the Java code just by dragging and dropping the components, then configuring
the properties of each component.  I set the CSV input file from my
local drive, map it to get the fields, and then transfer it to the MySql
database with the Insert command generated by the mapper.

I can export the code as a Plain Old Java Object, and it also includes the .sh
file that can execute the script.

My question is: how do I make that script work in Hadoop? Do I have to make
a code template so my Talend script can work?

Thank you very much for the help





Re: Hadoop cookbook / snippets site?

2008-03-26 Thread Otis Gospodnetic
Hadoop's Wiki sounds like a fine place for this (to start).

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 26, 2008 2:07:38 PM
Subject: Re: Hadoop cookbook / snippets site?


On Mar 26, 2008, at 10:08 AM, Parand Darugar wrote:

> Hello,
>
> Is there a hadoop recipes / snippets / cookbook site? I'm thinking  
> something like the Python Cookbook (http://aspn.activestate.com/ 
> ASPN/Python/Cookbook/) or Django Snippets (http:// 
> www.djangosnippets.org/), where people can post code and commentary  
> for common tasks.
>

Not really. Very good idea though ...

However, please feel free to contribute your useful recipes to either  
mapred.lib or as a separate contrib project (trunk/src/contrib).

Arun

> Best,
>
> Parand






Hadoop summit video capture?

2008-03-25 Thread Otis Gospodnetic
Hi,

Wasn't there going to be a live stream from the Hadoop summit?  I couldn't find 
any references on the event site/page, and searches on veoh, youtube and google 
video yielded nothing.

Is an archived version of the video (going to be) available?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Map/reduce with input files on S3

2008-03-25 Thread Otis Gospodnetic
I don't have the direct answer, but you could also copy the data from S3 to 
local EC2 disk and run from there.  The transfer between S3 and EC2 is free.
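
If you go that route, distcp can do the S3-to-HDFS copy in parallel; a sketch, 
assuming the s3:// FileSystem is set up the way the Hadoop/AWS wiki describes 
(bucket, keys, hostname and paths below are all placeholders):

bin/hadoop distcp s3://ID:SECRET@your-bucket/input hdfs://namenode:9000/input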

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Prasan Ary <[EMAIL PROTECTED]>
To: hadoop 
Sent: Tuesday, March 25, 2008 4:07:15 PM
Subject: Map/reduce with input files on S3

I am running hadoop on EC2. I want to run a jar MR application on EC2 such that 
input and output files are on S3.
   
  I configured hadoop-site.xml so that the fs.default.name property points to my 
s3 bucket with all required identifications (e.g. s3://:@ ). I created an input directory in this bucket and put an input 
file in this directory. Then I restarted hadoop so that the new configuration 
takes effect.
   
  When I try to run the jar file now, I get the message "Hook previously 
registered"  and the application dies.
   
  Any Idea what might have gone wrong?
   
  thanks.

   




Re: Partitioning reduce output by date

2008-03-20 Thread Otis Gospodnetic
Thank you, Doug and Ted, this pointed me in the right direction, which led to 
a custom OutputFormat and a RecordWriter that opens and closes the 
DataOutputStream based on the current key (if the current key differs from the 
previous key, close the previous output and open a new one, then write).

As for partitioning, that worked, too.  My getPartition method now has:

int dateHash = startDate.hashCode();
if (dateHash < 0)
    dateHash = -dateHash;  // note: a hashCode() of Integer.MIN_VALUE would stay negative here;
                           // (dateHash & Integer.MAX_VALUE) avoids that edge case
int partitionID = dateHash % numPartitions;
return partitionID;
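
For the archives, here is a minimal sketch of the "one file per date" 
RecordWriter idea described above, assuming the old org.apache.hadoop.mapred API 
and keys that arrive grouped by date; the class name, key layout and output 
naming are illustrative only:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;

public class DatePartitionedWriter implements RecordWriter<Text, Text> {
  private final FileSystem fs;
  private final Path outputDir;
  private String currentDate;       // date whose file is currently open
  private FSDataOutputStream out;   // stream for that file

  public DatePartitionedWriter(FileSystem fs, Path outputDir) {
    this.fs = fs;
    this.outputDir = outputDir;
  }

  public void write(Text key, Text value) throws IOException {
    // Keys are assumed to start with the date, e.g. "2008-03-01...".
    String date = key.toString().substring(0, 10);
    if (!date.equals(currentDate)) {
      if (out != null) out.close();                         // close the previous date's file
      out = fs.create(new Path(outputDir, date + ".txt"));  // e.g. 2008-03-01.txt
      currentDate = date;
    }
    out.writeBytes(key + "\t" + value + "\n");
  }

  public void close(Reporter reporter) throws IOException {
    if (out != null) out.close();
  }
}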

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Doug Cutting <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date

Otis Gospodnetic wrote:
> That "numPartitions" corresponds to the number of reduce tasks.  What I need 
> is partitioning that corresponds to the number of unique dates (-mm-dd) 
> processed by the Mapper and not the number of reduce tasks.  I don't know the 
> number of distinct dates in the input ahead of time, though, so I cannot just 
> specify the same number of reduces.
> 
> I *can* get the number of unique dates by keeping track of dates in map().  I 
> was going to take this approach and use this number in the getPartition() 
> method, but apparently getPartition(...) is called as each input row is 
> processed by map() call.  This causes a problem for me, as I know the total 
> number of unique dates only after *all* of the input is processed by map().

The number of partitions is indeed the number of reduces.  If you were 
to compute it during map, then each map might generate a different 
number.  Each map must partition into the same space, so that all 
partition 0 data can go to one reduce, partition 1 to another, and so on.

I think Ted pointed you in the right direction: your Partitioner should 
partition by the hash of the date, then your OutputFormat should start 
writing a new file each time the date changes.  That will give you a 
unique file per date.

Doug





Default Combiner or default combining behaviour?

2008-03-20 Thread Otis Gospodnetic
Hi,

The MapReduce tutorial mentions Combiners only in passing.  Is there a default 
Combiner or default combining behaviour?

Concretely, I want to make sure that records are not getting combined behind 
the scenes in some way without me seeing it, and causing me to lose data.  For 
instance, if there is a default Combiner or default combining behaviour that 
collapses multiple records with identical keys and values into a single record, 
I'd like to avoid that.  Instead of blindly collapsing identical records, I'd 
want to aggregate their values and emit the aggregate.
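
My reading of the API so far is that combining only happens when a Combiner 
class is set explicitly on the JobConf, and that a Combiner is expected to 
aggregate the values for a key rather than drop duplicates.  If that is right, 
then what I'm after is something like the sketch below (old API, class names 
purely illustrative): a Reducer that sums counts and doubles as the Combiner.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts for each key instead of collapsing duplicate records.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}

// when setting up the job:
//   conf.setCombinerClass(SumCombiner.class);   // no combining happens unless a combiner is set
//   conf.setReducerClass(SumCombiner.class);    // the same class can also serve as the reducer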

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Partitioning reduce output by date

2008-03-19 Thread Otis Gospodnetic
Hi,

I'm trying to wrap my head around 
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Side-Effect+Files

But I'm also trying the Partitioner.  The part I'm confused about is this:

public int getPartition(Text key, FooBar value, int *numPartitions*) {
...

That "numPartitions" corresponds to the number of reduce tasks.  What I need is 
partitioning that corresponds to the number of unique dates (-mm-dd) 
processed by the Mapper and not the number of reduce tasks.  I don't know the 
number of distinct dates in the input ahead of time, though, so I cannot just 
specify the same number of reduces.

I *can* get the number of unique dates by keeping track of dates in map().  I 
was going to take this approach and use this number in the getPartition() 
method, but apparently getPartition(...) is called as each input row is 
processed by map() call.  This causes a problem for me, as I know the total 
number of unique dates only after *all* of the input is processed by map().

Am I completely misunderstanding the Partitioner? :)

Thanks,
Otis




- Original Message 
From: Ted Dunning <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, March 18, 2008 9:24:14 PM
Subject: Re: Partitioning reduce output by date


Also see my comment about side effect files.

Basically, if you partition on date, then each set of values in the reduce
will have the same date.  Thus the reducer can open a file, write the
values, close the file (repeat).

This gives precisely the effect you were seeking.


On 3/18/08 6:17 PM, "Martin Traverso" <[EMAIL PROTECTED]> wrote:

>> This makes it sound like the Partitioner is only for intermediate
>> map-outputs, and not outputs of reduces.  Also, it sounds like the number of
>> distinct partitions is tied to the number of reduces.  But what if my job
>> uses, say, only 2 reduce tasks, and my input has 100 distinct dates, and as
>> the result, I want to end up with 100 distinct output files?
>> 
> 
> Check out https://issues.apache.org/jira/browse/HADOOP-2906
> 
> Martin






Re: Partitioning reduce output by date

2008-03-18 Thread Otis Gospodnetic
Thanks for the pointer, Arun.  Earlier, I did look at Partitioner in the 
tutorial:

"Partitioner controls the partitioning of the keys of the   
intermediate map-outputs. The key (or a subset of the key) is used to   
derive the partition, typically by a hash function. The total   number 
of partitions is the same as the number of reduce tasks for the   job. 
Hence this controls which of the m reduce tasks the   intermediate key 
(and hence the record) is sent to for reduction."
 
This makes it sound like the Partitioner is only for intermediate map-outputs, 
and not outputs of reduces.  Also, it sounds like the number of distinct 
partitions is tied to the number of reduces.  But what if my job uses, say, 
only 2 reduce tasks, and my input has 100 distinct dates, and as the result, I 
want to end up with 100 distinct output files?

Also, is there a way to specify the name of the final output/filenames (so that 
each of the 100 output files can have its own distinct name, in my case in the 
-mm-dd format)?

If this is explained somewhere, please point.  If it's not, I can document it 
once I have it working.

Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, March 18, 2008 8:17:32 PM
Subject: Re: Partitioning reduce output by date


On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:

> Hi,
>
> What is the best/right way to handle partitioning of the final job  
> output (i.e. output of reduce tasks)?  In my case, I am processing  
> logs whose entries include dates (e.g. "2008-03-01foobar 
> baz").  A single log file may contain a number of different dates,  
> and I'd like to group reduce output by date so that, in the end, I  
> have not a single part-x file but, say, 2008-03-01.txt,  
> 2008-03-02.txt, and so on, one file for each distinct date.
>

You want a custom partitioner...
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner

Arun

> If it helps, the keys in my job include the dates from the input  
> logs, so I could parse the dates out of the keys in the reduce  
> phase, if that's the thing to do.
>
> I'm looking at OutputFormat and RecordWriter, but I'm not sure if  
> that's the direction I should pursue.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>






Partitioning reduce output by date

2008-03-18 Thread Otis Gospodnetic
Hi,

What is the best/right way to handle partitioning of the final job output (i.e. 
output of reduce tasks)?  In my case, I am processing logs whose entries 
include dates (e.g. "2008-03-01foobarbaz").  A single log file may 
contain a number of different dates, and I'd like to group reduce output by 
date so that, in the end, I have not a single part-x file but, say, 
2008-03-01.txt, 2008-03-02.txt, and so on, one file for each distinct date.

If it helps, the keys in my job include the dates from the input logs, so I 
could parse the dates out of the keys in the reduce phase, if that's the thing 
to do.

I'm looking at OutputFormat and RecordWriter, but I'm not sure if that's the 
direction I should pursue.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: pig user meeting, Friday, February 8, 2008

2008-02-06 Thread Otis Gospodnetic
Sorry about the word-wrapping (original email) - Yahoo Mail problem :(

Is anyone going to be capturing the Piglet meeting on video for those of us 
living in other corners of the planet?

Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Stefan Groschupf <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, January 31, 2008 7:09:53 PM
> Subject: pig user meeting, Friday, February 8, 2008
> 
> Hi
> 
there,
> 
> a
> 
couple
> 
of
> 
people
> 
plan
> 
to
> 
meet
> 
and
> 
talk
> 
about
> 
apache
> 
pig
> 
next
> 
Friday  
> in
> 
the
> 
Mountain
> 
View
> 
area.
> (Event
> 
location
> 
is
> 
not
> 
yet
> 
sure).
> If
> 
you
> 
are
> 
interested
> 
please
> 
RSVP
> 
asap,
> 
so
> 
we
> 
can
> 
plan
> 
what
> 
kind
> 
of  
> location
> 
size
> 
we
> 
looking
> 
for.
> 
> http://upcoming.yahoo.com/event/420958/
> 
> Cheers,
> Stefan
> 
> 
> ~~~
> 101tec
> 
Inc.
> Menlo
> 
Park,
> 
California,
> 
USA
> http://www.101tec.com
> 
> 
> 




Re: Hadoop-2438

2008-01-26 Thread Otis Gospodnetic
Miles and Vadim - are you aware of the new Lucene sub-project, Mahout?  I think 
Grant Ingersoll mentioned it here the other day... 
http://lucene.apache.org/mahout/

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Vadim Zaliva <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, January 22, 2008 5:48:05 PM
Subject: Re: Hadoop-2438

On Jan 22, 2008, at 14:44, Ted Dunning wrote:

I am also very interested in machine learning applications of MapReduce.
Collaborative Filtering in particular. If there are some lists/groups/publications
related to this subject I will appreciate any pointers.

Sincerely,
Vadim

> I would love to talk more off-line about our efforts in this regard.
>
> I will send you email.
>
>
> On 1/22/08 2:21 PM, "Miles Osborne" <[EMAIL PROTECTED]> wrote:
>
>> In my case, I'm using actual mappers and reducers, rather than shell script
>> commands.  I've also used Map-Reduce at Google when I was on sabbatical
>> there in 2006.
>>
>> That aside, I do take your point -- you need to have a good grip on what Map
>> Reduce does to understand some of the challenges.  Here at Edinburgh I'm
>> leading a little push to start doing some of our core research within this
>> environment.  As a starter, I'm looking at the simple task of estimating
>> large n-gram based language models using M-R (think 5-grams and upwards from
>> lots of web data).  We are also about to look at core machine learning, such
>> as EM etc within this framework.  So, lots of fun and games ... and for me,
>> it is quite nice doing this kind of thing.  A good break from the usual
>> research.
>>
>> Miles
>>
>> On 22/01/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>
>>> Streaming has some real conceptual confusions awaiting the unwary.
>>>
>>> For instance, if you implement line counting, a correct implementation is
>>> this:
>>>
>>>     stream -mapper cat -reducer 'uniq -c'
>>>
>>> (stream is an alias I use to avoid typing hadoop -jar )
>>>
>>> It is tempting, though very dangerous to do
>>>
>>>     stream -mapper 'sort | uniq -c' -reducer '...add up counts...'
>>>
>>> But this doesn't work right because the mapper isn't to produce output after
>>> the last input line.  (it also tends to not work due to quoting issues, but
>>> we can ignore that issue for the moment).  A similar confusion occurs when
>>> the mapper exits, even normally.  Take the following program:
>>>
>>>     stream -mapper 'head -10' -reducer '...whatever...'
>>>
>>> Here the mapper exits after acting like the identity mapper for the first
>>> ten input records and then exits.  According to the implicit contract, it
>>> should instead stick around and accept all subsequent inputs and not produce
>>> any output.
>>>
>>> The need for fairly deep understanding of how hadoop and how normal shell
>>> processing idioms need to be modified makes streaming a pretty tricky thing
>>> to use, especially for the map-reduce novice.
>>>
>>> I don't think that this problem can be easily corrected since it is due to a
>>> fairly fundamental mismatch between shell programming tradition and what a
>>> mapper or reducer is.
>>>
>>>
>>> On 1/22/08 8:48 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
>>>
> My guess is that this is something to do with caching / buffering, since I
> presume that when the Stream mapper has real work to do, the associated Java
> streamer buffers input until the Mapper signals that it can process more
> data.  If the Mapper is busy, then a lot of data would get cached, causing
> some internal buffer to overflow.

unlikely. the java buffer would be fixed size. it would write to a unix pipe
periodically. if the streaming mapper is not consuming data - the java side
would quickly become blocked writing to this pipe.

the broken pipe case is extremely common and just tells that the mapper died.
best thing to do is find the stderr log for the task (from the jobtracker ui)
and find if the mapper left something there before dying.


if streaming gurus are reading this - i am curious about one unrelated
thing - the