Re: Reviving Nutch 0.7

2007-01-22 Thread Piotr Kosiorowski

Otis,
Some time ago people on the list said that they were willing to at
least maintain the Nutch 0.7 branch. As a committer (not very active
recently) I volunteered to commit patches when they appear - I do not
have enough time at the moment to do active coding. I have created a
0.7.3 release in JIRA so we can start looking at it. So - we are ready
and willing to move Nutch 0.7 forward, but it looks like there is no
interest at the moment.
Regards
Piotr

On 1/22/07, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hi,

I've been meaning to write this message for a while, and Andrzej's 
StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, 
it will be even more valuable than it is today.  However, I think there is 
still a need for something much simpler, something like what Nutch 0.7 used to 
be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few 
developers to maintain and further develop both of these concepts, and the main 
Nutch developers need the more powerful version - 0.8 and beyond.  So, what is 
going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at 
least considering and discussing the possibility of somehow branching that 
version into a parallel project that's not just in a maintenance mode, but has 
its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?

Otis






Re: How to Become a Nutch Developer

2007-01-22 Thread Zaheed Haque

On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:



Well ... so far this process was very informal, because there were so
few key developers that they more or less knew what needs to be done,
and who is doing what.

Hadoop follows a much stricter and formalized model, which we could
adopt, since it apparently works well there. This should address the
issue of notifying others that the work is started on this or that item.


My 2 cents :-) .. I like the way the Hadoop guys work! It is strict, but to
my mind being structured/rigid brings more benefit for the newbie developer,
because you can follow every issue from start to end, with all the comments
in between. I have noticed that some of the mailing-list questions/answers
related to issues are not in the Nutch JIRA, so to follow an issue you have
to go back and forth between the mailing list and JIRA.

IMHO Nutch should adopt the Hadoop model. Furthermore, it's probably a good
idea to discuss this further: Nutch will soon have a 0.9 release, and that
is probably a good time to change to the Hadoop style :-)

Just some thoughts.

Cheers


Re: Reviving Nutch 0.7

2007-01-22 Thread Zaheed Haque

I agree with you that there is a need for 0.7-style Nutch. I wouldn't
say reviving, but more dissecting and re-directing :-). Here you go
--- my focus here is the 0.7 style, i.e. mid-size, enterprise needs.

Solr could use a good crawler, because it has everything else (AFAIK).
It is probably not technically plug-and-play :-), and I am not sure the
Solr community wants a crawler, but it could benefit from such an
add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins
could be refactored to fit into Solr.

I will forward this mail to the Solr community to see if there is any interest.

Cheers


RE: Reviving Nutch 0.7

2007-01-22 Thread Alan Tanaman
Hello,

I'm writing this on behalf of both Armel Nene and myself. 

We think that you and those who have responded have a point.  We've been
experiencing quite a number of problems with getting Nutch 0.8 adapted for
our needs, and making changes to support evolving business requirements as
they come up.

So much so, that we've considered replacing the spine of Nutch with our
own programs, which would still be compatible with the Nutch plugins (same
parameters etc.), but would allow us to make changes and debug more easily.
We've decided to lay out some of our challenges for you to consider.
 
Our major need is the ability to deploy on large enterprise file systems
(1-10 terabytes; large compared to average file systems, but small compared
to the WWW).  We also need to support HTTP, but only for specific web sites,
subscription web sites and so on.  We don't need to replicate a
generic Google-style implementation.

The main features we are currently working on relate primarily to
near-real-time crawling, specifically:
- Incremental Crawling, where changes are monitored at the folder level,
which is much faster than fetching every URL and checking for a change.
Note that this is similar to adaptive crawling, but will be even more
efficient.
- Special handling for parsing of large files (possibly farming those out to
dedicated processors a-la Amazon).  Hadoop would be useful here, but we
would consider re-adding this at a later stage.
- Incremental Indexing, where documents are added to or removed from a live
index, instead of rebuilding a new index each time.
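
For illustration, the incremental indexing idea boils down to something like
the following Lucene-level sketch (the "url" field name and the class below
are assumptions for this example, not existing Nutch code):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;

    public class IncrementalIndexUpdater {

      // Replace a single page in a live index: delete the stale copy keyed by
      // its URL, then add the freshly fetched document, without a full rebuild.
      public static void replaceDocument(Directory dir, String url, Document freshDoc)
          throws IOException {
        IndexReader reader = IndexReader.open(dir);
        reader.deleteDocuments(new Term("url", url));  // drop the old version, if any
        reader.close();

        // false = open the existing index rather than creating a new one
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
        writer.addDocument(freshDoc);
        writer.close();
      }
    }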

We would be happy to join a group of 0.7 developers, if that would enable us
to pursue this enterprise-based direction, which clearly has different
challenges than those facing WWW-crawling.

Best regards,
Alan
_
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com







Re: Fetcher2

2007-01-22 Thread chee wu
Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1.
Any advice on how to use it based on 0.8.1?
- Original Message - 
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2


 Hi all,
 
 I just committed a new implementation of venerable fetcher, called 
 Fetcher2. It uses a producer/consumers model with a set of per-host 
 queues. Theoretically it should be able to achieve a much higher 
 throughput, especially for fetchlists with a lot of contention (many 
 urls from the same hosts).
 
 It should be possible to achieve the same fetching rate with a smaller 
 number of threads, and most importantly to avoid the dreaded "Exceeded 
 http.max.delays: retry later" error.
 
 It is available through bin/nutch fetch2.
 
 From the javadoc:
 
 A queue-based fetcher.
 
 This fetcher uses a well-known model of one producer (a QueueFeeder) and 
 many consumers (FetcherThread-s).
 
 QueueFeeder reads input fetchlists and populates a set of 
 FetchItemQueue-s, which hold FetchItem-s that describe the items to be 
 fetched. There are as many queues as there are unique hosts, but at any 
 given time the total number of fetch items in all queues is less than a 
 fixed number (currently set to a multiple of the number of threads).
 
 As items are consumed from the queues, the QueueFeeder continues to add 
 new input items, so that their total count stays fixed (FetcherThread-s 
 may also add new items to the queues e.g. as a results of redirection) - 
 until all input items are exhausted, at which point the number of items 
 in the queues begins to decrease. When this number reaches 0 fetcher 
 will finish.
 
 This fetcher implementation handles per-host blocking itself, instead of 
 delegating this work to protocol-specific plugins. Each per-host queue 
 handles its own politeness settings, such as the maximum number of 
 concurrent requests and crawl delay between consecutive requests - and 
 also a list of requests in progress, and the time the last request was 
 finished. As FetcherThread-s ask for new items to be fetched, queues may 
 return eligible items or null if for politeness reasons this host's 
 queue is not yet ready.
 
 If there are still unfetched items on the queues, but none of the items 
 are ready, FetcherThread-s will spin-wait until either some items become 
 available, or a timeout is reached (at which point the Fetcher will 
 abort, assuming the task is hung).
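 
 A minimal sketch of the per-host politeness bookkeeping described above 
 might look like this (class and field names are illustrative only, not the 
 actual Fetcher2 source):
 
     import java.util.LinkedList;
 
     class HostQueueSketch {
       private final LinkedList queue = new LinkedList(); // URLs waiting for this host
       private final long crawlDelay;     // minimum ms between requests to one host
       private final int maxConcurrent;   // max simultaneous requests to one host
       private int inProgress = 0;
       private long lastRequestFinished = 0;
 
       HostQueueSketch(long crawlDelay, int maxConcurrent) {
         this.crawlDelay = crawlDelay;
         this.maxConcurrent = maxConcurrent;
       }
 
       synchronized void addItem(String url) {
         queue.addLast(url);
       }
 
       // A fetcher thread asks for work; null means "not ready yet" for
       // politeness reasons (too many requests in flight, or the crawl
       // delay since the last finished request has not yet elapsed).
       synchronized String getEligibleItem(long now) {
         if (queue.isEmpty()) return null;
         if (inProgress >= maxConcurrent) return null;
         if (now - lastRequestFinished < crawlDelay) return null;
         inProgress++;
         return (String) queue.removeFirst();
       }
 
       synchronized void finishItem(long now) {
         inProgress--;
         lastRequestFinished = now;
       }
 
       // Total outstanding work for this host (queued plus in flight).
       synchronized int size() { return queue.size() + inProgress; }
     }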
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com
 
 


Re: Reviving Nutch 0.7

2007-01-22 Thread Sami Siren


Before doubling (or, after 0.9.0, tripling?) the maintenance/development work,
please consider the following:

One option would be refactoring the code so that the parts usable to other
projects, like the protocols and parsers (this was actually proposed by Jukka
Zitting some time last year), would be made independent of the Nutch (and
Hadoop) code. Yeah, this is easy to say, but it would require a significant
amount of work.

The more focused, smaller chunks of Nutch would probably also attract a bigger
audience (perhaps also outside Nutch land), and that way perhaps more people
would be willing to work on them.

I don't know about others, but at least I would be more willing to work
towards this goal than towards one where there would be practically many
separate projects, each sharing common functionality but with different code
bases.

--
Sami Siren


Re: Reviving Nutch 0.7

2007-01-22 Thread Chris Mattmann
 

+1 ;)

This was actually the project proposed by Jerome Charron and myself, called
Tika. We went so far as to create a project proposal, and send it out to
the nutch-dev list, as well as the Lucene PMC for potential Lucene
sub-project goodness. I could probably dig up the proposal should the need
arise.

Good ol' Jukka then took that effort and created a project for us within
Google Code, which in fact still lives there:

http://code.google.com/p/tika/

There hasn't been active development on it because:

1. None of us (I'm speaking for Jerome and myself here) ended up having the
time to shepherd it going forward

2. There was little, if any, response to the proposal on the nutch-dev
list, and few folks willing to contribute (besides people like Jukka)

3. I think, as you correctly note above, most people thought it too much of
a Herculean effort to undertake, one that wouldn't pay the necessary
dividends in the end


In any case, I think that if we are going to maintain separate branches of
the source (in fact, really parallel projects), then an undertaking such as
Tika is probably needed ...

Cheers,
   Chris




 




Re: Reviving Nutch 0.7

2007-01-22 Thread Sami Siren
Chris Mattmann wrote:
 In any case, I think that if we are going to maintain separate branches of
 the source (in fact, really parallel projects), then an undertaking such as
 Tika is probably needed ...

I still don't think we need a separate project to start with; IMO the right
frame of mind is enough to get going. If people think this is the right
direction, and it goes beyond talk, then perhaps after that we could start
talking about a separate project.


--
 Sami Siren




Re: How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes
Thanks to everyone for the input.  I know some of these questions are 
obvious but I wanted to take it from the lowest possible level.


Part of the document is already posted to the wiki here.

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

It seems like I am getting a section done each night, so everything 
should be done in a couple of days.


Dennis Kubes

Chris Mattmann wrote:

Hi Dennis,


On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote:


All,

I am working on a How to Become a Nutch Developer document for the
wiki and I need some input.

I need an overview of how the process for JIRA works?  If I am a
developer new to Nutch and just starting to look at the JIRA and I want
to start working on some piece of functionality or to help with bug
fixes where would I look.


JIRA provides a lot of search facilities: it's actually kind of nice. The
starting point for browsing bugs and other types of issues is:

http://issues.apache.org/jira/browse/NUTCH

(in general, for all Apache projects that use JIRA, you'll find that their
issue tracking system boils down to:

http://issues.apache.org/jira/browse/APACHE_PROJ_JIRA_ID
)

From there, you can access canned filters for open issues like:
Blocker
Critical
Major
Minor
Trivial

For more detailed search capabilities, click on the Find Issues button in
the top breadcrumb bar. Search capabilities there include the ability to
look for issues by developer, status, and issue type, and to combine such
fields using AND and OR. Additionally, you can issue a free-text query across
all issues using the free-text box there.


Would I just choose something that is unscheduled and begin working on it?


That's a good starting point: additionally, high priority issues marked as
Blockers, Critical and Major are always good because the sooner we
(the committers) get a patch for those, the sooner we'll be testing it for
inclusion into the sources.


What if I see something that I want to work on but it is scheduled to
somebody else?


Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you
don't have to do that. ;) Just speak up on the mailing list, and volunteer
your support. One of the people listed in the group nutch-developers in
JIRA (e.g., the committers) can reassign the issue to you so long as the
other gent it was assigned to doesn't mind...


Are items only scheduled to committers or can they be scheduled to
developers as well?  If they can be scheduled to regular developers how
does someone get their name on the list to be scheduled items?


Items can be scheduled to folks listed in the nutch-developers group within
JIRA. Most of these folks are the committers; however, not all of them are.
I'm not entirely sure how folks get into that group (maybe Doug knows?), but
that's the real criterion for having a JIRA issue officially assigned to you.
That doesn't mean, however, that you can't work on things in the meantime. If
there's an issue that you'd like to contribute to, please prepare a patch,
attach it to JIRA, and then speak up on the mailing list. Chances are, with
the recent busy schedules of the committers (including myself), besides Sami
and Andrzej, the committers don't have time to prepare patches for the issues
assigned to them. If you contribute a great patch, a committer will pick
it up, test it, apply it, and you'll get the same effect as if the issue
were directly assigned to you.

Should I submit a JIRA and/or notify the list before I start working on
something?  What is the common process for this?


Yup, that's pretty much it. Voice your desire to work on a particular task
on the nutch-dev list. Many of the developers on that list have been around
for a while now, and they know what's been discussed, and implemented
before.

When I submit a JIRA is there anything else I need to do either in the
JIRA system or with the mailing lists, committers, etc?


Nope: the nutch-dev list is automatically notified by all JIRA issue
submissions, and the committers (and rest of the folks) will pick up on this
and act accordingly.


Getting this information together in one place will go a long way toward
helping others to start contributing more and more.  Thanks for all your
input.


No probs, glad to be of service :-)

Cheers,
  Chris


Dennis Kubes





Re: Fetcher2

2007-01-22 Thread Andrzej Bialecki

chee wu wrote:

Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1.
Any advice on how to use it based on 0.8.1?
  


You would have to port it to Nutch 0.8.1 - e.g. change all Text 
occurences to UTF8, and most likely make other changes too ...
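
For illustration, such a port is mostly a mechanical substitution of the
string Writable type, along the lines of this hypothetical snippet (not
actual Fetcher2 code):

    import org.apache.hadoop.io.UTF8;

    public class PortExample {
      public static void main(String[] args) {
        // Trunk code would write:  Text url = new Text("http://example.com/");
        // Against the 0.8.1-era Hadoop API the equivalent string Writable is UTF8:
        UTF8 url = new UTF8("http://example.com/");
        System.out.println(url);
      }
    }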


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: How to Become a Nutch Developer

2007-01-22 Thread Andrzej Bialecki

Dennis Kubes wrote:
What does the Hadoop project do differently from Nutch?  I thought 
they were both run in about the same way.  Is it that all communication 
on issues goes through JIRA?


The workflow is different - I'm not sure about the details, perhaps Doug 
can correct me if I'm wrong ... and yes, it uses JIRA extensively.


1. An issue is created
2. Patches are added, removed, commented, etc.
3. Finally, a candidate patch is selected, and the issue is marked 
"Patch Available".
4. An automated process applies the patch to a temporary copy, and 
checks whether it compiles and passes junit tests.
5. A list of patches in this state is available, and committers may pick 
from this list and apply them.
6. An explicit link is made between the issue and the change set 
committed to svn (Is this automated?)
7. The issue is marked as Resolved, but not closed. I believe issues 
are closed only when a release is made, because issues in state 
resolved make up the Changelog. I believe this is also automated.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

2007-01-22 Thread Brian Whitman

On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:


However I cannot find from the change logs of hadoop that what the
change is that is causing nutch these problems.


It's HADOOP-331, so i guess at least the changes/additions in map() is
required.


Hi, just following up here-- does this indicate that if I get a  
hadoop nightly that was patched for HADOOP-331 and have Nutch use it,  
the EOFException will go away in the latest nightlies?


I tried that, it now crashes in a different spot, during fetching:

2007-01-22 11:34:53,051 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 1.0 pages/s, 20 kb/s,
2007-01-22 11:34:53,134 WARN  mapred.LocalJobRunner - job_yzavye
java.lang.NoSuchMethodError: org.apache.hadoop.io.MapFile$Writer.<init>(Lorg/apache/hadoop/fs/FileSystem;Ljava/lang/String;Ljava/lang/Class;Ljava/lang/Class;)V
        at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:58)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:303)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
2007-01-22 11:34:53,398 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:441)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)





Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting

Andrzej Bialecki wrote:
The workflow is different - I'm not sure about the details, perhaps Doug 
can correct me if I'm wrong ... and yes, it uses JIRA extensively.


1. An issue is created
2. patches are added, removed commented, etc...
3. finally, a candidate patch is selected, and the issue is marked 
Patch available.


"Patch Available" is code for "the contributor now believes this is 
ready to commit".  Once a patch is in this state, a committer reviews it 
and either commits it or rejects it, changing the state of the issue 
back to Open.  The set of issues in Patch Available thus forms a 
work queue for committers.  We try not to let a patch sit in this state 
for more than a few days.


4. An automated process applies the patch to a temporary copy, and 
checks whether it compiles and passes junit tests.


This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't 
be hard to run this for Nutch on lucene.zones.apache.org, and I think 
Nigel would probably gladly share his scripts.  This step saves 
committers time: if a patch doesn't pass unit tests, or has javadoc 
warnings, etc. this can be identified automatically.


5. A list of patches in this state is available, and committers may pick 
from this list and apply them.
6. An explicit link is made between the issue and the change set 
committed to svn (Is this automated?)


Jira does this based on commit messages.  Any bug ids mentioned in a 
commit message create links from that bug to the revision in Subversion. 
Hadoop commit messages usually start with the bug id, e.g., 
"HADOOP-1234.  Remove a deadlock in the oscillation overthruster."


7. The issue is marked as Resolved, but not closed. I believe issues 
are closed only when a release is made, because issues in state 
resolved make up the Changelog. I believe this is also automated.


Jira will put resolved issues into the release notes regardless of 
whether they're closed.  The reason we close issues on release is to 
keep folks from re-opening them.  We want the release notes to be the 
list of changes in a release, so we don't want folks re-opening issues 
and having new commits made against them, since then the changes related 
to the issue will span multiple releases.  If an issue is closed but 
there's still a problem, a new issue should be created linking to the 
prior issue, so that the new issue can be scheduled and tracked without 
modifying what should be a read-only release.


Doug




Re: Reviving Nutch 0.7

2007-01-22 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Yes, certainly, anything that can be shared and decoupled from pieces that make 
each branch (not SVN/CVS branch) different, should be decoupled.  But I was 
really curious about whether people think this is a valid idea/direction, not 
necessarily immediately how things should be implemented.  In my mind, one 
branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, 
etc.  That's the branch that's in the trunk.  The other branch is a simpler 
branch without all that Hadoop stuff, for folks who need to fetch, index, and 
search a few hundred thousand or a few million or even a few tens of millions 
of pages, and don't need replication, etc. that comes with Hadoop.  That branch 
could be based off of 0.7.  I also know that a lot of people are trying to use 
Nutch to build vertical search engines, so there is also a need for a focused 
fetcher.  Kelvin Tan brought this up a few times, too, I believe.


Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: 
everything should run fine in a single process by default.  If there are 
bugs in this they should be logged, folks who care should submit 
high-quality, back-compatible, generally useful patches, and committers 
should work to get these patches committed to the trunk.


Second, if there are to be two modes of operation, wouldn't they best be 
developed in a common source tree, so that they share as much as 
possible and diverge as little as possible?  It seems to me that a good 
architecture would be to agree on a common high-level API, then use two 
different runtimes underneath, one to support distributed operation, and 
one to support standalone operation.  Hey!  That's what Hadoop already 
does!  Maybe it's not perfect and someone can propose a better way to 
share maximal amounts of code, but the code split should probably be 
into different classes and packages in a single source tree maintained 
by a single community of developers, not by branching a single source 
tree in a revision control and splitting the developers.


Third, part of the problem seems to be that there are too few 
contributors--that the challenges are big and the resources limited. 
Splitting the project will only spread those resources more thinly.


What really is the issue here?  Are good patches languishing?  Are there 
patches that should be committed (meet coding standards, are 
back-compatible, generally useful, etc.) but are not?  A great patch is 
one that a committer can commit it with few worries: it includes new 
unit tests, it passes all existing unit tests, it fixes one thing only, 
etc.  Such patches should not have to wait long for commit.  And once 
someone submits a few such patches, then one should be invited to become 
a committer.


It sounds to me like the problem is that, off-the-shelf, Nutch does not 
yet solve all the problems folks would like it to: e.g., it has never 
done a good job with incremental indexing.  Folks see progress made on 
scalability, but really wish it were making more progress on 
incrementality or something else.  But it's not going to make progress 
on incrementality without someone doing the work.  A fork or a branch 
isn't going to do the work.  I don't see any reason that the work cannot 
be done right now.  It can be done incrementally: e.g., if the web db 
API seems inappropriate for incremental updates, then someone should 
submit a patch that provides an incremental web db API, updating the 
fetcher and indexer to use this.  A design for this on the wiki would be 
a good place to start.


Finally, web crawling, indexing and searching are data-intensive. 
Before long, users will want to index tens or hundreds of millions of 
pages.  Distributed operation is soon required at this scale, and 
batch-mode is an order-of-magnitude faster.  So be careful before you 
throw those features out: you might want them back soon.


Doug




Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

2007-01-22 Thread Sami Siren
Brian Whitman wrote:
 On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:
 
 However I cannot find from the change logs of hadoop that what the
 change is that is causing nutch these problems.

 It's HADOOP-331, so i guess at least the changes/additions in map() is
 required.
 
 Hi, just following up here-- does this indicate that if I get a hadoop
 nightly that was patched for HADOOP-331 and have Nutch use it, the
 EOFException will go away in the latest nightlies?

No, I mean that HADOOP-331 is the change that is _causing_ these, so we
need to adapt the Nutch code to cope with the change in sorting.

Can somebody tell me why the various utilities (like Indexer) do the
wrapping to ObjectWritable in the InputFormat and not in Mapper.map in the
first place? Is this an optimization of some kind?

--
 Sami Siren


Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

2007-01-22 Thread Andrzej Bialecki

Sami Siren wrote:


No, I mean that HADOOP-331 is the change that is _causing_ these, so we
need to adapt the Nutch code to cope with the change in sorting.

Can somebody tell me why the various utilities (like Indexer) do the
wrapping to ObjectWritable in the InputFormat and not in Mapper.map in the
first place? Is this an optimization of some kind?
  


This is a legacy from the (very recent) times when you had to set a 
key/value class of the InputFormat in your mapred job. You don't have to 
do this now - it's handled transparently by 
InputFormat.getRecordReader().createKey() and createValue().


In fact, there's a lot of this cruft left over in Nutch. We should also 
use GenericWritable in most of these places, and indeed we could wrap 
the values in Mapper.map().
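
A minimal sketch of what wrapping in Mapper.map() could look like, using the
old non-generic Mapper interface (the class below is hypothetical, not
existing Nutch code):

    import java.io.IOException;

    import org.apache.hadoop.io.ObjectWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WrappingMapper extends MapReduceBase implements Mapper {

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter) throws IOException {
        // Wrap whatever concrete type the record reader produced (CrawlDatum,
        // ParseData, ...) so the reduce side can receive mixed value types,
        // instead of doing the wrapping in a custom InputFormat.
        output.collect(key, new ObjectWritable(value));
      }
    }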


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes

+1 for adopting the same types of process with Nutch.





Re: Reviving Nutch 0.7

2007-01-22 Thread AJ Chen

On 1/22/07, Doug Cutting [EMAIL PROTECTED] wrote:



Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages.  Distributed operation is soon required at this scale, and
batch-mode is an order-of-magnitude faster.  So be careful before you
throw those features out: you might want them back soon.

Doug


As a developer building an application on top of Nutch, my experience is that
I can't go back to version 0.7.x, because the features in version 0.8/0.9 are
so much needed even for non-distributed crawling/indexing. For example, I
can run crawling/indexing on a Linux server and a Windows laptop separately,
and merge the newly crawled databases into the main crawldb. I remember
v0.7 can't merge separate crawldbs without lots of customization.

It may take some time to switch from 0.7.x to 0.8/0.9, especially if you
have lots of customization code. But once you get over this one hurdle, you
will enjoy the new and better features of the 0.8/0.9 versions.  Also, this
may be the time to re-think the design of your application. For my own
project, I always try to separate my code from the Nutch core code as much as
possible, so that I can easily upgrade the application to keep up with new
Nutch releases. Keeping away from the newest Nutch version seems somewhat
backward to me.

AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org


Re: How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes

Doug

Can you answer the question of how to add developer names to JIRA or if 
that is only for committers?


Dennis





Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting

Dennis Kubes wrote:
Can you answer the question of how to add developer names to JIRA or if 
that is only for committers?


It's not just for committers, but also for regular contributors.  I have 
added you.  Anyone else?


Doug


Finished How to Become a Nutch Developer

2007-01-22 Thread nutch-dev
All,

Draft version of How to Become a Nutch Developer is on the wiki at:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Please take a look, and if you think anything needs to be added, removed,
or changed, let me know.

Dennis Kubes