Re: The Future of Nutch, reactivated

2009-05-23 Thread Julien Nioche
Hi,

I am joining the conversation a bit late, but never mind...

In my view the main target should be (2). As you pointed out, Solr covers
(3) and (4) quite well (or will progressively do so). As for (1), there is
definitely an audience, even if it is small, and it would certainly benefit
from the work done towards (2). As you said, operating on a large scale
(i.e. using more than 100 slaves) requires a lot of resources and a
dedicated team, and I expect that the people interested in large scale
would have their own views on scoring and spam prevention anyway :-)

I completely agree that we should delegate as much functionality as
possible to third parties (e.g. parsing with Tika) in order to focus on
the core competences.
I really like your idea of doing template detection for instance. Another
thing I found promising is the HBase integration (NUTCH-650), which would
also allow more interoperability with other tools such as Heritrix and make
the data structure a bit more open.

Talking about future functionality, we do quite a lot of text analysis
with tools like Gate or UIMA and have been working on things such as
detection of adult content and automatic text classification with Nutch.
There are plenty of interesting things that can be done for vertical search
systems, such as Named Entity Extraction. Since NLP applications can be
quite resource-hungry, leveraging Hadoop is definitely an advantage. In
future versions of Nutch I'd love to see a separation between format
parsing (i.e. Tika) and content analysis, where implementations would get a
semi-structured representation of the documents, a bit like what extensions
of HTML parsers are getting currently, but regardless of the original
format.
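
As a rough illustration of the separation described above, here is a minimal
Java sketch. Every name in it (AnalysisSketch, ParsedDocument,
ContentAnalyzer, keywordFlagger, runAll) is hypothetical and exists in
neither Nutch nor Tika; it only assumes that a format parser can hand over
a title and a list of paragraphs, whatever the original format was.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: none of these types exist in Nutch or Tika. */
final class AnalysisSketch {

    /** Format-neutral view of a document, as a format parser could hand it over. */
    record ParsedDocument(String url, String title, List<String> paragraphs) {}

    /** Content analysis sees only ParsedDocument, never the raw bytes or MIME type. */
    interface ContentAnalyzer {
        Map<String, String> analyze(ParsedDocument doc);
    }

    /** Toy analyzer: reports whether any paragraph contains one of the (lowercase) terms. */
    static ContentAnalyzer keywordFlagger(String label, List<String> lowercaseTerms) {
        return doc -> {
            boolean hit = doc.paragraphs().stream()
                    .map(p -> p.toLowerCase(Locale.ROOT))
                    .anyMatch(p -> lowercaseTerms.stream().anyMatch(p::contains));
            return Map.of(label, Boolean.toString(hit));
        };
    }

    /** Runs the analyzers in order and merges their annotations into one map. */
    static Map<String, String> runAll(ParsedDocument doc, List<ContentAnalyzer> analyzers) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (ContentAnalyzer analyzer : analyzers) {
            merged.putAll(analyzer.analyze(doc));
        }
        return merged;
    }

    public static void main(String[] args) {
        // The URL ends in .pdf, but the analyzers never need to know that.
        ParsedDocument doc = new ParsedDocument(
                "http://example.org/report.pdf",
                "Quarterly report",
                List.of("Processing ran on a Hadoop cluster.", "No other notes."));
        System.out.println(runAll(doc, List.of(
                keywordFlagger("mentions-hadoop", List.of("hadoop")),
                keywordFlagger("mentions-solr", List.of("solr")))));
    }
}
```

A real analyzer (classification, entity extraction) would replace the toy
keyword check; the point of the sketch is only that it would plug in behind
the same interface for HTML, PDF or any other source format.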

Have a good weekend

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/5/14 Andrzej Bialecki a...@getopt.org

 Hi all,

 I'd like to revive this thread and gather additional feedback so that we
 end up with concrete conclusions. Much of what I write below others have
 said before, I'm trying here to express this as it looks from my point of
 view.

 Target audience
 ===
 I think that the Nutch project experiences a crisis of personality now - we
 are not sure what is the target audience, and we cannot satisfy everyone. I
 think that there are following groups of Nutch users:

 1. Large-scale Internet crawl & search: actually, there are only few
 such users, because it takes considerable resources to manage operations
 on that scale. Scalability, manage-ability and ranking/spam prevention are
 the chief concerns here.

 2. Medium-scale vertical search: I suspect that many Nutch users fall into
 this category. Modularity, flexibility in implementing custom processing,
 ability to modify workflows and to use only some Nutch components seem to be
 chief concerns here. Scalability too, but only up to a volume of ~100-200
 mln documents.

 3. Small- to medium-scale enterprise search: there's a sizeable number of
 Nutch users that fall into this category, for historical reasons. Link-based
 ranking and resource discovery are not that important here, but integration
 with Windows networking, Microsoft formats and databases, as well as
 realtime indexing and easy index maintenance are crucial. This class of
 users often has to heavily customize Nutch to get any sensible result. Also,
 this is where Solr really shines, so there is little benefit in using Nutch
 here. I predict that Nutch will have fewer and fewer users of this type.

 4. Single desktop to small intranet search: as above, but the accent is on
 the ease of use out of the box, and an often requested feature is a GUI
 frontend. Currently IMHO Nutch is too complex and requires too much
 command-line operation for casual users to make this use case attractive.

 What is the target audience that we as a community want to support? By this
 I mean not only the moral support, but also active participation in the
 development process. From the place where we are at the moment we could go
 in any of the above directions.

 Core competence
 ===
 This is a simple but important point. Currently we maintain several major
 subsystems in Nutch that are implemented by other projects, and often in a
 better way. Plugin framework (and dependency injection) and content parsing
 are two areas that we have to delegate to third-party libraries, such as
 Tika and OSGI or some other simple IOC container - probably there are other
 components that we don't have to do ourselves. Another thing that I'd love
 to delegate is the distributed search and index maintenance - either through
 Solr or Katta or something else.

 The question then is, what is the core competence of this project? I see
 the following major areas that are unique to Nutch:

 * crawling - this includes crawl scheduling (and re-crawl scheduling),
 discovery and classification of new resources, strategies for crawling
 specific sets of URLs (hosts and domains) under 

Re: The Future of Nutch, reactivated

2009-05-15 Thread Raymond Balmès
I'm still a new user, so although I found it rather easy to get going and
build my own plugins, I have some suggestions.

One thing that I'd like to see is some way to estimate how long a certain
step (fetch, ...) will take, something like a progress bar. You launch a
step and it can run for days, working perfectly, yet you have no idea when
it might eventually end.
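
A progress estimate of this kind can be sketched with simple linear
extrapolation. The class below is hypothetical and not part of Nutch; it
assumes the step knows up front how many items it has to process and that
the processing rate stays roughly constant, which a real fetch will only
approximate.

```java
import java.time.Duration;

/** Hypothetical helper, not part of Nutch: estimates the time left for a long-running step. */
final class StepEta {

    /**
     * Linear extrapolation: remaining = elapsed * (total - done) / done.
     * Assumes a roughly constant processing rate.
     */
    static Duration estimateRemaining(long done, long total, Duration elapsed) {
        if (done <= 0 || total < done) {
            throw new IllegalArgumentException("need 0 < done <= total");
        }
        long remainingMillis = elapsed.toMillis() * (total - done) / done;
        return Duration.ofMillis(remainingMillis);
    }

    public static void main(String[] args) {
        // Example: 250,000 of 1,000,000 URLs processed after 6 hours.
        Duration eta = estimateRemaining(250_000, 1_000_000, Duration.ofHours(6));
        System.out.println("Estimated time remaining: " + eta.toHours() + " h");
    }
}
```

With a quarter of the work done in 6 hours, the remaining three quarters
extrapolate to 18 hours.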

I find the web front end rather difficult to change, and I lost a lot of
time trying to understand how the NutchBean works.
Coming from Lucene, it took me a while to find out all the limitations it
has. I haven't played much with the Nutch-Solr integration, but from the
sound of it, it looks more powerful; whether it is simpler is my concern.
-Raymond-
2009/5/14 Mattmann, Chris A chris.a.mattm...@jpl.nasa.gov

 Hi Andrzej,

 Great summary. My general feeling on this is similar to my prior comments on
 similar threads from Otis and from Dennis. My personal pet projects for
 Nutch2:

 * refactored Nutch core data structures, modeled as POJOs
 * refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
 are insulated from the underlying messaging substrate (e.g., crawl over JMS,
 EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other
 framework, etc.)
 * simpler Nutch deployment mechanisms (separate Nutch deployment package
 from source code package), think about using Maven2

 +1 to all of those and other ideas for how to improve the project's focus.

 Cheers,
 Chris



Re: The Future of Nutch, reactivated

2009-05-15 Thread consultas

Keep it simple.
Many people, it seems to me, use Nutch to exercise, in some way, their
programming expertise and talents.
I am just a user, and I think that users just want something that can index
the web and find results when they search. I don't want to deal with
complicated application names; I just want to crawl and search. And it
should be noted that for most users, myself included, it is not a
trivial job to get Nutch working, on Linux or Windows.
Anyway, I think that the biggest use for Nutch will be for vertical or
regional search purposes.
By the way, from this point of view I really didn't like the experience with
the original release of version 1.0: the crawling phase was too slow.



- Original Message - 
From: Andrzej Bialecki a...@getopt.org

To: nutch-user@lucene.apache.org
Sent: Thursday, May 14, 2009 10:45 AM
Subject: The Future of Nutch, reactivated



Hi all,

I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below others have
said before, I'm trying here to express this as it looks from my point
of view.

Target audience
===
I think that the Nutch project experiences a crisis of personality now -
we are not sure what is the target audience, and we cannot satisfy
everyone. I think that there are following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manage-ability and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, ability to modify workflows and to use only some Nutch
components seem to be chief concerns here. Scalability too, but only up
to a volume of ~100-200 mln documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats and databases,
as well as realtime indexing and easy index maintenance are crucial.
This class of users often has to heavily customize Nutch to get any
sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is
on the ease of use out of the box, and an often requested feature is a
GUI frontend. Currently IMHO Nutch is too complex and requires too much
command-line operation for casual users to make this use case attractive.

What is the target audience that we as a community want to support? By
this I mean not only the moral support, but also active participation in
the development process. From the place where we are at the moment we
could go in any of the above directions.

Core competence
===
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are implemented by other projects, and
often in a better way. Plugin framework (and dependency injection) and
content parsing are two areas that we have to delegate to third-party
libraries, such as Tika and OSGI or some other simple IOC container -
probably there are other components that we don't have to do ourselves.
Another thing that I'd love to delegate is the distributed search and
index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling),
discovery and classification of new resources, strategies for crawling
specific sets of URLs (hosts and domains) under bandwidth and netiquette
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror
detection (and URL aliasing) but also link spam detection and a more
complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link
analysis package and pluggable FetchSchedule and Signature. A lot
remains to be done here, and we are still spending a lot of resources on
dealing with issues outside this core competence.

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community
of users, contributors and committers, so that the project is most
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than 1 direction at a
time ... ;)

* we need to re-architect Nutch to focus on 

Re: The Future of Nutch, reactivated

2009-05-14 Thread AJ Chen
Andrzej, great summary. I played with nutch before as a web search engine,
but have not used it for a while because it has become too complicated. Based
on my experience in building a semantic search engine for the healthcare
vertical, I think it would be beneficial to separate crawling from search
architecturally and to focus on just crawling for nutch.

My sense is that, if nutch can make crawling simple and deliver high-quality
crawled contents along with important metadata like link structure, it will
have a much better chance of becoming an indispensable part of a search
engine. Of course, it's important to include an implementation for search as
well, so that nutch can provide end-to-end (i.e. crawl and search) results
for evaluation. But don't get stuck in search, because there are a variety
of different search needs, such as static search, dynamic search, real-time
search, semantic search, etc. It's not easy to make nutch meet all of
these real-world needs. Rather, nutch should provide the crawled contents in
a way that lets people easily apply different search tools or search
technology.

As for the audience, it makes sense to focus on the middle of the usage
spectrum, i.e. vertical search or focused search at mid-range scale. But I
wouldn't ignore the small projects or developer projects, because this is
often the starting point for new project evaluation.

-aj
-- 
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA


Re: The Future of Nutch, reactivated

2009-05-14 Thread Mattmann, Chris A
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
are insulated from the underlying messaging substrate (e.g., crawl over JMS,
EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other
framework, etc.)
* simpler Nutch deployment mechanisms (separate Nutch deployment package
from source code package), think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.
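
To make the first two bullet points a little more concrete, here is a small
Java sketch of a plain data object and a processing stage that only sees an
abstract channel. All names (SubstrateSketch, PageRecord, RecordChannel,
InMemoryChannel, runStage) are hypothetical and are not Nutch classes; the
in-memory channel merely stands in for whatever substrate (JMS, Hadoop, RMI)
would sit behind the same interface.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.UnaryOperator;

/** Illustrative only: these types are hypothetical, not Nutch classes. */
final class SubstrateSketch {

    /** Plain data object with no framework-specific base class or serialization. */
    record PageRecord(String url, String content) {}

    /** The only view of the messaging layer that processing code gets. */
    interface RecordChannel {
        void send(PageRecord rec);
        PageRecord receive(); // returns null when the channel is empty
    }

    /** In-process stand-in; another substrate would implement the same interface. */
    static final class InMemoryChannel implements RecordChannel {
        private final Queue<PageRecord> queue = new ArrayDeque<>();
        @Override public void send(PageRecord rec) { queue.add(rec); }
        @Override public PageRecord receive() { return queue.poll(); }
    }

    /** Drains `in`, applies `step` to each record, writes results to `out`; returns the count. */
    static int runStage(RecordChannel in, RecordChannel out, UnaryOperator<PageRecord> step) {
        int processed = 0;
        for (PageRecord r = in.receive(); r != null; r = in.receive()) {
            out.send(step.apply(r));
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        RecordChannel fetched = new InMemoryChannel();
        RecordChannel parsed = new InMemoryChannel();
        fetched.send(new PageRecord("http://example.org/", "<p>Hello</p>"));
        // Toy "parse" step: strip tags with a naive regex.
        int n = runStage(fetched, parsed,
                r -> new PageRecord(r.url(), r.content().replaceAll("<[^>]+>", "")));
        System.out.println(n + " record(s); first content: " + parsed.receive().content());
    }
}
```

The stage logic in runStage never mentions a concrete transport, which is
the kind of insulation the bullet points ask for.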

Cheers,
Chris


On 5/14/09 6:45 AM, Andrzej Bialecki a...@getopt.org wrote:

 Hi all,
 
 I'd like to revive this thread and gather additional feedback so that we
 end up with concrete conclusions. Much of what I write below others have
 said before, I'm trying here to express this as it looks from my point
 of view.
 
 Target audience
 ===
 I think that the Nutch project experiences a crisis of personality now -
 we are not sure what is the target audience, and we cannot satisfy
 everyone. I think that there are following groups of Nutch users:
 
 1. Large-scale Internet crawl & search: actually, there are only few
 such users, because it takes considerable resources to manage operations
 on that scale. Scalability, manage-ability and ranking/spam prevention
 are the chief concerns here.
 
 2. Medium-scale vertical search: I suspect that many Nutch users fall
 into this category. Modularity, flexibility in implementing custom
 processing, ability to modify workflows and to use only some Nutch
 components seem to be chief concerns here. Scalability too, but only up
 to a volume of ~100-200 mln documents.
 
 3. Small- to medium-scale enterprise search: there's a sizeable number
 of Nutch users that fall into this category, for historical reasons.
 Link-based ranking and resource discovery are not that important here,
 but integration with Windows networking, Microsoft formats and databases,
 as well as realtime indexing and easy index maintenance are crucial.
 This class of users often has to heavily customize Nutch to get any
 sensible result. Also, this is where Solr really shines, so there is
 little benefit in using Nutch here. I predict that Nutch will have fewer
 and fewer users of this type.
 
 4. Single desktop to small intranet search: as above, but the accent is
 on the ease of use out of the box, and an often requested feature is a
 GUI frontend. Currently IMHO Nutch is too complex and requires too much
 command-line operation for casual users to make this use case attractive.
 
 What is the target audience that we as a community want to support? By
 this I mean not only the moral support, but also active participation in
 the development process. From the place where we are at the moment we
 could go in any of the above directions.
 
 Core competence
 ===
 This is a simple but important point. Currently we maintain several
 major subsystems in Nutch that are implemented by other projects, and
 often in a better way. Plugin framework (and dependency injection) and
 content parsing are two areas that we have to delegate to third-party
 libraries, such as Tika and OSGI or some other simple IOC container -
 probably there are other components that we don't have to do ourselves.
 Another thing that I'd love to delegate is the distributed search and
 index maintenance - either through Solr or Katta or something else.
 
 The question then is, what is the core competence of this project? I see
 the following major areas that are unique to Nutch:
 
 * crawling - this includes crawl scheduling (and re-crawl scheduling),
 discovery and classification of new resources, strategies for crawling
 specific sets of URLs (hosts and domains) under bandwidth and netiquette
 constraints, etc.
 
 * web graph analysis - this includes link-based ranking, mirror
 detection (and URL aliasing) but also link spam detection and a more
 complex control over the crawling frontier.
 
 Anything more? I'm not sure - perhaps I would add template detection and
 pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
 
 Nutch 1.0 already made some steps in this direction, with the new link
 analysis package and pluggable FetchSchedule and Signature. A lot
 remains to be done here, and we are still spending a lot of resources on
 dealing with issues outside this core competence.
 
 ---
 
 So, what do we need to do next?
 
 * we need to decide where we should commit our resources, as a community
 of users, contributors and committers, so that the project is most
 useful to our target audience. At this point there are few active
 committers, so I don't think we can cover more than 1 direction at a
 time ... ;)
 
 * we need to re-architect Nutch to focus on our core competence, and
 delegate what we can to other projects.
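
One of the core competences named in the quoted message is crawling hosts
under bandwidth and netiquette constraints. A minimal Java sketch of the
simplest such constraint, a fixed minimum delay between requests to the same
host, could look like this. The class and method names are hypothetical and
this is not how Nutch's fetcher is implemented; timestamps are passed in
explicitly so the behaviour is easy to follow.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host politeness tracker; not Nutch code. */
final class PolitenessSketch {
    private final long minDelayMillis;
    // For each host, the earliest time (ms) at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns the earliest time (ms) at which `host` may be fetched,
     * and reserves that slot so the following request is pushed back.
     */
    long reserve(String host, long nowMillis) {
        long slot = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, slot + minDelayMillis);
        return slot;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(5_000);
        System.out.println(politeness.reserve("a.example", 1_000)); // first request: immediately
        System.out.println(politeness.reserve("a.example", 1_200)); // same host: delayed
        System.out.println(politeness.reserve("b.example", 1_200)); // other host: immediately
    }
}
```

Hosts are throttled independently, so a slow or delay-bound host does not
hold up fetching from the others.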
 
 

Re: The Future of Nutch

2009-04-02 Thread Thorsten Scherler
On Wed, 2009-04-01 at 07:42 -0700, Ken Krugler wrote:
...
 I would suggest looking at Katta (http://katta.sourceforge.net/). 
 It's one of several projects where the goal is to support very large 
 Lucene indexes via distributed shards. Solr has also added federated 
 search support.

Interesting. Thanks for the link Ken.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)






Re: The Future of Nutch

2009-04-02 Thread Doğacan Güney
On Wed, Apr 1, 2009 at 17:42, Ken Krugler kkrugler_li...@transpac.com wrote:

  On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:

  hi dennis:
  ...
  I am confident that hadoop can process the large datas of the www search
  engine! But lucene? I am afraid of the limited size of lucene's index per
  server is very little ,10G? or 30G? this is not enough for the www search
  engine! IMO, this is a bottleneck!

 I agree that the actual problem/solution of accessing lucene indexes is
 to keep them small. What does the possibility of having a clouded index
 serve if accessing it takes hours?

 For me here should lie one of nutch core competences: making search in
 BIG indexes fast (as fast as in SMALL indexes).


 I would suggest looking at Katta (http://katta.sourceforge.net/). It's one
 of several projects where the goal is to support very large Lucene indexes
 via distributed shards. Solr has also added federated search support.


I agree. I think the new index framework should be flexible enough that we
can support Katta along with Solr. Actually, this is one of the things I
want to do before the next major release.
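
What "flexible enough" could mean is sketched below in Java: the indexing
code programs against one small interface, and a single-index backend and a
sharded backend both implement it. All names are hypothetical, and nothing
here calls Solr or Katta; the two classes are in-memory stand-ins, and
routing by URL hash is just one possible way to assign documents to shards.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: in-memory stand-ins, no Solr or Katta calls. */
final class IndexSketch {

    /** What an indexing job would program against, whichever backend is configured. */
    interface IndexBackend {
        void add(String url, String text);
        int size();
    }

    /** Stand-in for a single-index backend; keeps only the URLs it was given. */
    static final class SingleIndex implements IndexBackend {
        private final List<String> urls = new ArrayList<>();
        @Override public void add(String url, String text) { urls.add(url); }
        @Override public int size() { return urls.size(); }
    }

    /** Stand-in for a sharded backend: routes each document to one shard by URL hash. */
    static final class ShardedIndex implements IndexBackend {
        private final List<SingleIndex> shards = new ArrayList<>();

        ShardedIndex(int shardCount) {
            if (shardCount < 1) throw new IllegalArgumentException("need at least one shard");
            for (int i = 0; i < shardCount; i++) shards.add(new SingleIndex());
        }

        /** floorMod keeps the result in [0, shardCount) even for negative hash codes. */
        int shardFor(String url) { return Math.floorMod(url.hashCode(), shards.size()); }

        int shardSize(int shard) { return shards.get(shard).size(); }

        @Override public void add(String url, String text) { shards.get(shardFor(url)).add(url, text); }
        @Override public int size() { return shards.stream().mapToInt(SingleIndex::size).sum(); }
    }

    /** Indexing code that is unaware of which backend it writes to. */
    static void indexAll(IndexBackend backend, List<String> urls) {
        for (String url : urls) backend.add(url, "content of " + url);
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://a.example/", "http://b.example/", "http://c.example/");
        IndexBackend single = new SingleIndex();
        IndexBackend sharded = new ShardedIndex(2);
        indexAll(single, urls);
        indexAll(sharded, urls);
        System.out.println(single.size() + " docs in single index, "
                + sharded.size() + " docs across shards");
    }
}
```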



 -- Ken
 --
 Ken Krugler
 +1 530-210-6378




-- 
Doğacan Güney


Re: The Future of Nutch

2009-03-31 Thread Thorsten Scherler
On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:
 hi dennis:
...
 I am confident that hadoop can process the large datas of the  www search
 engine! But lucene? I am afraid of the limited size of lucene's index per
 server is very little ,10G? or 30G? this is not enough for the www search
 engine! IMO, this is a bottleneck!

I agree that the actual problem/solution of accessing Lucene indexes is
to keep them small. What use is an index in the cloud if accessing it
takes hours?

For me, this is where one of Nutch's core competences should lie: making
search in BIG indexes fast (as fast as in SMALL indexes).

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source consulting, training and solutions



Re: The Future of Nutch

2009-03-31 Thread Thorsten Scherler
On Fri, 2009-03-20 at 11:55 +0200, Doğacan Güney wrote:
 Hi,
 
 On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:
...
  Since there are different purposes for different users, would it be good to
  consider moving Nutch to a top level apache project out from under the
  Lucene umbrella?  This would then allow the creation of nutch sub-projects,
  such as nutch-solr, nutch-hbase.  Thoughts?
 
  Many parts of Nutch have also been implemented in other projects.  For
  example, Tika for the parsers, Droids for the Crawler.  It begs the question
  what Nutch's core features are going forward.  When I think about search
  (again my perspective is large scale), I think crawling or acquisition of
  data, parsing, analysis, indexing, deployment, and searching.  I personally
  think that there is much room for improvement in crawling and especially
  analysis.  Nutch shouldn't just be about the shell but also the brains.
 
 
...
 So I think delegating nutch functionality to other projects
 (tika/droids/solr/etc)
 is a great idea (so nutch can focus on the brains as Dennis said), but
 I don't like the idea of separating nutch into pieces.

I had hoped to meet some Nutch people at ApacheCon to talk about this mail.

Droids is ATM incubating with two sponsor projects, HC and Lucene. With
Nutch becoming a TLP, Droids would be much more a Nutch subproject than one
of the aforementioned.

I see the essence of this thread, and the current reality, as moving
functionality away from Nutch. Tika is the attempt to use the parser
functionality outside of Nutch. Droids uses Tika for parsing, and even
so I would welcome Tika splitting into separate parser parts to reduce
dependencies in Droids.

Hearing that Droids could become Nutch's standard fetcher/crawler is
really exciting. I invite everyone to join the Droids mailing list to make
this happen. Droids is, like Tika, an attempt to use the crawling
facility outside of Nutch.

Nutch's core competences are:
- indexing
- searching

where the focus is on:
- make it happen in the cloud
- on a BIG scale
- with millions of slaves

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source consulting, training and solutions



Re: The Future of Nutch

2009-03-27 Thread Bradford Stephens
Hey there,

Just chiming in that we use the complete Nutch + Hadoop + Lucene stack
-- we download pages, index them for keywords, and then do heavy
Semantic Parsing on it to produce BI data. We also use a lot of
plug-ins for parsing and ranking information.

What we don't use is the 'built-in GUI search' ability... but besides
that, the core of our business is evolving around Nutch :)

2009/3/20 Doğacan Güney doga...@gmail.com:
 Hi,

 On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:
 With the release of Nutch 1.0 I think it is a good time to begin a
 discussion about the future of Nutch.  Here are some things to consider, and
 I would love to hear everyone's views on this.

 Nutch's original intention was as a large-scale www search engine.  That is
 a very specific goal.  Only a few people and organizations actually use it
 on that level.  (I just happen to be one of them as most of my work focuses
 on large scale web search as opposed to vertical search). Many, perhaps
 most, people using Nutch these days are either using parts of Nutch, such as
 the crawler, or are targeting towards vertical or intranet type search
 engines.  This can be seen in how many people have already started using the
 Solr integration features.  So while Nutch was originally intended as a www
 search, IMO most people aren't using it for that purpose.

 Since there are different purposes for different users, would it be good to
 consider moving Nutch to a top level apache project out from under the
 Lucene umbrella?  This would then allow the creation of nutch sub-projects,
 such as nutch-solr, nutch-hbase.  Thoughts?

 Many parts of Nutch have also been implemented in other projects.  For
 example, Tika for the parsers, Droids for the Crawler.  It begs the question
 what Nutch's core features are going forward.  When I think about search
 (again my perspective is large scale), I think crawling or acquisition of
 data, parsing, analysis, indexing, deployment, and searching.  I personally
 think that there is much room for improvement in crawling and especially
 analysis.  Nutch shouldn't just be about the shell but also the brains.


 I think nutch-solr and nutch-hbase should be in one unified project :)

 I can understand the difficulty (for newcomers) if we start depending
 on too many external projects. It would certainly be confusing
 to have to start a solr server then hbase master/slaves just to be
 able to crawl one intranet website locally. On the other hand,
 if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings,
 I am worried we will have to create a waaay too generic interface
 to deal with them and not reap the advantages of using solr over
 lucene and hbase over hadoop. Also, more backends possibly
 mean more bugs and more integration problems.

 So I think delegating nutch functionality to other projects
 (tika/droids/solr/etc)
 is a great idea (so nutch can focus on the brains as Dennis said), but
 I don't like the idea of separating nutch into pieces.

 So I guess for a small vertical search engine, it may seem unnecessary
 to also deal with solr/etc, but as long as we have good documentation*,
 they are not that difficult to handle. And they don't have a large performance
 memory overhead.

 About vertical/large-scale search engine split: I guess a good example here
 is Dennis' FieldIndexer work. It is much more flexible for people who want
 to extend nutch's indexing architecture, but maybe overkill for people (and
 I am not convinced that it is) wanting to run vintage nutch on a small-scale.
 I, again, don't like splitting nutch into two(or three, four...) parts
 like this. But
 I think having different crawl paths for different users is much more 
 manageable
 than having different architectures. So we always use solr/hbase/etc. as our
 architecture. But you can run a one-job indexer if you want or run 
 FieldIndexer.
 You can use the on-the-fly scoring scheme or you use page rank/other complex
 offline scoring schemes.

 And one of the biggest things I see is many newcomers to nutch have a very
 hard time getting started.  Part of this is understanding mapreduce
 mentality, part is documentation, part is there is only so much time some of
 us have to answer questions so some questions go unanswered on the lists.
  How might this be improved going forward?


 Docs, docs, docs :D

 Any other thoughts also welcome.  Really I want to start a discussion about
 where everyone thinks we are with the state of Nutch and its future.


 Thanks for starting the discussion Dennis.

 Dennis



 * And we don't have good documentation right now (and I am much
 to blame for it:). I think this should be an explicit goal for us in the
 future. I am thinking something like no major features without documentation
 in the wiki.



 --
 Doğacan Güney



Re: The Future of Nutch

2009-03-20 Thread Mattmann, Chris A
Guys,

I thought I'd chime in here. I don't have a lot of time tonight (long day
out here in California), but perhaps I can add more thoughts tomorrow.

My +1 for moving Nutch into a TLP. With a 1.0 release, and several prior
releases (~10), I think that the discussion is reasonable. I also tend to
agree with Dennis's view regarding it being a positive thing to have a Nutch
PMC. The project has been around since 2005, and whether or not activity has
slowed of late, there are still folks who are actively interested in Nutch
and use it operationally day to day, myself included.

That said, I would like to revisit some of the ideas about the Next
Generation Nutch discussion:

http://markmail.org/message/mcnbgg7uf54snf55#query:next%20generation%20nutch%20mattmann+page:1+mid:ofk3ob3hv4djmrmn+state:results

And use this as a spring board for some of the things we should really think
about if we make Nutch a TLP. IMHO, these ideas really justify Nutch as a
TLP because we:

1. have a 1.0 release (and several official 0.x releases and 0.x.y
patch releases)
2. have the system in real-world operations
3. have a plan going forward for a next gen or 2.0 architecture

As for Nutch being an integration platform for existing Lucene components, I
think that Nutch should certainly make use of existing functionality where
it makes sense (Tika, Solr, etc.), but we really need to take a hard look at
insulating the core POJO model of Nutch (Brin and Page paper here folks, I'm
talking the Anatomy of a Large-Scale Hypertextual Web Search Engine) from
the underlying technology substrate. That would be on my list of top
goals for Nutch as a TLP. In fact, even thinking about this, I think it
lends itself very nicely to a category of sub-projects (e.g., Nutch-Hadoop,
Nutch-JMS, etc.) to think about from a TLP perspective.

Anyways, just wanted to chime in. I'll add more tomorrow.

Thanks,
Chris





++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





Re: The Future of Nutch

2009-03-20 Thread Doğacan Güney
Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:
 With the release of Nutch 1.0 I think it is a good time to begin a
 discussion about the future of Nutch.  Here are some things to consider, and
 I would love to hear everyone's views on this.

 Nutch's original intention was as a large-scale www search engine.  That is
 a very specific goal.  Only a few people and organizations actually use it
 on that level.  (I just happen to be one of them as most of my work focuses
 on large scale web search as opposed to vertical search). Many, perhaps
 most, people using Nutch these days are either using parts of Nutch, such as
 the crawler, or are targeting towards vertical or intranet type search
 engines.  This can be seen in how many people have already started using the
 Solr integration features.  So while Nutch was originally intended as a www
 search, IMO most people aren't using it for that purpose.

 Since there are different purposes for different users, would it be good to
 consider moving Nutch to a top level apache project out from under the
 Lucene umbrella?  This would then allow the creation of nutch sub-projects,
 such as nutch-solr, nutch-hbase.  Thoughts?

 Many parts of Nutch have also been implemented in other projects.  For
 example, Tika for the parsers, Droids for the Crawler.  It begs the question
 of what Nutch's core features are going forward.  When I think about search
 (again my perspective is large scale), I think crawling or acquisition of
 data, parsing, analysis, indexing, deployment, and searching.  I personally
 think that there is much room for improvement in crawling and especially
 analysis.  Nutch shouldn't just be about the shell but also the brains.


I think nutch-solr and nutch-hbase should be in one unified project :)

I can understand the difficulty (for newcomers) if we start depending
on too many external projects. It would certainly be confusing
to have to start a solr server then hbase master/slaves just to be
able to crawl one intranet website locally. On the other hand,
if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings,
I am worried we will have to create a waaay too generic interface
to deal with them and not reap the advantages of using solr over
lucene and hbase over hadoop. Also, more backends possibly
mean more bugs and more integration problems.

So I think delegating nutch functionality to other projects
(tika/droids/solr/etc)
is a great idea (so nutch can focus on the brains as Dennis said), but
I don't like the idea of separating nutch into pieces.

So I guess for a small vertical search engine, it may seem unnecessary
to also deal with solr/etc, but as long as we have good documentation*,
they are not that difficult to handle. And they don't have a large performance
or memory overhead.

About vertical/large-scale search engine split: I guess a good example here
is Dennis' FieldIndexer work. It is much more flexible for people who want
to extend nutch's indexing architecture, but maybe overkill for people (and
I am not convinced that it is) wanting to run vintage nutch on a small-scale.
I, again, don't like splitting nutch into two (or three, four...) parts
like this. But I think having different crawl paths for different users is
much more manageable than having different architectures. So we always use
solr/hbase/etc. as our architecture. But you can run a one-job indexer if you
want or run FieldIndexer. You can use the on-the-fly scoring scheme or you can
use page rank/other complex offline scoring schemes.

 And one of the biggest things I see is many newcomers to nutch have a very
 hard time getting started.  Part of this is understanding mapreduce
 mentality, part is documentation, part is there is only so much time some of
 us have to answer questions so some questions go unanswered on the lists.
  How might this be improved going forward?


Docs, docs, docs :D

 Any other thoughts also welcome.  Really I want to start a discussion about
 where everyone thinks we are with the state of Nutch and its future.


Thanks for starting the discussion Dennis.

 Dennis



* And we don't have good documentation right now (and I am much
to blame for it:). I think this should be an explicit goal for us in the
future. I am thinking something like no major features without documentation
in the wiki.



-- 
Doğacan Güney


Re: The Future of Nutch

2009-03-18 Thread Alex Basa

I actually use Nutch as a large scale search engine on two products.  I think a
few things that would be nice to have are built-in options to produce an
incremental index and maybe a Quartz scheduler to automate it completely.

One thing that would be nice is that when one of us figures something out, like
doing an incremental index, we would create a document and post it to the wiki.
Documentation has been one of the big hurdles for me.

Thanks for all your hard work and I hope to contribute to the project soon.

Alex



  


Re: The Future of Nutch

2009-03-17 Thread Marc Boucher
Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing
with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
Before I get into my thoughts on what direction things should take I
would like to offer a thought on why Nutch is not as active as other
groups.

I think in part it's because of what Nutch represents, and that's the
ability to create a large scale search. Some developers would rather
use Nutch and associated tools and keep quiet about it because of
their goals, which in some cases might mean competing against the likes
of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to
compete with those companies on large scale search but I can see
competition in the vertical markets. And while Solr is hot these days
it's intended primarily for the enterprise market which is very
different than the large scale and vertical markets.

Now on to the future. I agree with many of the thoughts Otis put forward.

While Nutch has its problems, other than Heritrix there is no other
open source system available and Nutch's ability to perform web-wide
crawls must be preserved. However I'm thinking we should have a modular
approach to Nutch. For instance, why just one fetcher? Why not keep
the current one but also allow for the possibility of using Droids?
Parsing can and should include Tika. I'm not sure about outsourcing
indexing and searching to Solr but that could be a modular option as
well.

I'm not sure if Nutch should become a top level project and move out
from under Lucene. Lucene has great visibility, and for many reasons.
If Nutch was moved, would it still attract enough attention? It's been
noted that developer interest in Nutch is different from that in Lucene,
Solr etc. On the other hand it might do Nutch good to go TLP as maybe then
it would attract more developers especially if it was packaged
differently.

My thoughts. And hopefully in the near future my small team will be
able to contribute to Nutch in a meaningful way.

Marc Boucher
http://hyperix.com


Re: The Future of Nutch

2009-03-17 Thread Dennis Kubes

Marc,

Glad you responded.  Always good to hear people's thoughts.

Marc Boucher wrote:

Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing
with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
Before I get into my thoughts on what direction things should take I
would like to offer a thought on why Nutch is not as active as other
groups.

I think in part it's because of what Nutch represents, and that's the
ability to create a large scale search. Some developers would rather
use Nutch and associated tools and keep quiet about it because of
their goals, which in some cases might mean competing against the likes
of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to
compete with those companies on large scale search but I can see
competition in the vertical markets. And while Solr is hot these days
it's intended primarily for the enterprise market which is very
different than the large scale and vertical markets.


I completely agree.  The group of people/companies that are creating
large scale search solutions, whether whole web or vertical, is much
smaller than, say, enterprise search or even the potential uses for Hadoop.




Now on to the future. I agree with many of the thoughts Otis put forward.

While Nutch has its problems, other than Heritrix there is no other
open source system available and Nutch's ability to perform web-wide
crawls must be preserved. However I'm thinking we should have a modular
approach to Nutch. For instance, why just one fetcher? Why not keep
the current one but also allow for the possibility of using Droids?
Parsing can and should include Tika. I'm not sure about outsourcing
indexing and searching to Solr but that could be a modular option as
well.


Yup.  It should IMO also be easy to install and configure.  I was having
a discussion today where the main topic was: could we make Nutch have a
nice graphical web interface for configuration, where you could drop it
in, change some options, and create a customized vertical search over x
domains?




I'm not sure if Nutch should become a top level project and move out
from under Lucene. Lucene has great visibility, and for many reasons.
If Nutch was moved, would it still attract enough attention? It's been
noted that developer interest in Nutch is different from that in Lucene,
Solr etc. On the other hand it might do Nutch good to go TLP as maybe then
it would attract more developers especially if it was packaged
differently.


Part of this is about releases.  Currently releases are voted on by 
Lucene PMC members and it takes 3 members to confirm a vote.  There are 
only 2 Nutch committers on the Lucene PMC.  So for releases, not that we 
have had many recently, other Lucene PMC members who may not be actively 
associated with Nutch would need to vote to release.  If Nutch was a TLP 
there would be a Nutch PMC which would most likely include all current 
Nutch committers.  The other option may be to add another Nutch committer
to the Lucene PMC.




My thoughts. And hopefully in the near future my small team will be
able to contribute to Nutch in a meaningful way.


Any and every contribution is welcome.

Dennis




Re: The Future of Nutch

2009-03-17 Thread Marc Boucher
Dennis,

That adds another dimension to the issue which I had not considered.
One avenue as you suggest would be to add another committer to the
Lucene PMC. If that does not work then maybe going the route of TLP is
the best option.

Marc


 Part of this is about releases.  Currently releases are voted on by Lucene
 PMC members and it takes 3 members to confirm a vote.  There are only 2
 Nutch committers on the Lucene PMC.  So for releases, not that we have had
 many recently, other Lucene PMC members who may not be actively associated
 with Nutch would need to vote to release.  If Nutch was a TLP there would be
 a Nutch PMC which would most likely include all current Nutch committers.
 The other option may be to add another Nutch committer to the Lucene PMC.


 My thoughts. And hopefully in the near future my small team will be
 able to contribute to Nutch in a meaningful way.

 Any and every contribution is welcome.

 Dennis



Re: The Future of Nutch

2009-03-16 Thread Otis Gospodnetic

Hello,

 
Comments inlined.

- Original Message 
 From: Dennis Kubes ku...@apache.org
 To: nutch-user@lucene.apache.org
 Sent: Friday, March 13, 2009 8:19:37 PM
 
 With the release of Nutch 1.0 I think it is a good time to begin a discussion
 about the future of Nutch.  Here are some things to consider, and I would love
 to hear everyone's views on this.
 
 Nutch's original intention was as a large-scale www search engine.  That is a
 very specific goal.  Only a few people and organizations actually use it on
 that level.  (I just happen to be one of them as most of my work focuses on
 large scale web search as opposed to vertical search).

Yes, there are fewer parties doing large scale web crawling.  Still, as there 
is no alternative fetcher+parser+indexer+searcher capable of handling large 
scale deployments like Nutch (or maybe Heritrix has the same scaling 
capabilities?), I think Nutch's ability to perform web-wide crawls, etc. should 
be preserved.

 Many, perhaps most, people using Nutch these days are either using parts of
 Nutch, such as the crawler, or are targeting towards vertical or intranet type
 search engines.  This can be seen in how many people have already started
 using the Solr integration features.  So while Nutch was originally intended
 as a www search, IMO most people aren't using it for that purpose.


That's my experience, too.  I think we can have both under the same Nutch roof.

 Since there are different purposes for different users, would it be good to 
 consider moving Nutch to a top level apache project out from under the Lucene 
 umbrella?  This would then allow the creation of nutch sub-projects, such as 
 nutch-solr, nutch-hbase.  Thoughts?


I disagree, at least in the near term.  There is nothing preventing those 
sub-projects existing under Nutch today.  Both Solr and Lucene have the contrib 
area where similar sub-projects live.  I think it's not a matter of being a 
TLP, but rather attracting enough developer interest, then user interest, and 
then contributor interest, so that these sub-projects can be created, 
maintained, advanced.  Right now, Solr gets a TON of attention, as does Lucene. 
 Nutch gets the least developer attention, and for some reason the nutch-user 
subscribers feel a bit different from solr-user or java-user subscribers.

 Many parts of Nutch have also been implemented in other projects.  For
 example, Tika for the parsers, Droids for the Crawler.  It begs the question
 of what Nutch's core features are going forward.  When I think about search
 (again my perspective is large scale), I think crawling or acquisition of
 data, parsing, analysis, indexing, deployment, and searching.  I personally
 think that there is much room for improvement in crawling and especially
 analysis.  Nutch shouldn't just be about the shell but also the brains.


My feeling has long been that indexing and searching should be outsourced to 
Solr, parsing to Tika, and that the fetcher should probably be replaced with 
Droids.  I say probably because I'm not very familiar with Droids just yet.  
Nutch should, I think, then be an application built with all those components 
combined (is that what you mean by the shell?), and then apply its knowledge of 
either web-wide scale trickery, or vertical SE trickery, or ...  I think that's 
where the brains are needed, to tie it all together, while still making certain 
pieces swappable and more easily digestible by potential new contributors and 
developers, as well as users.  I know plugins do some of that already, but it
seems like there might still be more in the core than there should/could be...

 And one of the biggest things I see is many newcomers to nutch have a very
 hard time getting started.  Part of this is understanding mapreduce mentality,
 part is documentation, part is there is only so much time some of us have to
 answer questions so some questions go unanswered on the lists.  How might this
 be improved going forward?

I am not 100% sure, but I think it's a bit of all of the above.  Lucene has 
been around for 10 years and from day one had people answer questions from the 
most basic ones to the trickiest ones.  It's the same with Solr today.  Nutch 
has the least active and the smallest developer base, so questions don't get 
answered.  Again, people on this list also tend to have a different style of 
asking questions - no hellos, no thank yous, and so on, which doesn't help.


I think the existence of a book on Lucene helped Lucene, but Solr doesn't yet 
have a book, yet it still has a healthy developer and user community.  I think 
that's because Solr is simply more needed by more people than Nutch is.

 Any other thoughts also welcome.  Really I want to start a discussion about 
 where everyone thinks we are with the state of Nutch and its future.


I think it's good you started this discussion.  My opinion about what needs to 
be done with Nutch is above.  I 

Re: The Future of Nutch

2009-03-16 Thread Tony Wang
I just wish there could be some clear documentation for Nutch/Solr
integration publicly available. Or are some developers already working on
this?
- Tony

On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:


 Hello,


 Comments inlined.

 - Original Message 
  From: Dennis Kubes ku...@apache.org
  To: nutch-user@lucene.apache.org
  Sent: Friday, March 13, 2009 8:19:37 PM
 
  With the release of Nutch 1.0 I think it is a good time to begin a
  discussion about the future of Nutch.  Here are some things to consider,
  and I would love to hear everyone's views on this.
 
  Nutch's original intention was as a large-scale www search engine.  That
  is a very specific goal.  Only a few people and organizations actually use
  it on that level.  (I just happen to be one of them as most of my work
  focuses on large scale web search as opposed to vertical search).

 Yes, there are fewer parties doing large scale web crawling.  Still, as
 there is no alternative fetcher+parser+indexer+searcher capable of handling
 large scale deployments like Nutch (or maybe Heritrix has the same scaling
 capabilities?), I think Nutch's ability to perform web-wide crawls, etc.
 should be preserved.

  Many, perhaps most, people using Nutch these days are either using parts
  of Nutch, such as the crawler, or are targeting towards vertical or intranet
  type search engines.  This can be seen in how many people have already
  started using the Solr integration features.  So while Nutch was originally
  intended as a www search, IMO most people aren't using it for that purpose.


 That's my experience, too.  I think we can have both under the same Nutch
 roof.

  Since there are different purposes for different users, would it be good
  to consider moving Nutch to a top level apache project out from under the
  Lucene umbrella?  This would then allow the creation of nutch sub-projects,
  such as nutch-solr, nutch-hbase.  Thoughts?


 I disagree, at least in the near term.  There is nothing preventing those
 sub-projects existing under Nutch today.  Both Solr and Lucene have the
 contrib area where similar sub-projects live.  I think it's not a matter of
 being a TLP, but rather attracting enough developer interest, then user
 interest, and then contributor interest, so that these sub-projects can be
 created, maintained, advanced.  Right now, Solr gets a TON of attention, as
 does Lucene.  Nutch gets the least developer attention, and for some reason
 the nutch-user subscribers feel a bit different from solr-user or
 java-user subscribers.

  Many parts of Nutch have also been implemented in other projects.  For
  example, Tika for the parsers, Droids for the Crawler.  It begs the
  question what are Nutch's core features going forward.  When I think
  about search (again my perspective is large scale), I think crawling or
  acquisition of data, parsing, analysis, indexing, deployment, and
  searching.  I personally think that there is much room for improvement
  in crawling and especially analysis.  Nutch shouldn't just be about the
  shell but also the brains.


 My feeling has long been that indexing and searching should be outsourced
 to Solr, parsing to Tika, and that the fetcher should probably be replaced
 with Droids.  I say probably because I'm not very familiar with Droids just
 yet.  Nutch should, I think, then be an application built with all those
 components combined (is that what you mean by the shell?), and then apply
 its knowledge of either web-wide scale trickery, or vertical SE trickery, or
 ...  I think that's where the brains are needed, to tie it all together,
 while still making certain pieces swappable and more easily digestible by
 potential new contributors and developers, as well as users.  I know plugins
 do some of that already, but it seems like there might still be more in the
 core than there should/could be...

  And one of the biggest things I see is many newcomers to nutch have a
  very hard time getting started.  Part of this is understanding mapreduce
  mentality, part is documentation, part is there is only so much time
  some of us have to answer questions so some questions go unanswered on
  the lists.  How might this be improved going forward?

 I am not 100% sure, but I think it's a bit of all of the above.  Lucene has
 been around for 10 years and from day one had people answer questions from
 the most basic ones to the trickiest ones.  It's the same with Solr today.
 Nutch has the least active and the smallest developer base, so questions
 don't get answered.  Again, people on this list also tend to have a
 different style of asking questions - no hellos, no thank yous, and so on,
 which doesn't help.


 I think the existence of a book on Lucene helped Lucene, but Solr doesn't
 yet have a book, yet it still has a healthy developer and user community.  I
 think that's because Solr is simply more needed by more people than Nutch
 is.

  Any 

Re: The Future of Nutch

2009-03-14 Thread yanky young
Hi:

I also agree that most usage scenarios of Nutch are in the vertical search
area, and in some unusual cases users may not even use Nutch indexing at
all; they just crawl some pages for mirroring purposes. And in some cases of
vertical search, users only need a fraction of the pages, e.g. house rental
info or restaurant info. So how about distributing Nutch as components, so
that the crawler can be used independently, without indexing? That's actually
what Droids is trying to address, but Nutch can also do it in a more scalable
way.

Another point is that, if Nutch needs to be more easily customized for
special cases such as vertical search, a new ranking mechanism must be
introduced. tf/idf just cannot work. Maybe a machine learning scheme such as
a text classifier can be employed.

It would be great for Nutch to be a top level Apache project, because
subprojects for special cases could be created for easier customization.

I have also seen posts about using Spring as a component assembly framework
for Nutch. Maybe it could be created as a subproject for Spring users.

just my 2 cents

good luck

yanky

2009/3/14 buddha1021 buddha1...@yahoo.cn


 hi dennis:

 Nutch's original intention was as a large-scale www search engine. 
 I very much agree with you, Dennis! Nutch's goal is specifically to
 process large-scale data, like Google does. There is no doubt
 that Nutch will be a www search engine, and absolutely not a
 vertical search!

 I am confident that Hadoop can process the large data volumes of a www
 search engine. But Lucene? I am afraid that the size of Lucene's index per
 server is very limited: 10G? Or 30G? This is not enough for a www search
 engine. IMO, this is a bottleneck!

 How many pages does Visvo search currently? 100 million? Or 1000 million?

 IMO, it would be very good to move Nutch to a top level Apache project,
 out from under the Lucene umbrella!

 But all the sub-projects of Nutch should be active enough; if not, Nutch's
 development will be slow, and that is no good for Nutch's unity.

 So the number of sub-projects should be small,
 and the sub-projects should be active, efficient and also strong enough!

 Good luck !



 Dennis Kubes-2 wrote:
 
  With the release of Nutch 1.0 I think it is a good time to begin a
  discussion about the future of Nutch.  Here are some things to consider
  and would love to hear everyone's views on this
 
  Nutch's original intention was as a large-scale www search engine.  That
  is a very specific goal.  Only a few people and organizations actually
  use it on that level.  (I just happen to be one of them as most of my
  work focuses on large scale web search as opposed to vertical search).
  Many, perhaps most, people using Nutch these days are either using parts
  of Nutch, such as the crawler, or are targeting towards vertical or
  intranet type search engines.  This can be seen in how many people have
  already started using the Solr integration features.  So while Nutch was
  originally intended as a www search, IMO most people aren't using it for
  that purpose.
 
  Since there are different purposes for different users, would it be good
  to consider moving Nutch to a top level apache project out from under
  the Lucene umbrella?  This would then allow the creation of nutch
  sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?
 
  Many parts of Nutch have also been implemented in other projects.  For
  example, Tika for the parsers, Droids for the Crawler.  It begs the
  question what are Nutch's core features going forward.  When I think
  about search (again my perspective is large scale), I think crawling or
  acquisition of data, parsing, analysis, indexing, deployment, and
  searching.  I personally think that there is much room for improvement
  in crawling and especially analysis.  Nutch shouldn't just be about the
  shell but also the brains.
 
  And one of the biggest things I see is many newcomers to nutch have a
  very hard time getting started.  Part of this is understanding mapreduce
  mentality, part is documentation, part is there is only so much time
  some of us have to answer questions so some questions go unanswered on
  the lists.  How might this be improved going forward?
 
  Any other thoughts also welcome.  Really I want to start a discussion
  about where everyone thinks we are with the state of Nutch and its
 future.
 
  Dennis
 
 
 





Re: The Future of Nutch

2009-03-14 Thread consultas
I have been using Nutch for more than four years now as a vertical search
engine, having indexed, at times, over one million pages.  On the other hand,
I don't know anything about programming or specialized applications.  Words
like Solr and others are alien to me.  I am just interested in a
search engine that someone can really use, and not an application that
serves as a base for developing sophisticated models.
So what I personally want for the future of Nutch is that it does not turn
into such a complicated application that only some very skilled people can
use.  I hope that Nutch always keeps an eye on the real users, who want it
for plain searching.

Thanks



- Original Message - 
From: Dennis Kubes ku...@apache.org

To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 9:19 PM
Subject: The Future of Nutch



With the release of Nutch 1.0 I think it is a good time to begin a
discussion about the future of Nutch.  Here are some things to consider
and would love to hear everyone's views on this

Nutch's original intention was as a large-scale www search engine.  That
is a very specific goal.  Only a few people and organizations actually
use it on that level.  (I just happen to be one of them as most of my
work focuses on large scale web search as opposed to vertical search).
Many, perhaps most, people using Nutch these days are either using parts
of Nutch, such as the crawler, or are targeting towards vertical or
intranet type search engines.  This can be seen in how many people have
already started using the Solr integration features.  So while Nutch was
originally intended as a www search, IMO most people aren't using it for
that purpose.

Since there are different purposes for different users, would it be good
to consider moving Nutch to a top level apache project out from under
the Lucene umbrella?  This would then allow the creation of nutch
sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?

Many parts of Nutch have also been implemented in other projects.  For
example, Tika for the parsers, Droids for the Crawler.  It begs the
question what are Nutch's core features going forward.  When I think
about search (again my perspective is large scale), I think crawling or
acquisition of data, parsing, analysis, indexing, deployment, and
searching.  I personally think that there is much room for improvement
in crawling and especially analysis.  Nutch shouldn't just be about the
shell but also the brains.

And one of the biggest things I see is many newcomers to nutch have a
very hard time getting started.  Part of this is understanding mapreduce
mentality, part is documentation, part is there is only so much time
some of us have to answer questions so some questions go unanswered on
the lists.  How might this be improved going forward?

Any other thoughts also welcome.  Really I want to start a discussion
about where everyone thinks we are with the state of Nutch and its future.

Dennis












Re: The Future of Nutch

2009-03-14 Thread John Martyniak
I think that this would be the case for making Nutch a top level
Apache Project, so that you can publish the framework and a complete
app but still tie it all together.  Personally I think that is
the strength of Nutch: you can use it right out of the box,
without programming, but all of the extensibility (customization) is
there so that you can extend it if you so desire.


-John

On Mar 14, 2009, at 9:44 AM, consultas wrote:

I have been using Nutch for more than four years now as a vertical search
engine, having indexed, at times, over one million pages.  On the
other hand, I don't know anything about programming or
specialized applications.  Words like Solr and others are alien
to me.  I am just interested in a search engine that someone can
really use, and not an application that serves as a base for
developing sophisticated models.
So what I personally want for the future of Nutch is that it does
not turn into such a complicated application that only some very
skilled people can use.
I hope that Nutch always keeps an eye on the real users, who
want it for plain searching.

Thanks



- Original Message - From: Dennis Kubes ku...@apache.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 9:19 PM
Subject: The Future of Nutch



With the release of Nutch 1.0 I think it is a good time to begin a
discussion about the future of Nutch.  Here are some things to consider
and would love to hear everyone's views on this

Nutch's original intention was as a large-scale www search engine.  That
is a very specific goal.  Only a few people and organizations actually
use it on that level.  (I just happen to be one of them as most of my
work focuses on large scale web search as opposed to vertical search).
Many, perhaps most, people using Nutch these days are either using parts
of Nutch, such as the crawler, or are targeting towards vertical or
intranet type search engines.  This can be seen in how many people have
already started using the Solr integration features.  So while Nutch was
originally intended as a www search, IMO most people aren't using it for
that purpose.

Since there are different purposes for different users, would it be good
to consider moving Nutch to a top level apache project out from under
the Lucene umbrella?  This would then allow the creation of nutch
sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?

Many parts of Nutch have also been implemented in other projects.  For
example, Tika for the parsers, Droids for the Crawler.  It begs the
question what are Nutch's core features going forward.  When I think
about search (again my perspective is large scale), I think crawling or
acquisition of data, parsing, analysis, indexing, deployment, and
searching.  I personally think that there is much room for improvement
in crawling and especially analysis.  Nutch shouldn't just be about the
shell but also the brains.

And one of the biggest things I see is many newcomers to nutch have a
very hard time getting started.  Part of this is understanding mapreduce
mentality, part is documentation, part is there is only so much time
some of us have to answer questions so some questions go unanswered on
the lists.  How might this be improved going forward?

Any other thoughts also welcome.  Really I want to start a discussion
about where everyone thinks we are with the state of Nutch and its
future.


Dennis














Re: The Future of Nutch

2009-03-13 Thread John Martyniak

Dennis,

I am with you; I am building a large scale www search engine, but I
might also build a vertical search as well.  Aren't the requirements
for building a large scale www search the same as for building a
vertical www search?  The only thing that seems to change is the scope.


I like the idea of making Nutch work with multiple types of crawlers
(maybe a crawler plugin kind of thing).  I have looked at Droids and it
seems interesting.


Regarding the SOLR integration, I am not sure that I agree with you on that
point, as I have considered using the SOLR integration for my WWW
index.  The main reason is that SOLR seems to have stronger
search engine features at this point, like faceting, collapsing,
synonyms, spelling, etc., but Nutch clearly has crawling and processing
large amounts of data into an index down pat.


Regarding the MapReduce, if it is good enough for Google, then it is  
good enough for Nutch.


I think that if you segment Nutch into too many sub-projects you lose
the flexibility or ability to have a single good, solid, scalable
search engine.


Just my 2 cents.

-John


On Mar 13, 2009, at 6:19 PM, Dennis Kubes wrote:

With the release of Nutch 1.0 I think it is a good time to begin a  
discussion about the future of Nutch.  Here are some things to  
consider and would love to hear everyone's views on this


Nutch's original intention was as a large-scale www search engine.   
That is a very specific goal.  Only a few people and organizations  
actually use it on that level.  (I just happen to be one of them as  
most of my work focuses on large scale web search as opposed to  
vertical search). Many, perhaps most, people using Nutch these days  
are either using parts of Nutch, such as the crawler, or are  
targeting towards vertical or intranet type search engines.  This  
can be seen in how many people have already started using the Solr  
integration features.  So while Nutch was originally intended as a  
www search, IMO most people aren't using it for that purpose.


Since there are different purposes for different users, would it be  
good to consider moving Nutch to a top level apache project out from  
under the Lucene umbrella?  This would then allow the creation of  
nutch sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?


Many parts of Nutch have also been implemented in other projects.   
For example, Tika for the parsers, Droids for the Crawler.  It begs
the question what are Nutch's core features going forward.  When I
think about search (again my perspective is large scale), I think  
crawling or acquisition of data, parsing, analysis, indexing,  
deployment, and searching.  I personally think that there is much  
room for improvement in crawling and especially analysis.  Nutch  
shouldn't just be about the shell but also the brains.


And one of the biggest things I see is many newcomers to nutch have  
a very hard time getting started.  Part of this is understanding  
mapreduce mentality, part is documentation, part is there is only so  
much time some of us have to answer questions so some questions go  
unanswered on the lists.  How might this be improved going forward?


Any other thoughts also welcome.  Really I want to start a  
discussion about where everyone thinks we are with the state of  
Nutch and its future.


Dennis