Re: The Future of Nutch, reactivated
Hi,

I am joining the conversation a bit late, but never mind...

In my view the main target should be (2). As you pointed out, SOLR covers (3) and (4) quite well (or will progressively do so). As for (1), there is definitely an audience, even if it is small, and it would certainly benefit from the work done towards (2). As you said, operating on a large scale (i.e. using more than 100 slaves) requires a lot of resources and a dedicated team, and I expect that the people interested in large scale would have their own views on scoring and spam prevention anyway :-)

I completely agree that there should be as much delegation of functionality to third parties as possible (e.g. parsing with Tika) in order to focus on the core competences. I really like your idea of doing template detection, for instance. Another thing I find promising is the HBase integration (NUTCH-650), which would also allow more interoperability with other tools such as Heritrix and make the data structure a bit more open.

Talking about future functionality, we do quite a lot of text analysis with tools like Gate or UIMA and have been working on things such as detection of adult content and automatic text classification with Nutch. There are plenty of interesting things that can be done for vertical search systems, such as Named Entity Extraction etc. Since NLP applications can be quite greedy, leveraging Hadoop is definitely an advantage.

I'd love to see in future versions of Nutch a separation between format parsing (i.e. Tika) and content analysis, where implementations would get a semi-structured representation of the documents, a bit like what extensions of HTML parsers currently get, but regardless of the original format.

Have a good weekend,
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: The Future of Nutch, reactivated
I'm still a new user, so although I found it rather easy to get going and build my own plugins, I have some suggestions.

One thing that I'd like to see is a way to estimate how long a certain step (fetch, ...) will take, something like a progress bar. You launch a step and it can run for days, working perfectly, yet you still have no idea when it might eventually end.

I find the web front end rather difficult to change, and I lost a lot of time understanding how the NutchBean works. Coming from Lucene, it took me a while to find out all the limitations it has. I haven't played much with the Nutch-Solr integration, but from the sound of it, it looks more powerful; whether it is simpler is my concern.

-Raymond-
Re: The Future of Nutch, reactivated
Keep it simple. Many people, it seems to me, use Nutch to exercise, in some way, their programming expertise and talents. I am just a user, and I think that users just want something that can index the web and find results when they search. I don't want to deal with complicated application names; I just want to crawl and search. And it should be noted that for most users, myself included, it is not a trivial job to get Nutch working, on Linux or Windows.

Anyway, I think that the biggest use for Nutch will be for vertical or regional search purposes. By the way, from this point of view I really didn't like the experience with the original release of version 1.0: the crawling phase was too slow.
Re: The Future of Nutch, reactivated
Andrzej, great summary.

I played with Nutch before for a web search engine, but have not used it for a while because it has become too complicated. Based on my experience in building a semantic search engine for the healthcare vertical, I think it would be beneficial to separate crawling from search architecturally and focus on just crawling for Nutch. My sense is that if Nutch can make crawling simple and deliver high-quality crawled content along with important metadata like link structure, it will have a much better chance to become an indispensable part of a search engine.

Of course, it's important to include an implementation for search as well, so that Nutch can provide end-to-end (i.e. crawl and search) results for evaluation. But don't get stuck in search, because there is a variety of different search needs, such as static search, dynamic search, real-time search, semantic search, etc. It's not easy to make Nutch meet all of these real-world needs. Rather, Nutch should provide the crawled content in a way that lets people easily apply different search tools or search technology.

As for the audience, it makes sense to focus on the middle of the usage spectrum, i.e. vertical search or focused search at mid-range scale. But I wouldn't ignore the small projects or developer projects, because this is often the starting point for new project evaluation.

-aj
--
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA
Re: The Future of Nutch, reactivated
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on similar threads from Otis and from Dennis. My personal pet projects for Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc. are insulated from the underlying messaging substrate (e.g., crawl over JMS, EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other framework, etc.)
* simpler Nutch deployment mechanisms (separate Nutch deployment package from source code package), think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris

On 5/14/09 6:45 AM, Andrzej Bialecki a...@getopt.org wrote:

Hi all,

I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying here to express this as it looks from my point of view.

Target audience
===============

I think that the Nutch project is experiencing a crisis of personality now - we are not sure what the target audience is, and we cannot satisfy everyone. I think that there are the following groups of Nutch users:

1. Large-scale Internet crawl search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 mln documents.

3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently IMHO Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.

What is the target audience that we as a community want to support? By this I mean not only moral support, but also active participation in the development process. From the place where we are at the moment we could go in any of the above directions.

Core competence
===============

This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGI or some other simple IOC container - probably there are other components that we don't have to do ourselves. Another thing that I'd love to delegate is the distributed search and index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.

* web graph analysis - this includes link-based ranking, mirror detection (and URL aliasing) but also link spam detection and a more complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than 1 direction at a time ... ;)

* we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.
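Editorial illustration: the "re-crawl scheduling" competence mentioned in the quoted message can be pictured with a minimal sketch. This is not Nutch's FetchSchedule API; the names `PageState` and `reschedule`, the rate parameters and the interval bounds are all hypothetical, chosen only to show the general idea of shortening the re-fetch interval for pages that changed and lengthening it for pages that did not.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    """Hypothetical per-URL scheduling record (not a Nutch class)."""
    url: str
    interval_s: float          # current re-fetch interval, in seconds
    next_fetch_s: float = 0.0  # timestamp of the next planned fetch


def reschedule(page, changed, now_s,
               inc_rate=0.4, dec_rate=0.2,
               min_interval_s=3600.0, max_interval_s=30 * 86400.0):
    """Return a new PageState with an adapted re-fetch interval.

    Assumed policy: shrink the interval by dec_rate when the page changed
    since the last fetch, grow it by inc_rate when it did not, and clamp
    the result to [min_interval_s, max_interval_s].
    """
    if changed:
        interval = page.interval_s * (1.0 - dec_rate)
    else:
        interval = page.interval_s * (1.0 + inc_rate)
    interval = max(min_interval_s, min(max_interval_s, interval))
    return PageState(page.url, interval, now_s + interval)
```

With a policy of this shape, a frequently updated portal page drifts towards the minimum interval, while a static page is revisited less and less often, which is one way to stay within the bandwidth and netiquette constraints the message mentions.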
Re: The Future of Nutch
On Wed, 2009-04-01 at 07:42 -0700, Ken Krugler wrote:

...

I would suggest looking at Katta (http://katta.sourceforge.net/). It's one of several projects where the goal is to support very large Lucene indexes via distributed shards. Solr has also added federated search support.

Interesting. Thanks for the link Ken.

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
Sociedad Andaluza para el Desarrollo de la Sociedad de la Información, S.A.U. (SADESI)
Re: The Future of Nutch
On Wed, Apr 1, 2009 at 17:42, Ken Krugler kkrugler_li...@transpac.com wrote:

On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:

hi dennis: ... I am confident that hadoop can process the large datas of the www search engine! But lucene? I am afraid of the limited size of lucene's index per server is very little, 10G? or 30G? this is not enough for the www search engine! IMO, this is a bottleneck!

I agree that the actual problem/solution of accessing lucene indexes is to keep them small. What does the possibility of having a clouded index serve if accessing it takes hours? For me here should lie one of nutch core competences: making search in BIG indexes fast (as fast as in SMALL indexes).

I would suggest looking at Katta (http://katta.sourceforge.net/). It's one of several projects where the goal is to support very large Lucene indexes via distributed shards. Solr has also added federated search support.

I agree. I think the new index framework should be flexible enough that we can support katta along with solr. Actually, this is one of the things I want to do before the next major release.

-- Ken
--
Ken Krugler
+1 530-210-6378

--
Doğacan Güney
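Editorial illustration: the idea of serving a very large index "via distributed shards" can be sketched as a scatter-gather search. This is not Katta's or Solr's API. The shard interface (a callable taking a query and `k`), the helper `make_shard` and the assumption that scores from different shards are directly comparable are all hypothetical simplifications; real systems have to deal with per-shard term statistics, timeouts and failures.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]                   # (score, document id)
Shard = Callable[[str, int], List[Hit]]   # hypothetical shard interface


def make_shard(postings: Dict[str, List[Hit]]) -> Shard:
    """Build a toy in-memory shard from a term -> hits mapping."""
    def search(query: str, k: int) -> List[Hit]:
        # Each shard returns only its own k best hits.
        return heapq.nlargest(k, postings.get(query, []), key=lambda h: h[0])
    return search


def search_shards(shards: List[Shard], query: str, k: int) -> List[Hit]:
    """Query every shard, then merge the partial results into a global top k.

    Assumes scores are comparable across shards, which is a simplification.
    """
    per_shard = [shard(query, k) for shard in shards]
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda h: h[0])
```

Because every shard returns at most `k` hits, the merge step handles at most `k` times the number of shards, regardless of how large each shard's index is. In a real deployment the per-shard calls would run in parallel on separate machines; here they run sequentially for brevity.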
Re: The Future of Nutch
On Fri, 2009-03-13 at 19:42 -0700, buddha1021 wrote:

hi dennis: ... I am confident that hadoop can process the large datas of the www search engine! But lucene? I am afraid of the limited size of lucene's index per server is very little, 10G? or 30G? this is not enough for the www search engine! IMO, this is a bottleneck!

I agree that the actual problem/solution of accessing lucene indexes is to keep them small. What does the possibility of having a clouded index serve if accessing it takes hours? For me here should lie one of nutch core competences: making search in BIG indexes fast (as fast as in SMALL indexes).

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source consulting, training and solutions
Re: The Future of Nutch
On Fri, 2009-03-20 at 11:55 +0200, Doğacan Güney wrote:

Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:

...

Since there are different purposes for different users, would it be good to consider moving Nutch to a top level apache project out from under the Lucene umbrella? This would then allow the creation of nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?

Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the Crawler. It begs the question what Nutch's core features are going forward. When I think about search (again my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

...

So I think delegating nutch functionality to other projects (tika/droids/solr/etc) is a great idea (so nutch can focus on the brains as Dennis said), but I don't like the idea of separating nutch into pieces.

I hoped to meet some nutch people at ApacheCon to talk about this mail. Droids is ATM incubating with 2 sponsor projects, HC and lucene. With nutch becoming a TLP, droids would be much more a nutch subproject than one of the before mentioned.

I see the essence of this thread and the current reality of moving functionality away from nutch. Tika is the attempt to use the parser functionality outside of nutch. Droids uses tika for parsing, and even so I would welcome tika splitting into different parser parts to reduce dependencies in droids.

Hearing that droids could become the nutch standard fetcher/crawler is really exciting. I invite everyone to join the droids mailing list to make this happen. Droids is, similar to tika, an attempt to use the crawling facility outside of nutch.

Nutch core competence is:
- indexing
- searching

where the focus is on:
- make it happen in the cloud
- on a BIG scale
- with millions of slaves

salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source consulting, training and solutions
Re: The Future of Nutch
Hey there,

Just chiming in that we use the complete Nutch + Hadoop + Lucene stack -- we download pages, index them for keywords, and then do heavy Semantic Parsing on them to produce BI data. We also use a lot of plug-ins for parsing and ranking information. What we don't use is the 'built-in GUI search' ability... but besides that, the core of our business is evolving around Nutch :)

2009/3/20 Doğacan Güney doga...@gmail.com:

Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:

With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.

Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them, as most of my work focuses on large scale web search as opposed to vertical search.)

Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting vertical or intranet type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.

Since there are different purposes for different users, would it be good to consider moving Nutch to a top level apache project out from under the Lucene umbrella? This would then allow the creation of nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?

Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the Crawler. It begs the question what Nutch's core features are going forward. When I think about search (again my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

I think nutch-solr and nutch-hbase should be in one unified project :)

I can understand the difficulty (for newcomers) if we start depending on too many external projects. It would certainly be confusing to have to start a solr server and then hbase master/slaves just to be able to crawl one intranet website locally.

On the other hand, if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings, I am worried we will have to create a waaay too generic interface to deal with them and not reap the advantages of using solr over lucene and hbase over hadoop. Also, more backends possibly mean more bugs and more integration problems.

So I think delegating nutch functionality to other projects (tika/droids/solr/etc) is a great idea (so nutch can focus on the brains as Dennis said), but I don't like the idea of separating nutch into pieces.

So I guess for a small vertical search engine, it may seem unnecessary to also deal with solr/etc, but as long as we have good documentation*, they are not that difficult to handle. And they don't have a large performance/memory overhead.

About the vertical/large-scale search engine split: I guess a good example here is Dennis' FieldIndexer work. It is much more flexible for people who want to extend nutch's indexing architecture, but maybe overkill for people (and I am not convinced that it is) wanting to run vintage nutch on a small scale. I, again, don't like splitting nutch into two (or three, four...) parts like this. But I think having different crawl paths for different users is much more manageable than having different architectures. So we always use solr/hbase/etc. as our architecture. But you can run a one-job indexer if you want, or run FieldIndexer. You can use the on-the-fly scoring scheme, or you can use page rank/other complex offline scoring schemes.

And one of the biggest things I see is that many newcomers to nutch have a very hard time getting started. Part of this is understanding the mapreduce mentality, part is documentation, part is that there is only so much time some of us have to answer questions, so some questions go unanswered on the lists. How might this be improved going forward?

Docs, docs, docs :D

Any other thoughts also welcome. Really I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.

Thanks for starting the discussion Dennis.

Dennis

* And we don't have good documentation right now (and I am much to blame for it :). I think this should be an explicit goal for us in the future. I am thinking of something like no major features without documentation in the wiki.

--
Doğacan Güney
Re: The Future of Nutch
Guys,

I thought I'd chime in here. I don't have a lot of time tonight (long day out here in California), but perhaps I can add more thoughts tomorrow.

My +1 for moving Nutch into a TLP. With a 1.0 release, and several prior releases (~10), I think that the discussion is reasonable. I also tend to agree with Dennis's view regarding it being a positive thing to have a Nutch PMC. The project has been around since 2005, and whether or not activity has slowed of late, there are still folks who are actively interested in Nutch and use it operationally day to day, myself included.

That said, I would like to revisit some of the ideas from the Next Generation Nutch discussion:

http://markmail.org/message/mcnbgg7uf54snf55#query:next%20generation%20nutch%20mattmann+page:1+mid:ofk3ob3hv4djmrmn+state:results

and use this as a springboard for some of the things we should really think about if we make Nutch a TLP. IMHO, these ideas really justify Nutch as a TLP because we:

1. have a 1.0 release (and several official 0.x releases and 0.x.y patch releases)
2. have the system in real-world operations
3. have a plan going forward for a next gen or 2.0 architecture

As for Nutch being an integration platform for existing Lucene components, I think that Nutch should certainly make use of existing functionality where it makes sense (Tika, Solr, etc.), but we really need to take a hard look at insulating the core POJO model of Nutch (Brin and Page paper here folks, I'm talking the Anatomy of a Large-Scale Hypertextual Web Search Engine) from the underlying technology substrate. That would be on my list of top goals for Nutch as a TLP. In fact, even thinking about this, I think it lends itself very nicely to a category of sub-projects (e.g., Nutch-Hadoop, Nutch-JMS, etc.) to think about from a TLP perspective.

Anyways, just wanted to chime in. I'll add more tomorrow.
Thanks,
Chris

On 3/17/09 7:05 PM, Marc Boucher marc.bouc...@hyperix.com wrote:

> Dennis,
>
> That adds another dimension to the issue which I had not considered. One avenue as you suggest would be to add another committer to the Lucene PMC. If that does not work then maybe going the route of TLP is the best option.
>
> Marc
>
>> Part of this is about releases. Currently releases are voted on by Lucene PMC members and it takes 3 members to confirm a vote. There are only 2 Nutch committers on the Lucene PMC. So for releases, not that we have had many recently, other Lucene PMC members who may not be actively associated with Nutch would need to vote to release. If Nutch was a TLP there would be a Nutch PMC which would most likely include all current Nutch committers. The other option may be to add another Nutch committer to the Lucene PMC.
>>
>>> My thoughts. And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.
>>
>> Any and every contribution is welcome.
>>
>> Dennis

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: The Future of Nutch
Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes ku...@apache.org wrote:

> With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.
>
> Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them as most of my work focuses on large scale web search as opposed to vertical search). Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting towards vertical or intranet type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to consider moving Nutch to a top level apache project out from under the Lucene umbrella? This would then allow the creation of nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?
>
> Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the Crawler. It begs the question of what Nutch's core features are going forward. When I think about search (again my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

I think nutch-solr and nutch-hbase should be in one unified project :) I can understand the difficulty (for newcomers) if we start depending on too many external projects. It would certainly be confusing to have to start a solr server then hbase master/slaves just to be able to crawl one intranet website locally.

On the other hand, if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings, I am worried we will have to create a waaay too generic interface to deal with them and not reap the advantages of using solr over lucene and hbase over hadoop. Also, more backends possibly mean more bugs and more integration problems.

So I think delegating nutch functionality to other projects (tika/droids/solr/etc) is a great idea (so nutch can focus on the brains as Dennis said), but I don't like the idea of separating nutch into pieces. So I guess for a small vertical search engine, it may seem unnecessary to also deal with solr/etc, but as long as we have good documentation*, they are not that difficult to handle. And they don't have a large performance/memory overhead.

About the vertical/large-scale search engine split: I guess a good example here is Dennis' FieldIndexer work. It is much more flexible for people who want to extend nutch's indexing architecture, but maybe overkill (and I am not convinced that it is) for people wanting to run vintage nutch on a small scale. I, again, don't like splitting nutch into two (or three, four...) parts like this. But I think having different crawl paths for different users is much more manageable than having different architectures. So we always use solr/hbase/etc. as our architecture. But you can run a one-job indexer if you want or run FieldIndexer. You can use the on-the-fly scoring scheme or you can use page rank/other complex offline scoring schemes.

> And one of the biggest things I see is many newcomers to nutch have a very hard time getting started. Part of this is understanding mapreduce mentality, part is documentation, part is there is only so much time some of us have to answer questions so some questions go unanswered on the lists. How might this be improved going forward?

Docs, docs, docs :D

> Any other thoughts also welcome. Really I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.

Thanks for starting the discussion Dennis.

> Dennis

* And we don't have good documentation right now (and I am much to blame for it :). I think this should be an explicit goal for us in the future. I am thinking something like no major features without documentation in the wiki.

--
Doğacan Güney
Re: The Future of Nutch
I actually use Nutch as a large scale search engine on two products. I think a few things that would be nice to have are built-in options to produce an incremental index and maybe a Quartz scheduler to automate it completely.

One thing that would be nice is that when one of us figures something out, like doing an incremental index, we would create a document and post it to the wiki. Documentation has been one of the big hurdles for me.

Thanks for all your hard work and I hope to contribute to the project soon.

Alex

--- On Fri, 3/13/09, Dennis Kubes ku...@apache.org wrote:

From: Dennis Kubes ku...@apache.org
Subject: The Future of Nutch
To: nutch-user@lucene.apache.org
Date: Friday, March 13, 2009, 7:19 PM

With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.

Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them as most of my work focuses on large scale web search as opposed to vertical search). Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting towards vertical or intranet type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.

Since there are different purposes for different users, would it be good to consider moving Nutch to a top level apache project out from under the Lucene umbrella? This would then allow the creation of nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?

Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the Crawler. It begs the question of what Nutch's core features are going forward. When I think about search (again my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

And one of the biggest things I see is many newcomers to nutch have a very hard time getting started. Part of this is understanding mapreduce mentality, part is documentation, part is there is only so much time some of us have to answer questions so some questions go unanswered on the lists. How might this be improved going forward?

Any other thoughts also welcome. Really I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.

Dennis
Re: The Future of Nutch
Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.

Before I get into my thoughts on what direction things should take, I would like to offer a thought on why Nutch is not as active as other groups. I think in part it's because of what Nutch represents, and that's the ability to create a large scale search. Some developers would rather use Nutch and associated tools and keep quiet about it because of their goals, which in some cases might mean competing against the likes of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to compete with those companies on large scale search but I can see competition in the vertical markets. And while Solr is hot these days, it's intended primarily for the enterprise market, which is very different from the large scale and vertical markets.

Now on to the future. I agree with many of the thoughts Otis put forward. While Nutch has its problems, other than Heritrix there is no other open source system available, and Nutch's ability to perform web-wide crawls must be preserved. However I'm thinking we should have a modular approach to Nutch. For instance, why just one fetcher? Why not keep the current one but also allow for the possibility of using Droids? Parsing can and should include Tika. I'm not sure about outsourcing indexing and searching to Solr but that could be a modular option as well.

I'm not sure if Nutch should become a top level project and move out from under Lucene. Lucene has great visibility and for many reasons. If Nutch was moved, would it still attract enough attention? It's been noted that developer interest in Nutch is different from that in Lucene, Solr etc. On the other hand it might do Nutch good to go TLP, as maybe then it would attract more developers, especially if it was packaged differently.

My thoughts.
And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.

Marc Boucher
http://hyperix.com
Re: The Future of Nutch
Marc,

Glad you responded. Always good to hear people's thoughts.

Marc Boucher wrote:

> Dennis, Otis et al,
>
> My very small team has kept silent for a long time. We've been playing with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
>
> Before I get into my thoughts on what direction things should take I would like to offer a thought on why Nutch is not as active as other groups. I think in part it's because what Nutch represents and that's the ability of creating a large scale search. Some developers would rather use Nutch and associated tools and keep quiet about it because of their goals, which in some case might mean competing against the likes of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to compete with those companies on large scale search but I can see competition in the vertical markets. And while Solr is hot these days it's intended primarily for the enterprise market which is very different than the large scale and vertical markets.

I completely agree. The group of people/companies that are creating large scale search solutions, whether whole web or vertical, is much smaller than, say, the group doing enterprise search or even the potential uses for Hadoop.

> Now on to the future. I agree with many of the thoughts Otis put forward. While Nutch has it's problems other than Heritrix there is no other open source system available and Nutch's ability to perform web-wide crawls must be preserved. However I'm thinking we should have modular approach to Nutch. For instance, why just one fetcher? Why not keep the current one but also allow for the possibility of using Droids? Parsing can and should include Tika. I'm not sure about outsourcing indexing and searching to Solr but that could be a modular option as well.

Yup. It should IMO also be easy to install and configure. I was having a discussion today where the main topic was: could we make Nutch have a nice graphical web interface for configuration, where you could drop it in, change some options, and create a customized vertical search over x domains?

> I'm not sure if Nutch should become a top level project and move out from under Lucene. Lucene has great visibility and for many reasons. If Nutch was moved, would it still attract enough attention? It's been noted that developer interest in Nutch is different that Lucene, Solr etc. On the other hand it might do Nutch good to go TLP as maybe then it would attract more developers especially if it was packaged differently.

Part of this is about releases. Currently releases are voted on by Lucene PMC members and it takes 3 members to confirm a vote. There are only 2 Nutch committers on the Lucene PMC. So for releases, not that we have had many recently, other Lucene PMC members who may not be actively associated with Nutch would need to vote to release. If Nutch was a TLP there would be a Nutch PMC which would most likely include all current Nutch committers. The other option may be to add another Nutch committer to the Lucene PMC.

> My thoughts. And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.

Any and every contribution is welcome.

Dennis
Re: The Future of Nutch
Dennis,

That adds another dimension to the issue which I had not considered. One avenue as you suggest would be to add another committer to the Lucene PMC. If that does not work then maybe going the route of TLP is the best option.

Marc

> Part of this is about releases. Currently releases are voted on by Lucene PMC members and it takes 3 members to confirm a vote. There are only 2 Nutch committers on the Lucene PMC. So for releases, not that we have had many recently, other Lucene PMC members who may not be actively associated with Nutch would need to vote to release. If Nutch was a TLP there would be a Nutch PMC which would most likely include all current Nutch committers. The other option may be to add another Nutch committer to the Lucene PMC.
>
>> My thoughts. And hopefully in the near future my small team will be able to contribute to Nutch in a meaningful way.
>
> Any and every contribution is welcome.
>
> Dennis
Re: The Future of Nutch
Hello,

Comments inlined.

----- Original Message -----
From: Dennis Kubes ku...@apache.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 8:19:37 PM

> With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.
>
> Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them as most of my work focuses on large scale web search as opposed to vertical search).

Yes, there are fewer parties doing large scale web crawling. Still, as there is no alternative fetcher+parser+indexer+searcher capable of handling large scale deployments like Nutch (or maybe Heritrix has the same scaling capabilities?), I think Nutch's ability to perform web-wide crawls, etc. should be preserved.

> Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting towards vertical or intranet type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.

That's my experience, too. I think we can have both under the same Nutch roof.

> Since there are different purposes for different users, would it be good to consider moving Nutch to a top level apache project out from under the Lucene umbrella? This would then allow the creation of nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?

I disagree, at least in the near term. There is nothing preventing those sub-projects existing under Nutch today. Both Solr and Lucene have the contrib area where similar sub-projects live. I think it's not a matter of being a TLP, but rather attracting enough developer interest, then user interest, and then contributor interest, so that these sub-projects can be created, maintained, advanced. Right now, Solr gets a TON of attention, as does Lucene. Nutch gets the least developer attention, and for some reason the nutch-user subscribers feel a bit different from solr-user or java-user subscribers.

> Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the Crawler. It begs the question of what Nutch's core features are going forward. When I think about search (again my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

My feeling has long been that indexing and searching should be outsourced to Solr, parsing to Tika, and that the fetcher should probably be replaced with Droids. I say probably because I'm not very familiar with Droids just yet. Nutch should, I think, then be an application built with all those components combined (is that what you mean by the shell?), and then apply its knowledge of either web-wide scale trickery, or vertical SE trickery, or ... I think that's where the brains are needed, to tie it all together, while still making certain pieces swappable and more easily digestible by potential new contributors and developers, as well as users. I know plugins do some of that already, but it seems like there might still be more in the fore than there should/could be...

> And one of the biggest things I see is many newcomers to nutch have a very hard time getting started. Part of this is understanding mapreduce mentality, part is documentation, part is there is only so much time some of us have to answer questions so some questions go unanswered on the lists. How might this be improved going forward?

I am not 100% sure, but I think it's a bit of all of the above. Lucene has been around for 10 years and from day one had people answer questions from the most basic ones to the trickiest ones. It's the same with Solr today. Nutch has the least active and the smallest developer base, so questions don't get answered. Again, people on this list also tend to have a different style of asking questions - no hellos, no thank yous, and so on, which doesn't help. I think the existence of a book on Lucene helped Lucene, but Solr doesn't yet have a book, yet it still has a healthy developer and user community. I think that's because Solr is simply more needed by more people than Nutch is.

> Any other thoughts also welcome. Really I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.

I think it's good you started this discussion. My opinion about what needs to be done with Nutch is above. I
Re: The Future of Nutch
I just wish there could be some clear documentation for Nutch/Solr integration publicly available. Or some developers are already working on this?

- Tony
Re: The Future of Nutch
Hi: I also agree that the most usage scenarios of nutch are in vertical search area. and in some unusual case users may don't even use nutch indexing at all. they just crawl some pages as mirror purpose. and in some cases of vertical search, user only need a fraction of pages, e.g. house rent info, restraunt info. so how about distribute nutch as components so that crawler can be indepently used without indexing. that's actually what droid is trying to address. but nutch can also do it in a more scalable way. another point is that, if nutch need to be more easily customized for special cases such as vertical search, new ranking machenism must be introduced. tf/idf just can not work. maybe machine learning scheme such as text classifier can be employed. it is great for nutch to be a top apache project, because subprojects for special case can be created for easier customization. i have also seen posts about using spring as nutch components assemely framework. maybe it can be created as subproject for spring users. just my 2 cents good luck yanky 2009/3/14 buddha1021 buddha1...@yahoo.cn hi dennis: Nutch's original intention was as a large-scale www search engine. I am very agreeing with you! Dennis! nutch's goal is specificly that achives the goal like google to process the large-scale datas! There is no doubt that nutch will be a www search engine absolutely,but absolutely not a vertical search ! I am confident that hadoop can process the large datas of the www search engine! But lucene? I am afraid of the limited size of lucene's index per server is very little ,10G? or 30G? this is not enough for the www search engine! IMO, this is a bottleneck! how many pages do visvo search currently? 100 millions? or 1000 millions? IMO ,it will be very good that moving Nutch to a top level apache project out from under the Lucene umbrella ! but all the sub-projects of nutch should be active enough, if not, nutch's develop will be slow and it is no good for nutch's unity. 
So the number of sub-projects should be small, and the sub-projects should be active, efficient, and strong enough! Good luck!

Dennis Kubes-2 wrote:

With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on them.

Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them, as most of my work focuses on large-scale web search as opposed to vertical search.) Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting vertical or intranet-type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search engine, IMO most people aren't using it for that purpose.

Since there are different purposes for different users, would it be good to consider moving Nutch to a top-level Apache project, out from under the Lucene umbrella? This would then allow the creation of Nutch sub-projects, such as nutch-solr and nutch-hbase. Thoughts?

Many parts of Nutch have also been implemented in other projects: for example, Tika for the parsers and Droids for the crawler. This raises the question of what Nutch's core features are going forward. When I think about search (again, my perspective is large scale), I think of crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.

And one of the biggest things I see is that many newcomers to Nutch have a very hard time getting started.
Part of this is understanding the MapReduce mentality, part is documentation, and part is that there is only so much time some of us have to answer questions, so some questions go unanswered on the lists. How might this be improved going forward?

Any other thoughts are also welcome. Really, I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.

Dennis

--
View this message in context: http://www.nabble.com/The-Future-of-Nutch-tp22507507p22508747.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: The Future of Nutch
I have been using Nutch for more than four years now as a vertical search engine, having indexed, at times, over one million pages. On the other hand, I know nothing about programming or specialized applications. Words like Solr are alien to me. I am just interested in a search engine that someone can actually use, not an application that serves as a base for developing sophisticated models.

So what I personally want for the future of Nutch is that it does not turn into such a complicated application that only very skilled people can use it. I hope that Nutch always keeps an eye on the real users, who want it for plain searching.

Thanks

----- Original Message -----
From: Dennis Kubes ku...@apache.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 9:19 PM
Subject: The Future of Nutch
Re: The Future of Nutch
I think that this would be the case for making Nutch a top-level Apache project, so that you can publish the framework and a complete app but still tie it all together. Personally, I think that is the strength of Nutch: you can use it right out of the box, without programming, but all of the extensibility (customization) is there so that you can extend it if you so desire.

-John

On Mar 14, 2009, at 9:44 AM, consultas wrote:
Re: The Future of Nutch
Dennis, I am with you: I am building a large-scale www search engine, but I might also build a vertical search engine as well. Aren't the requirements for building a large-scale www search the same as for building a vertical www search? The only thing that seems to change is the scope.

I like the idea of making Nutch work with multiple types of crawlers (maybe a crawler plugin kind of thing). I have looked at Droids and it seems interesting.

Regarding the Solr integration, I am not sure that I agree with you on that point, as I have considered using the Solr integration for my WWW index. The main reason is that Solr seems to have stronger search engine features at this point, like faceting, collapsing, synonyms, spelling, etc., but Nutch clearly has crawling and processing large amounts of data into an index down pat.

Regarding MapReduce: if it is good enough for Google, then it is good enough for Nutch.

I think that if you segment Nutch into too many sub-projects, you lose the flexibility or ability to have a good, single, solid, scalable search engine.

Just my .02 cents.

-John

On Mar 13, 2009, at 6:19 PM, Dennis Kubes wrote: