Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
(Moving the discussion to nutch-dev, please drop the cc: when responding) Doug Cutting wrote: Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by

about the question of clustering-carrot2

2005-12-08 Thread charlie
Dear all, I'm currently using the Nutch plug-in clustering-carrot2 and would like to ask for some help. When I build the search result clusters, only search results that occur twice or more are grouped into one cluster. At the same time, if some results (keywords) only occur once,

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359725 ] Stefan Groschupf commented on NUTCH-133: Doug, ok, I will split things into different patches and open a set of new bugs. Jerome: If you take a careful look at my

[jira] Closed: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ] Stefan Groschupf closed NUTCH-133. Resolution: Won't Fix. We will split the problems described here into a set of bugs to fix things step by step. ParserFactory does not work as

Re: Lucene performance bottlenecks

2005-12-08 Thread Piotr Kosiorowski
Hi, some time ago I started to think about implementing a special kind of Lucene Query optimized for Nutch (if I remember correctly, I would have to write my own Scorer and probably a few other classes). I assumed that with a specialized query I would be able to avoid accessing some of Lucene index

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ] Jerome Charron commented on NUTCH-133: Stefan: Taking a closer look at the ParserFactory patch: 1. You can use the MimeType.clean(String) static method to clean the

Re: Lucene performance bottlenecks

2005-12-08 Thread Doug Cutting
Doug Cutting wrote: Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers. In particular, one could: 1. Create an array of
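The renumbering idea in the snippet above (highly-boosted documents get low document numbers) might be sketched as follows. This is only an illustration under stated assumptions: the class `BoostRenumbering` and the method `renumberByBoost` are hypothetical names, no Lucene API is used, and the boosts are assumed to be available as a plain `float[]` indexed by the old document number. It does not reproduce the numbered steps of the original message, which are truncated in this listing.

```java
import java.util.Arrays;

public class BoostRenumbering {

    /**
     * Hypothetical helper (not part of Lucene): given one boost per document,
     * returns a map where oldToNew[oldDoc] is the new document number.
     * Documents with higher boosts receive lower new numbers; ties keep
     * their original relative order because the object sort is stable.
     */
    public static int[] renumberByBoost(float[] boosts) {
        Integer[] order = new Integer[boosts.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort old document numbers by descending boost.
        Arrays.sort(order, (a, b) -> Float.compare(boosts[b], boosts[a]));

        // Invert the ordering: position in the sorted list is the new number.
        int[] oldToNew = new int[boosts.length];
        for (int newDoc = 0; newDoc < order.length; newDoc++) {
            oldToNew[order[newDoc]] = newDoc;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        // Assumed example boosts for four documents (old numbers 0..3).
        float[] boosts = {0.5f, 3.0f, 1.0f, 3.0f};
        System.out.println(Arrays.toString(renumberByBoost(boosts)));
    }
}
```

With such a map, an index rewrite would copy each document to its new number, so a search that scans documents in increasing number order sees the highest-boosted ones first.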

nutch questions

2005-12-08 Thread Ken van Mulder
Hey folks, We're looking at launching a search engine in the beginning of the new year that will eventually grow to being a multi-billion page index. Three questions: First, and most important for now, does anyone have any useful numbers for what the hardware requirements are to run such an

Should nutch try to reduce first?

2005-12-08 Thread Rod Taylor
When you run multiple commands within Nutch, it seems to process the pending tasks in the order that they were added to the queue. In some cases this means you may be 50% through many jobs (map complete but not reduce) while it processes maps for yet more jobs. I think Nutch should prioritize a