(Moving the discussion to nutch-dev, please drop the cc: when responding)
Doug Cutting wrote:
Andrzej Bialecki wrote:
It's nice to have these couple percent... however, it doesn't solve
the main problem; I need 50 or more percent increase... :-) and I
suspect this can be achieved only by
Dear all,
Currently I'm using the Nutch plug-in clustering-carrot2 and would like to
ask for some help. When I build the search result clusters, only the search
results that occur twice or more are grouped into one cluster. At the
same time, if some results (keywords) only occur once,
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359725 ]
Stefan Groschupf commented on NUTCH-133:
Doug,
ok, I will split things in different patches and open a set of new bugs.
Jerome:
If you take a careful look at my
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ]
Stefan Groschupf closed NUTCH-133:
--
Resolution: Won't Fix
We will split the problems described here into a set of bugs to fix things step
by step.
ParserFactory does not work as
Hi,
Some time ago I started to think about implementing a special kind of Lucene
Query optimized for Nutch (if I remember correctly, I would have to write my
own Scorer and probably a few other classes). I assumed that with a
specialized query I would be able to avoid accessing some of the Lucene index
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ]
Jerome Charron commented on NUTCH-133:
--
Stefan:
Taking a closer look at the ParserFactory patch:
1. You can use the MimeType.clean(String) static method to clean the
Doug Cutting wrote:
Implementing something like this for Lucene would not be too difficult.
The index would need to be re-sorted by document boost: documents would
be re-numbered so that highly-boosted documents had low document
numbers.
In particular, one could:
1. Create an array of
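
(The quoted message is cut off here. The renumbering idea described above, where highly boosted documents receive low document numbers, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the steps from the original message: the class name BoostRenumber, the method renumberByBoost, and the plain float array of per-document boosts are all hypothetical, and no Lucene API is used.)

```java
import java.util.Arrays;

// Hypothetical illustration: compute a mapping from old to new document
// numbers so that documents with higher boosts get lower numbers.
public class BoostRenumber {

    /** Returns oldToNew, where oldToNew[oldDoc] is the new document number. */
    public static int[] renumberByBoost(float[] boosts) {
        Integer[] order = new Integer[boosts.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort old document numbers by descending boost. Arrays.sort on
        // objects is stable, so equal boosts keep their original order.
        Arrays.sort(order, (a, b) -> Float.compare(boosts[b], boosts[a]));

        int[] oldToNew = new int[boosts.length];
        for (int newDoc = 0; newDoc < order.length; newDoc++) {
            oldToNew[order[newDoc]] = newDoc;
        }
        return oldToNew;
    }

    public static void main(String[] args) {
        float[] boosts = {0.5f, 2.0f, 1.0f, 2.0f};
        // prints [3, 0, 2, 1]
        System.out.println(Arrays.toString(renumberByBoost(boosts)));
    }
}
```

With such a mapping, a search that walks documents in increasing number order would presumably meet the most highly boosted documents first; how the index itself would be rewritten is not covered by this sketch.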
Hey folks,
We're looking at launching a search engine in the beginning of the new
year that will eventually grow to being a multi-billion page index.
Three questions:
First, and most important for now, does anyone have any useful numbers
for what the hardware requirements are to run such an
When you run multiple commands within Nutch, it seems to process the
pending tasks in the order that they were added to the queue. In some
cases this means you may be 50% through many jobs (map complete but not
reduce) while it processes maps for yet more jobs.
I think Nutch should prioritize a