Dennis Kubes wrote:

>> That's a very nice description - thanks, Dennis. I think it would be 
>> useful to include it on the Wiki as a case study.
> 
> I will polish it up a bit and put it out there.


Great, thanks.


>>> This is all dependent on the size of each local index.  Approximately 
>>> 2-4M pages per index split is good.  Over that you may see 
>>> performance decreases.  Scaling that out over many servers you will 
>>> see almost linear response time.  We have almost 100M pages in the 
>>> index and are seeing subsecond response times on most queries.
>>
>> Are you running with a sorted index, and using non-zero 
>> searcher.max.hits? If you use a well-defined PR-like scoring, then 
>> using this feature could do wonders for performance, and increase 
>> the max number of docs per server.
> 
> I don't know about the sorted index.  How do I learn about that?
> 
> We basically took the current indexer and extended it to split into 
> parts.  The indexer also splits the segments and linkdb into the same 
> parts so all data for a single url will be in the same split on the same 
> search server.  We are using searcher.max.hits at 1000 and we did see a 
> performance increase from that.

If you're using non-zero searcher.max.hits with un-sorted indexes, your 
ranking will be broken, i.e. the code in LuceneQueryOptimizer will make 
wrong assumptions about the extrapolation of scores for skipped 
documents. This feature strongly relies on having indexes sorted by 
PageRank score - see the IndexSorter tool for details. If you don't sort 
the index by PageRank, you should set this property to <= 0.
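
To illustrate why the ordering matters, here is a toy model in Python. It is 
only a sketch, not the actual LuceneQueryOptimizer code: the function name, 
the sample scores, and the simplification that a document's rank equals its 
static score are all assumptions made for the example.

```python
import heapq

def top_k_with_early_stop(docs, k, max_hits):
    """Scan docs in index order and return the k best scores seen.

    docs is a list of (static_score, matches_query) pairs in index order.
    If max_hits > 0, scanning stops once max_hits matching docs were seen
    (a toy stand-in for a non-zero searcher.max.hits).
    """
    seen = []
    for score, matches in docs:
        if not matches:
            continue
        seen.append(score)
        if max_hits > 0 and len(seen) >= max_hits:
            break  # everything after this point is skipped
    return heapq.nlargest(k, seen)

# Hypothetical scores, in arbitrary (unsorted) index order.
unsorted_index = [(0.2, True), (0.9, True), (0.1, True),
                  (0.8, True), (0.5, True), (0.95, True)]
# The same documents, ordered by descending static score.
sorted_index = sorted(unsorted_index, key=lambda d: d[0], reverse=True)

exact = top_k_with_early_stop(unsorted_index, k=2, max_hits=0)
sorted_limited = top_k_with_early_stop(sorted_index, k=2, max_hits=3)
unsorted_limited = top_k_with_early_stop(unsorted_index, k=2, max_hits=3)
```

In this toy case the early stop on the sorted index returns the same top 
two documents as the full scan, while the early stop on the unsorted index 
misses the best document because it sits past the cut-off.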

Also try upgrading Nutch to Lucene 2.2.0; this alone should give you a 
performance boost of a few percent (if Lucene is indeed the bottleneck).

See also my (long) rant about the complexity of Nutch queries: 
http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

