[Nutch-dev] Performance optimization for Nutch index / query

Andrzej Bialecki Thu, 22 Feb 2007 16:44:34 -0800

Hi all,

This very long post is meant to initiate a discussion. There is no code 
yet. Be warned that it discusses low-level Nutch/Lucene stuff.


Nutch queries are currently translated into complex Lucene queries. This 
is necessary in order to take into account score factors coming from 
various document parts, such as URL, host, title, content, and anchors.

Typically, the translation provided by query-basic looks like this for 
single term queries:

(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0)

For queries consisting of two or more terms it looks like this (Nutch 
uses implicit AND):

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) url:"term1 
term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 
term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 
term2"~2147483647^2.0

By the way, please note the absurd default slop value - in case of 
anchors it defeats the purpose of having the ANCHOR_GAP ...

Let's list other common query types:

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0 
content:"term1 term2 term3"~2147483647 title:"term1 term2 
term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0

For phrase queries it looks like this:

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)

For mixed term and phrase queries it looks like this:

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)

For queries with NOT operator it looks like this:

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0)

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 
content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 
host:"term1 term2"~2147483647^2.0

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) 
-(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 
host:term3^2.0)

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)


WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too. 
They are too long! They are absurdly long and complex. For large indexes 
the time to evaluate them may run into several seconds, which in most 
cases is unacceptable. Please note that we haven't yet considered other 
query plugins - however, their impact may be controlled to some extent 
through a built-in cache of QueryFilters in LuceneQueryOptimizer.

We can use the trick with sorted postings by PageRank score (see 
IndexSorter for details) and limiting the scanning of postings to top-k 
documents, but it works only when you have a clear idea which documents 
are more important in a query-independent fashion.

I've been thinking about how to optimize these queries so that they have 
a simpler structure.

My initial idea was this: instead of building a multi-clause 
BooleanQuery consisting of the same term in multiple fields perhaps we 
could implement a special kind of term query, and build a special index 
field for this type of query.

Let's see what fields are involved in each multi-field query:

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 
host:term1^2.0)

We have the following, each with different boosts:

* url (stored, indexed, tokenized)
* anchor (un-stored, indexed, tokenized)
* content (un-stored, indexed, tokenized)
* title (stored, indexed, tokenized)
* host (un-stored, indexed, tokenized)

If we simply concatenated all text together we could've used a simple 
TermQuery,, but this would have several disadvantages:

* field's lengthNorm would be skewed because now the length would depend 
on the total length of all concatenated fields

* we would lose the ability to boost terms matching in different parts 
of the document.

So this solution is not acceptable.

(Please bear with me - I'm coming now to a proposed solution ...)

Still it's tempting to be able to use a single field ... If we are to 
use a single field approach, then we need a way to mark boundaries of 
each field content, and apply different boosts to different parts. The 
first requirement can be implemented by using positionIncrement (gaps in 
tokens), the second requirement perhaps may be implemented by a custom 
Lucene Query implementation.

Looking at the code of Lucene's TermQuery and TermScorer it should be 
possible to score the document differently depending on the absolute 
term position (we would just need to retrieve TermPositions instead of 
TermDocs, as TermQuery does now). The only problem now is to determine 
efficiently which sub-field are we in at the moment, in order to apply 
different boosts.

One idea that I have is to artificially limit the maximum size of each 
field from the above list, and put their content in the "super-field" in 
the following order:

* url - 100 terms
* title - 100 terms
* host - 20 terms
* content - 10,000 terms
* anchor - unlimited

This way, as the scorer goes through TermPositions, for e.g. term at 
position 15 it would apply the url boost, for term at position 115 it 
would apply the title boost, and so on.

This should solve the issue of applying different query boosts for 
different document parts. The only issue to tackle is the lengthNorm 
(actually, lengthNorm * fieldBoost, because they are saved together as a 
single float). I don't have any good answer to that right now ... we 
would have to store 5 different norm values for this super-field, which 
is not supported by Lucene now ... perhaps we could build this 
information outside the index and make it somehow accessible to the scorer.

Sooooooo .... assuming we could implement all of the above, let's see 
how this would simplify the queries above (I use surrounding angle 
brackets to indicate our NutchSuperQuery, and the name of the 
super-field is "super"):

(1)
Query: term1
Parsed: term1
Translated: +super:<term1>

Wow! that's a great simplification. The cost of this query should be 
close to a TermQuery.

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +super:<term1> +super:<term2> super:<"term1 term2"~2147483647>

(the above would mean a nested PhraseQuery inside the NutchSuperQuery).

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +super:<term1> +super:<term2> +super:<term3> super:<"term1 
term2 term3"~2147483647>

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +super:<"term1 term2">

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +super:<term1> +super:<"term2 term3">

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +super:<term1> -super:<term2>

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +super:<term1> +super:<term2> -super:<term3> super:<"term1 
term2"~2147483647>

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +super:<"term1 term2"> -super:<term3>

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +super:<term1> -super:<"term2 term3">

The degree of simplification is very substantial. Our NutchSuperQuery 
doesn't have to do much more work than a simple TermQuery, so we can 
assume that the cost to run it is the same as TermQuery times some 
constant. What we gain then is the cost of not running all those boolean 
clauses ...

If you're still with me at this point I must congratulate you. :) 
However, that's as far as I thought it through for now - let the 
discussion start! If you are a Lucene hacker I would gladly welcome your 
review or even code contributions .. ;)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers

[Nutch-dev] Performance optimization for Nutch index / query

Reply via email to