Hi,

The proposal about the query filters is quite interesting. I have run a
simple test comparing the performance of single-field and multi-field
queries on a 200k+ document index. I added timing code in
DistributedSearch$Client like this:

// time a single distributed search call (top 10 hits)
long start = System.currentTimeMillis();
Hits hits = client.search(query, 10, null, null, false);
System.out.println("time: " + (System.currentTimeMillis() - start));

I have seen a 2-3x improvement for single-field queries:

site:term1  vs.  term1

Note, though, that I did not implement query filters that search only
the content, title, or host field, and I suspect that the improvement
is largely due to the fairly low number of documents returned by the
former query; most of the latency seems to come from the RPC calls.
We could measure the real index search time from NutchBean's main
method, but then the index readers are not warmed up, so we would
have to run some dummy queries before measuring.
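
For example, the warm-up could be as simple as this (just a sketch,
reusing the client.search() call above; dummyQueries stands for
whatever set of representative queries we pick):

// warm up the index readers with a few throwaway searches first
// (dummyQueries is any array of representative queries)
for (int i = 0; i < dummyQueries.length; i++) {
  client.search(dummyQueries[i], 10, null, null, false);
}

// then time the query we actually care about
long start = System.currentTimeMillis();
Hits hits = client.search(query, 10, null, null, false);
System.out.println("time: " + (System.currentTimeMillis() - start));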

Andrzej, have you tested the runtimes of these complex queries? If so,
would you share the results?

> Looking at the code of Lucene's TermQuery and TermScorer it should
> be possible to score the document differently depending on the
> absolute term position (we would just need to retrieve TermPositions
> instead of TermDocs, as TermQuery does now). The only problem now is
> to determine efficiently which sub-field we are in at the moment, in
> order to apply different boosts.
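
On the question of determining the sub-field efficiently: with the
fixed layout you propose below (url, title, host, content, then
anchors, each with a maximum size), a simple lookup table in the
scorer might already be enough. A very rough sketch, with the sizes
and boosts taken from your mail (the class and method names are made
up):

class SuperFieldBoosts {
  // end position (exclusive) of each sub-field in the "super" field,
  // using the proposed sizes: url (100), title (100), host (20),
  // content (10,000), anchors (unlimited)
  private static final int[] ENDS =
      { 100, 200, 220, 10220, Integer.MAX_VALUE };
  // query-basic boosts for url, title, host, content, anchor
  private static final float[] BOOSTS =
      { 4.0f, 1.5f, 2.0f, 1.0f, 2.0f };

  // map an absolute term position to the boost of the sub-field it
  // falls in
  static float boostForPosition(int position) {
    for (int i = 0; i < ENDS.length; i++) {
      if (position < ENDS[i]) {
        return BOOSTS[i];
      }
    }
    return 1.0f; // not reached, anchors cover the remaining positions
  }
}

The scorer would call this for every position returned by
TermPositions and fold the boost into the score of that occurrence;
whether that is fast enough in practice would need to be measured.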

I see another possible issue with using a single field: tf and idf
values will change when data from multiple fields is mashed together
into one field, and this will affect scoring. Lucene's Similarity
interface does not pass the field name as an argument to tf() and
idf(), so this interface would have to change as well. However, single
field or not, I think Similarity#tf() and Similarity#idf() should
always receive the field name, so that implementations of Similarity,
such as NutchSimilarity, can compute tf and idf per field. This could
be useful, for example, for the type field, where the idf of a type
does not make much sense. What do you think, should we propose this to
the Lucene community?
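
Roughly, the change I have in mind would look like this (hypothetical
signatures, not the current Lucene API; the only addition is the
field name argument):

import org.apache.lucene.search.Similarity;

// sketch of field-aware scoring factors; today tf() and idf() do not
// know which field they are scoring (only lengthNorm() receives the
// field name)
public abstract class FieldAwareSimilarity extends Similarity {

  // per-field term frequency factor
  public abstract float tf(String fieldName, float freq);

  // per-field inverse document frequency factor
  public abstract float idf(String fieldName, int docFreq, int numDocs);
}

A NutchSimilarity built on top of this could, for example, return a
constant idf for the type field.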



Andrzej Bialecki wrote:
> Hi all,
>
> This very long post is meant to initiate a discussion. There is no 
> code yet. Be warned that it discusses low-level Nutch/Lucene stuff.
>
> Nutch queries are currently translated into complex Lucene queries. 
> This is necessary in order to take into account score factors coming 
> from various document parts, such as URL, host, title, content, and 
> anchors.
>
> Typically, the translation provided by query-basic looks like this for 
> single term queries:
>
> (1)
> Query: term1
> Parsed: term1
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0)
>
> For queries consisting of two or more terms it looks like this (Nutch 
> uses implicit AND):
>
> (2)
> Query: term1 term2
> Parsed: term1 term2
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
> content:term2 title:term2^1.5 host:term2^2.0) url:"term1 
> term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 
> term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 
> term2"~2147483647^2.0
>
> By the way, please note the absurd default slop value - in case of 
> anchors it defeats the purpose of having the ANCHOR_GAP ...
>
> Let's list other common query types:
>
> (3)
> Query: term1 term2 term3
> Parsed: term1 term2 term3
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
> content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 
> anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
> url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 
> term3"~4^2.0 content:"term1 term2 term3"~2147483647 title:"term1 term2 
> term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0
>
> For phrase queries it looks like this:
>
> (4)
> Query: "term1 term2"
> Parsed: "term1 term2"
> Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
> content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)
>
> For mixed term and phrase queries it looks like this:
>
> (5)
> Query: term1 "term2 term3"
> Parsed: term1 "term2 term3"
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 
> term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
> term3"^2.0)
>
> For queries with NOT operator it looks like this:
>
> (6)
> Query: term1 -term2
> Parsed: term1 -term2
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 
> content:term2 title:term2^1.5 host:term2^2.0)
>
> (7)
> Query: term1 term2 -term3
> Parsed: term1 term2 -term3
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
> content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 
> anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
> url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 
> content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 
> host:"term1 term2"~2147483647^2.0
>
> (8)
> Query: "term1 term2" -term3
> Parsed: "term1 term2" -term3
> Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
> content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) 
> -(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 
> host:term3^2.0)
>
> (9)
> Query: term1 -"term2 term3"
> Parsed: term1 -"term2 term3"
> Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
> title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 
> term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
> term3"^2.0)
>
>
> WHEW ... !!! Are you tired? Well, Lucene is tired of these queries 
> too. They are too long! They are absurdly long and complex. For large 
> indexes the time to evaluate them may run into several seconds, which 
> in most cases is unacceptable. Please note that we haven't yet 
> considered other query plugins - however, their impact may be 
> controlled to some extent through a built-in cache of QueryFilters in 
> LuceneQueryOptimizer.
>
> We can use the trick with sorted postings by PageRank score (see 
> IndexSorter for details) and limiting the scanning of postings to 
> top-k documents, but it works only when you have a clear idea which 
> documents are more important in a query-independent fashion.
>
> I've been thinking about how to optimize these queries so that they 
> have a simpler structure.
>
> My initial idea was this: instead of building a multi-clause 
> BooleanQuery consisting of the same term in multiple fields perhaps we 
> could implement a special kind of term query, and build a special 
> index field for this type of query.
>
> Let's see what fields are involved in each multi-field query:
>
> +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 
> host:term1^2.0)
>
> We have the following, each with different boosts:
>
> * url (stored, indexed, tokenized)
> * anchor (un-stored, indexed, tokenized)
> * content (un-stored, indexed, tokenized)
> * title (stored, indexed, tokenized)
> * host (un-stored, indexed, tokenized)
>
> If we simply concatenated all text together we could've used a simple 
> TermQuery, but this would have several disadvantages:
>
> * field's lengthNorm would be skewed because now the length would 
> depend on the total length of all concatenated fields
>
> * we would lose the ability to boost terms matching in different parts 
> of the document.
>
> So this solution is not acceptable.
>
> (Please bear with me - I'm coming now to a proposed solution ...)
>
> Still it's tempting to be able to use a single field ... If we are to 
> use a single field approach, then we need a way to mark boundaries of 
> each field content, and apply different boosts to different parts. The 
> first requirement can be implemented by using positionIncrement (gaps 
> in tokens), the second requirement perhaps may be implemented by a 
> custom Lucene Query implementation.
>
> Looking at the code of Lucene's TermQuery and TermScorer it should be 
> possible to score the document differently depending on the absolute 
> term position (we would just need to retrieve TermPositions instead of 
> TermDocs, as TermQuery does now). The only problem now is to determine 
> efficiently which sub-field we are in at the moment, in order to apply 
> different boosts.
>
> One idea that I have is to artificially limit the maximum size of each 
> field from the above list, and put their content in the "super-field" 
> in the following order:
>
> * url - 100 terms
> * title - 100 terms
> * host - 20 terms
> * content - 10,000 terms
> * anchor - unlimited
>
> This way, as the scorer goes through TermPositions, for e.g. term at 
> position 15 it would apply the url boost, for term at position 115 it 
> would apply the title boost, and so on.
>
> This should solve the issue of applying different query boosts for 
> different document parts. The only issue to tackle is the lengthNorm 
> (actually, lengthNorm * fieldBoost, because they are saved together as 
> a single float). I don't have any good answer to that right now ... we 
> would have to store 5 different norm values for this super-field, 
> which is not supported by Lucene now ... perhaps we could build this 
> information outside the index and make it somehow accessible to the 
> scorer.
>
> Sooooooo .... assuming we could implement all of the above, let's see 
> how this would simplify the queries above (I use surrounding angle 
> brackets to indicate our NutchSuperQuery, and the name of the 
> super-field is "super"):
>
> (1)
> Query: term1
> Parsed: term1
> Translated: +super:<term1>
>
> Wow! that's a great simplification. The cost of this query should be 
> close to a TermQuery.
>
> (2)
> Query: term1 term2
> Parsed: term1 term2
> Translated: +super:<term1> +super:<term2> super:<"term1 
> term2"~2147483647>
>
> (the above would mean a nested PhraseQuery inside the NutchSuperQuery).
>
> (3)
> Query: term1 term2 term3
> Parsed: term1 term2 term3
> Translated: +super:<term1> +super:<term2> +super:<term3> super:<"term1 
> term2 term3"~2147483647>
>
> (4)
> Query: "term1 term2"
> Parsed: "term1 term2"
> Translated: +super:<"term1 term2">
>
> (5)
> Query: term1 "term2 term3"
> Parsed: term1 "term2 term3"
> Translated: +super:<term1> +super:<"term2 term3">
>
> (6)
> Query: term1 -term2
> Parsed: term1 -term2
> Translated: +super:<term1> -super:<term2>
>
> (7)
> Query: term1 term2 -term3
> Parsed: term1 term2 -term3
> Translated: +super:<term1> +super:<term2> -super:<term3> super:<"term1 
> term2"~2147483647>
>
> (8)
> Query: "term1 term2" -term3
> Parsed: "term1 term2" -term3
> Translated: +super:<"term1 term2"> -super:<term3>
>
> (9)
> Query: term1 -"term2 term3"
> Parsed: term1 -"term2 term3"
> Translated: +super:<term1> -super:<"term2 term3">
>
> The degree of simplification is very substantial. Our NutchSuperQuery 
> doesn't have to do much more work than a simple TermQuery, so we can 
> assume that the cost to run it is the same as TermQuery times some 
> constant. What we gain then is the cost of not running all those 
> boolean clauses ...
>
> If you're still with me at this point I must congratulate you. :) 
> However, that's as far as I thought it through for now - let the 
> discussion start! If you are a Lucene hacker I would gladly welcome 
> your review or even code contributions .. ;)
>

