Re: Searching a URL with a PrefixQuery / Too Many Clauses (again...)

Erik Hatcher Thu, 28 Jul 2005 10:55:41 -0700


On Jul 28, 2005, at 12:37 PM, Chris May wrote:

Works beautifully (at least on my 30K-document test index). I'll need to do some fiddling if I want to allow partial URLs (i.e. http://www2.warwick.ac.uk/ab* to match http://www2.warwick.ac.uk/about) but I can see how to do that, I think (and I'm not sure I need it anyway).
 Thanks Scott!
Incidentally, is there an easy way to make QueryParser not treat the colon in 'http://' as a term separator? It seems that URLs get broken into two chunks ('http' and 'www.warwick.ac.uk/somewhere') before they get fed to my custom analyzer. I got round it by just constructing the PhraseQuery by hand, but I wonder if there's an easier way?

I'm not sure what string you're passing to QP, but the : denotes a field selector (such as title:lucene). There is no easy way for QueryParser to deal with that differently - it'd be a custom parser at that point. You can backslash-escape it (\:), but that is probably not desirable. Or you could pre-process the string from the user before handing it to QP and escape it under the covers.
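
A minimal sketch of the pre-processing idea, assuming only the colon of an http:// or https:// scheme needs escaping and that field selectors such as url: must stay intact. The class and method names (UrlQueryEscaper, escapeSchemeColons) are hypothetical and not part of Lucene; the result is a plain string that would then be handed to QueryParser.

```java
final class UrlQueryEscaper {
    private UrlQueryEscaper() {}

    /**
     * Backslash-escapes the colon of an http:// or https:// scheme so that
     * a QueryParser-style syntax does not read "http" as a field name.
     * Other colons (for example the one in "url:") are left alone, and a
     * scheme colon that is already escaped no longer matches the pattern.
     */
    static String escapeSchemeColons(String userQuery) {
        // Match "http" or "https" followed by ":" only when "//" comes next.
        // The replacement re-emits the scheme, then a literal backslash and colon.
        return userQuery.replaceAll("(?i)\\b(https?):(?=//)", "$1\\\\:");
    }
}
```

With this sketch, url:http://www2.warwick.ac.uk/about becomes url:http\://www2.warwick.ac.uk/about, and running it a second time leaves the string unchanged. Whether other characters in the URL also need escaping depends on the query parser version in use.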


    Erik

Chris

On 28 Jul 2005, at 02:02, Scott Ganyo wrote:
Chris,
How about indexing the domain as one field and each part of the path as separate terms in another field? I'm sure you've probably already thought of doing this... and maybe discarded the idea because you'd lose the position information. However, even though you can't just simply split the URL on '/' and shove it into the field, you can add the position information back into the term and then put it into the field. Then, you would be able to completely ditch the prefix query and still retrieve the documents using the entire, ordered path in (I think) the most efficient way possible.
For example:
http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/
becomes something like (using n/*** to identify the position):

domain: www2.warwick.ac.uk
path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees, 7/modules, 8/commonlaw
And you could search based on any prefix you desired. For example, searching for this:
http://www2.warwick.ac.uk/fac/soc/law/*
would end up being a Lucene search that looks something like this (note: not query parser syntax!):
domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND path: 3/law
Does that make sense?  Would it work for you?
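
The splitting scheme described above could be sketched roughly as follows. This is plain Java with no Lucene calls: it only produces the strings that would be indexed in the domain and path fields, and the list of required clauses for a prefix search. The class and method names (UrlPathTerms, domain, pathTerms, requiredClauses) are hypothetical, and the sketch assumes the prefix ends on a '/' boundary, optionally followed by '*'; a partial last segment such as /ab* is not handled.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class UrlPathTerms {
    private UrlPathTerms() {}

    /** Host part of the URL, intended for the "domain" field. */
    static String domain(String url) {
        return URI.create(url).getHost();
    }

    /** Path segments tagged with their 1-based position, e.g. "1/fac", "2/soc". */
    static List<String> pathTerms(String url) {
        List<String> terms = new ArrayList<>();
        String path = URI.create(url).getPath();
        if (path == null) {
            return terms;
        }
        int position = 1;
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            terms.add(position + "/" + segment);
            position++;
        }
        return terms;
    }

    /**
     * Clauses that must ALL match for a prefix such as
     * "http://host/fac/soc/law/*". Each entry is "field:term";
     * turning them into real query objects is left to the caller.
     */
    static List<String> requiredClauses(String urlPrefix) {
        // Drop a trailing wildcard; the remaining text is treated as whole segments.
        String prefix = urlPrefix.endsWith("*")
                ? urlPrefix.substring(0, urlPrefix.length() - 1)
                : urlPrefix;
        List<String> clauses = new ArrayList<>();
        clauses.add("domain:" + domain(prefix));
        for (String term : pathTerms(prefix)) {
            clauses.add("path:" + term);
        }
        return clauses;
    }
}
```

Because every clause is an exact term match, a prefix three levels deep needs four clauses however many documents sit beneath it, which is what sidesteps the clause limit that the PrefixQuery ran into.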

S

On Jul 27, 2005, at 3:56 PM, Chris May wrote:
Always domain + part of a path e.g.

url:http://blogs.warwick.ac.uk/chrismay/*

or
url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*
or

url:http://www2.warwick.ac.uk/services/its/*
... and so on. Part of the problem is that we may need to go an arbitrary number of levels down the path to get an acceptably small set of documents to start from - we couldn't impose a rule that said something like 'specify the first 2 directories on the path' (cf. my second example). We wouldn't need to query for the same path over different domains though (e.g. url:*.warwick.ac.uk/about/* )
thanks

Chris




On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
Could you give some examples of the types of PrefixQuerys you'd like to use? Is it always at a granularity of domain and path? Or are you wanting to do a prefix on pieces of the domain and path?
    Erik

On Jul 27, 2005, at 3:47 PM, Chris May wrote:
First, apologies for what seems to be something of an FAQ.
However, I've not been able to find an answer either in LIA or in the relevant section of the FAQ (http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831)
My setup is as follows: I have an index of a few hundred thousand web pages. I'd like to be able to construct queries that search for some arbitrary text within a specified URL. Kind of like Google's syntax
searchterm +site:www.foo.com/some/section
So, I have the page title & content indexed, and the URL stored as a keywords field, and I imagined that I'd be able to construct a query something like this:
    String[] fields = new String[]{DocumentFields.TITLE, DocumentFields.CONTENT};
    Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
    PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
    hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
However, as soon as the set of documents returned by the prefix query is more than a thousand or so, I get a TooManyClausesException, as you might expect.
AFAICS the solutions suggested in the FAQ don't seem to apply here: I'm already using a Filter, and that's not helping (pace suggestion 1), I don't think I can reduce the number of terms in the index, else my URLs wouldn't be unique any more, and increasing the number of clauses seems like a poor choice from a scalability point of view - I anticipate queries that could filter perhaps a hundred thousand documents or so.
I'm guessing that it might be possible to do something smart by splitting the URL up into multiple fields - for example, one for the host and one for the path, or even one for the host and one for host+path together - but I'm not clear on exactly how I'd use the two fields, and how they'd help. Can someone enlighten me?
Thanks in advance

Chris
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
