On Jul 28, 2005, at 12:37 PM, Chris May wrote:
Works beautifully (at least on my 30K-document test index). I'll
need to do some fiddling if I want to allow partial URLs (e.g.
http://www2.warwick.ac.uk/ab* matching http://www2.warwick.ac.uk/about),
but I can see how to do that, I think (and I'm not sure I
need it anyway).
Thanks Scott!
Incidentally, is there an easy way to make QueryParser not treat
the colon in 'http://' as a term separator? It seems that URLs get
broken into two chunks ('http' and 'www.warwick.ac.uk/somewhere')
before they get fed to my custom analyzer. I got round it by just
constructing the PhraseQuery by hand, but I wonder if there's an
easier way?
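(For the curious, 'by hand' means roughly this - an untested sketch
against the 1.4-era API, with the "url" field name standing in for
whatever your index actually uses:

// Run the raw URL through the analyzer ourselves and add each token
// to a PhraseQuery, bypassing QueryParser entirely.
// (IOException handling omitted for brevity.)
PhraseQuery pq = new PhraseQuery();
TokenStream ts = analyzer.tokenStream("url", new StringReader(rawUrl));
for (Token t = ts.next(); t != null; t = ts.next()) {
    pq.add(new Term("url", t.termText()));
}

i.e. feed the URL to the analyzer directly and build the query from the
resulting tokens.)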
I'm not sure what string you're passing to QP, but the colon denotes a
field selector (as in title:lucene). There is no easy way to make
QueryParser deal with it differently - it'd be a custom parser at
that point. You can backslash-escape it (\:), but that is probably not
desirable. Or you could pre-process the string from the user before
handing it to QP and escape it under the covers.
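For example, something along these lines - a deliberately naive,
untested sketch. Note that it also escapes colons the user meant as
real field selectors (like title:lucene), so you may want to limit it
to URL-shaped tokens:

// Escape every colon so QueryParser treats it as a literal character
// instead of a field selector.
String escaped = userInput.replaceAll(":", "\\\\:");
// "contents" is just an example default field name.
Query q = QueryParser.parse(escaped, "contents", analyzer);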
Erik
Chris
On 28 Jul 2005, at 02:02, Scott Ganyo wrote:
Chris,
How about indexing the domain as one field and each part of the
path as separate terms in another field? I'm sure you've probably
already thought of doing this... and maybe discarded the idea
because you'd lose the position information. However, even though
you can't simply split the URL on '/' and shove it into the
field, you can add the position information back into each term and
then put it into the field. Then you would be able to completely
ditch the prefix query and still retrieve the documents using the
entire, ordered path in (I think) the most efficient way possible.
For example:
http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/

becomes something like (using an n/ prefix to mark the position):

domain: www2.warwick.ac.uk
path: 1/fac, 2/soc, 3/law, 4/ug, 5/prospective, 6/degrees, 7/modules, 8/commonlaw
And you could search based on any prefix you desired. For example,
searching for this:
http://www2.warwick.ac.uk/fac/soc/law/*
would end up being a Lucene search that looks something like this
(note: not QueryParser syntax!):

domain: www2.warwick.ac.uk AND path: 1/fac AND path: 2/soc AND path: 3/law
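In code, the scheme might look roughly like this (an untested sketch
against the 1.4-era API; the field names are just illustrative):

// Indexing side: one untokenized Keyword term for the domain, plus
// one position-prefixed Keyword term per path segment.
Document doc = new Document();
doc.add(Field.Keyword("domain", "www2.warwick.ac.uk"));
String[] segs = "fac/soc/law/ug/prospective/degrees/modules/commonlaw".split("/");
for (int i = 0; i < segs.length; i++) {
    doc.add(Field.Keyword("path", (i + 1) + "/" + segs[i]));
}

// Query side: the prefix http://www2.warwick.ac.uk/fac/soc/law/*
// becomes a conjunction of exact TermQuerys (required=true,
// prohibited=false), so the clause count is fixed and small - no
// expansion, no TooManyClauses.
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("domain", "www2.warwick.ac.uk")), true, false);
q.add(new TermQuery(new Term("path", "1/fac")), true, false);
q.add(new TermQuery(new Term("path", "2/soc")), true, false);
q.add(new TermQuery(new Term("path", "3/law")), true, false);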
Does that make sense? Would it work for you?
S
On Jul 27, 2005, at 3:56 PM, Chris May wrote:
Always domain + part of a path, e.g.

url:http://blogs.warwick.ac.uk/chrismay/*

or

url:http://www2.warwick.ac.uk/fac/soc/law/ug/prospective/degrees/modules/commonlaw/*

or

url:http://www2.warwick.ac.uk/services/its/*

... and so on. Part of the problem is that we may need to go an
arbitrary number of levels down the path to get an acceptably
small set of documents to start from - we couldn't impose a rule
that said something like 'specify the first 2 directories on the
path' (cf. my second example). We wouldn't need to query for the
same path over different domains though (e.g. url:*.warwick.ac.uk/about/*).
thanks
Chris
On 27 Jul 2005, at 21:33, Erik Hatcher wrote:
Could you give some examples of the types of PrefixQuery you'd
like to use? Is it always at the granularity of domain and
path? Or do you want to do prefixes over pieces of the domain
and path?
Erik
On Jul 27, 2005, at 3:47 PM, Chris May wrote:
First, apologies for what seems to be something of an FAQ.
However, I've not been able to find an answer either in LIA or
in the relevant section of the FAQ
(http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).
My setup is as follows: I have an index of a few hundred
thousand web pages. I'd like to be able to construct queries
that search for some arbitrary text within a specified URL -
kind of like Google's syntax:

searchterm +site:www.foo.com/some/section
So, I have the page title & content indexed, and the URL stored
as a Keyword field, and I imagined that I'd be able to
construct a query something like this:

String[] fields = new String[] { DocumentFields.TITLE, DocumentFields.CONTENT };
Query searchTextQuery = MultiFieldQueryParser.parse(request.getSearchQuery(), fields, analyzer);
PrefixQuery urlPrefix = new PrefixQuery(new Term(DocumentFields.URL, request.getUrlPrefix()));
hits = searcher.search(searchTextQuery, new QueryFilter(urlPrefix));
However, as soon as the set of documents matched by the
PrefixQuery is more than a thousand or so, I get a
BooleanQuery.TooManyClauses exception, as you might expect.
AFAICS the solutions suggested in the FAQ don't seem to apply
here: I'm already using a Filter, and that's not helping (pace
suggestion 1); I don't think I can reduce the number of terms
in the index, else my URLs wouldn't be unique any more; and
increasing the clause limit seems like a poor choice from
a scalability point of view - I anticipate queries that could
filter perhaps a hundred thousand documents or so.
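(The workaround I'm rejecting would only be a one-liner - raising the
global clause cap, which defaults to 1024:

// Assuming 1.4-era Lucene; trades memory and speed for fewer exceptions.
BooleanQuery.setMaxClauseCount(100000);

but that just postpones the failure as the index grows.)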
I'm guessing that it might be possible to do something smart by
splitting the URL up into multiple fields - for example, one
for the host and one for the path, or even one for the host and
one for host+path together - but I'm not clear on exactly how
I'd use the two fields, and how they'd help. Can someone
enlighten me?
Thanks in advance
Chris