OK, forget the stuff about "TooManyBooleanClauses". I finally figured out
that if I specify the surround to have the same semantics as a SpanRegex (
i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes
more sense to me now.

Specifying 20w(eri*, mal*) is what I was using before.

Erick

On 10/9/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

OK, I'm using the surround code, and it seems to be working...with the
following questions (always, more questions)...

> I'm gettng an exception sometimes of TooManyBasicQueries. I can control
this by initializing BasicQueryFactory with a larger number. Do you have any
cautions about upping this number?

> There's a hard-coded value minimumPrefixLength set to 3 down in the code
Surround query parser (allowedSuffix). I see no method to change this. I
assume that this is to prevent using up too much memory/time. What should I
know about this value? I'm mostly interested in a justification for the
product manager why allowing, say, two character (or one character) prefixes
is a bad idea <G>.

> I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal
to Surround queries. That is, trying RegexSpanQuery doesn't want to work at
all with the same search clause, as it runs out of memory pretty
quickly......

However, working with three-letter prefixes is blazingly fast.........

Thanks again...

Erick

On 10/6/06, Paul Elschot < [EMAIL PROTECTED]> wrote:
>
> Mark,
>
> On Friday 06 October 2006 22:46, Mark Miller wrote:
> > Paul's parser is beyond my feeble comprehension...but I would start by
> > looking at SrndTruncQuery. It looks to me like this enumerates each
> > possible match just like a SpanRegexQuery does...I am too lazy to
> figure
> > out what the visitor pattern is doing so I don't know if they then get
> > added to a boolean query, but I don't know what else would happen. If
>
> They can also be added to a SpanOrQuery as SpanTermQuery,
> this depends on the context of the query (distance query or not).
> The visitor pattern is used to have the same code for distance queries
> and other queries as far as possible.
>
> > this is the case, I am wondering if it is any more efficient than the
> > SpanRegex implementation...which could be changed to a SpanWildcard
>
> I don't think the surround implementation of expanding terms is more
> efficient that the Lucene implementation.
> Surround does have the functionality of a SpanWildCard, but
> the implementation of the expansion is shared, see above.
>
> > implementation. How exactly is this better at avoiding a
> toomanyclauses
> > exception or ram fillup. Is it just the fact that the (lets say) three
>
> > wildcard terms are anded so this should dramatically reduce the
> matches?
>
> The limitation in BasicQueryFactory works for a complete surround query,
> which can be nested.
> In Lucene only the max nr of clauses for a single level BooleanQuery
> can be controlled.
>
> >...
>
> Regards,
> Paul Elschot
>
>
> > - Mark
> >
> > Erick Erickson wrote:
> > > Paul:
> > >
> > > Splendid! Now if I just understood a single thing about the
> SrndQuery
> > > family
> > > <G>.
> > >
> > > I followed your link, and took a look at the text file. That should
> > > give me
> > > enough to get started.
> > >
> > > But if you wanted to e-mail me any sample code or long explanations
> of
> > > what
> > > this all does, I would forever be your lackey <G>....
> > >
> > > I should also fairly easily be able to run a few of these against
> the
> > > partial index I already have to get some sense of now it'll all work
>
> > > out in
> > > my problem space. I suspect that the actual number of distinct terms
> > > won't
> > > grow too much after the first 4,000 books, so it'll probably be
> pretty
> > > safe
> > > to get this running in the "worst case", find out if/where things
> blow
> > > up,
> > > and put in some safeguards. Or perhaps discover that it's completely
> and
> > > entirely perfect <G>.
> > >
> > > Thanks again
> > > Erick
> > >
> > > On 10/6/06, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > >>
> > >> On Friday 06 October 2006 14:37, Erick Erickson wrote:
> > >> ...
> > >> > Fortunately, the PM agrees that it's silly to think about span
> queries
> > >> > involving OR or NOT for this app. So I'm left with something like
> Jo*n
> > >> AND
> > >> > sm*th AND jon?es WITHIN 6.
> > >>
> > >> OR works much the same as term expansion for wildcards.
> > >>
> > >> > The only approach that's occurred to me is to create a filter on
> > >> for the
> > >> > terms, giving me a subset of my docs that have any terms
> satisfying
> > >> the
> > >> > above. For each doc in the filter, get creative with
> > >> TermPositionVector
> > >> for
> > >> > determining whether the document matches. It seems that this
> would
> > >> involve
> > >> > creating a list of all positions in each doc in my filter that
> match
> > >> jo*n,
> > >> > another for sm*th, and another for jon?es and seeing if the
> distance
> > >> > (however I define that) between any triple of terms (one from
> each
> > >> list)
> > >> is
> > >> > less than 6.
> > >>
> > >> > My gut feel is that this explodes time-wise based upon the number
> of
> > >> terms
> > >> > that match. In this particular application, we are indexing 20K
> books.
> > >> Based
> > >> > on indexing 4K of them, this amounts to about a 4G index
> (although I
> > >> > acutally expect this to be somewhat larger since I haven't
> indexed all
> > >> the
> > >> > fields, just the text so far). I can't imagine that comparing the
> > >> expanded
> > >> > terms for, say, 10,000 docs will be fast. I'm putting together an
> > >> experiment
> > >> > to test this though.
> > >> >
> > >> > But someone could save me a lot of work by telling me that this
> is
> > >> solved
> > >> > already. This is your chance <G>......
> > >>
> > >> It's solved :) here:
> > >> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/
> > >>
> > >> The surround query language uses only the spans package for
> > >> WITHIN like queries, no filters.
> > >> You may not want to use the parser, but all the rest could be
> handy.
> > >>
> > >> > The expanding queries (e.g. PrefixQuery, RegexQuery,
> WildcardQuery)
> > >> all
> > >> blow
> > >> > up with TooManyClauses, and I've tried upping the MaxClauses
> field but
> > >> that
> > >> > takes forever and *then* blows up. Even with -Xmx set as high as
> I
> > >> can.
> > >>
> > >> The surround language has its own limitation on the maximum number
> > >> of terms expanded for wildcards, and it works nicely even for
> rather
> > >> high numbers of terms (thousands) for WITHIN like queries,
> > >> given enough RAM.
> > >>
> > >> It shouldn't be too difficult to add NOT queries within WITHIN,
> > >> there already is a SpanNotQuery in Lucene to map onto.
> > >>
> > >> Regards,
> > >> Paul Elschot
> > >>
> > >>
> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >> For additional commands, e-mail: [EMAIL PROTECTED]
> > >>
> > >>
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Reply via email to