On Friday 14 November 2003 13:39, Chong, Herb wrote:

> you're describing ad-hoc solutions to a problem that have an effect, but
> not one that is easily predictable. one can concoct all sorts of
> combinations of the query operators that would have something of the effect
> that i am describing. crossing sentence boundaries, however, can't be done
Hmmh? You implied that there are some useful distance heuristics (words 5 or
more positions apart correlate much less), and others have pointed out that
Lucene has many useful components. Building a more complex system from small
components is usually considered a Good Thing (tm), not an "ad hoc solution".
In fact, I would guess most experienced people around here start with the
Lucene defaults and build their own systems by gradually customizing more and
more of the pieces. There may be actual fundamental problems with Lucene
regarding the approach you'd prefer, but I don't think it makes sense to brush
off suggestions regarding distance & fuzzy/sloppy queries by claiming they are
"just hacks".

> without having some sentence boundaries as a reference. on top of this,
> there is a relatively simple concept which, if implemented, takes away all
> the ad-hocness of the solutions and replaces it with a something that is
> both linguistically and mathematically sound and on top of which won't

As most people have pointed out, linguistics is nothing like an exact science,
and comparing it to mathematics sounds like apples vs. oranges to me. I'm not
even convinced one can use general terms like "linguistically sound",
especially as the content being indexed and searched is often a mixture of
natural and programming languages (at least with the knowledge bases I work
with). Now, if you (or anyone else) could build a more advanced query
mechanism, either on top of the Lucene fundamentals or as a modified version,
THAT would be useful. But it's more efficient to first consider the
suggestions, and especially WHAT WORKS, as opposed to arguing for what appears
to be the most elegant solution.

> materially make the engine core more complicated. that concept is that
> multiword queries are mostly multiword terms and they can't cross sentence
> boundaries according to the rules of English.

Which brings us back to the problem of detecting boundaries.
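To make the trade-off concrete, here is a rough sketch in Python (not Lucene
code; the function names and the slop value of 5 are my own illustration) of
the two heuristics under discussion: a naive punctuation-based sentence
splitter, and a flat token-distance check of the kind a sloppy phrase query
performs:

```python
import re

# Hypothetical illustration, not Lucene internals: a naive sentence splitter
# and a token-distance ("slop") check over a flat token stream.

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace -- inexact by design."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def within_slop(tokens, term_a, term_b, slop=5):
    """True if term_a and term_b occur within `slop` token positions."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= slop for a in pos_a for b in pos_b)

text = "Lucene scores terms by proximity. Sentence boundaries are separate."
sentences = split_sentences(text)   # two sentences

# With a flat token stream, "proximity" and "boundaries" fall within slop
# even though they sit in different sentences -- the trade-off under debate.
tokens = text.lower().replace('.', '').split()
crosses = within_slop(tokens, "proximity", "boundaries")
```

The splitter is inexact by design, and the distance check happily matches
across sentence boundaries; whether that is "plenty good enough" is exactly
the question of this thread.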
Punctuation can help; classifications of words can help; all of it is inexact
"science". Which just makes me wonder whether simply considering token
distances might be plenty good enough.

-+ Tatu +-

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
