Hi, >There'll be at least two indexes, one "normal" one and another bloated one. >Dan suggested splitting, too, but, unfortunately, if users search for e.g. > >"9-Oxabicyclo[3.3.1]nona-2,6-diene" > >they don't want anything else than that substance, as opposed to > >"*-Oxabicyclo[3.3.1]nona*" > >where they'd be interested in substances from that family - whatever the >numbers are.
As Erik says, it seems that you basically want a PhraseQuery. At first glance you should tokenize on "-" to have the "words" of your "phrase". Then, when you have a number, i.e. "9", "[3.3.1]", "2,6", you could output "dummy tokens" that would take *one* position, say "NUMBER_ALONE", "NUMBER_BETWEEN_BACKETS", "COMMA_SEPARATED_NUMBERS"... or just "NUMBER". When you process : "hexahydronaphthalene" tokenize it as : "hexa hydro naphtal ene". "heptahydronaphthalin" tokenize it as : "hepta hydro naphtal in". The PhraseQuery "hydronaphtal", tokenized by the same tokenizer as "hydro naphtal" will match both in a PhraseQuery. Sorry, my knowledge in organic chemistry is far :-) and I wonder if these considerations are not over-simplistic. Maybe your "same position" approach (solved by a "token stack" algorithm) is the least worse one. Exciting analysis problematics however, Cherrs, p.b. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
