Hi Wulf, may I ask whether you're dealing with structured documentation (like XML or SGML)? I ask because I also work with technical documentation and we do exactly what you're asking for, but with XML data.
On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin <bersc...@dosco.de> wrote:
> Hi,
>
> I'm poking in the dark and hope someone has some light...
>
> We have part numbers in technical documentation to retrieve. For now we
> have a (long) regular expression to find those in a string. The part numbers
> have letters, digits and (redundant) whitespace. Furthermore, authors often
> used a compressed notation for number ranges with dashes or slashes, like
> A123-56 or A123/4.
>
> When searching for part numbers, users should be able to enter specific
> numbers like A126 (then the text "A123-56" should be found too) or wildcard
> searches like "A12?" or "A*". This part number search is a separate feature
> apart from regular full text search.
>
> As far as I see I have to
>
> - add an extra field for storing part numbers
>
> - create a Tokenizer which recognizes just the part numbers and skips all
> other text
>
> - create an Analyzer which expands ranges like A123-56 to A123, A124, ...,
> A156 and normalizes numbers by removing whitespace
>
> With this analyzer I hope to get the highlighting to work too (e.g.
> "A123-56" highlighted when "A126" was the search term).
>
> Is this the right way? What could I use as a starting point? (I found
> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer which does much
> more than I need...)
>
> Thanks for all hints!
>
> Wulf
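
For reference, the range expansion described in the quoted message could be sketched in plain Java roughly as follows. This is only an illustration under assumptions, not Wulf's code and not a Lucene API: the class name PartNumberRanges, the method expand, the regular expression, and the cap of 1000 entries are all hypothetical. It also assumes that a part number is letters followed by digits, that both "-" and "/" denote an inclusive range, and that an abbreviated end number replaces the trailing digits of the start number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands compressed part-number ranges such as "A123-56".
class PartNumberRanges {
    // Assumed notation: letters, a start number, then optionally '-' or '/'
    // followed by a (possibly abbreviated) end number.
    private static final Pattern RANGE =
            Pattern.compile("([A-Za-z]+)(\\d{1,9})(?:[-/](\\d{1,9}))?");

    // Arbitrary safety cap so a mistyped range cannot produce a huge list.
    private static final int MAX_RANGE = 1000;

    static List<String> expand(String raw) {
        List<String> out = new ArrayList<>();
        // Normalize by removing the redundant whitespace.
        String s = raw.replaceAll("\\s+", "");
        Matcher m = RANGE.matcher(s);
        if (!m.matches()) {
            return out; // not a part number in the assumed notation
        }
        String prefix = m.group(1);
        String startDigits = m.group(2);
        String endDigits = m.group(3);
        if (endDigits == null) {
            out.add(prefix + startDigits); // single number, nothing to expand
            return out;
        }
        // "123-56" means 123..156: borrow the leading digits from the start.
        if (endDigits.length() < startDigits.length()) {
            int keep = startDigits.length() - endDigits.length();
            endDigits = startDigits.substring(0, keep) + endDigits;
        }
        int start = Integer.parseInt(startDigits);
        int end = Integer.parseInt(endDigits);
        if (end < start || end - start > MAX_RANGE) {
            out.add(prefix + startDigits); // implausible range: keep the start only
            return out;
        }
        int width = startDigits.length(); // preserve leading zeros, if any
        for (int n = start; n <= end; n++) {
            out.add(prefix + String.format("%0" + width + "d", n));
        }
        return out;
    }
}
```

With these assumptions, PartNumberRanges.expand("A123-56") yields the 34 entries A123 through A156 (so a search for A126 would have something to match), and expand("A 123/4") yields A123 and A124. Inside Lucene, logic like this would typically sit in a token filter that emits the expanded numbers for the part-number field; how the expanded tokens should carry offsets so that the original text "A123-56" gets highlighted is left open here.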