Hi Wulf,

may I ask whether the documentation you're dealing with is structured
(XML or SGML, for example)? I ask because I also work with technical
documentation, and we do exactly what you're asking for, but on XML data.


On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin <bersc...@dosco.de> wrote:

> Hi,
>
> I'm poking in the dark and hope someone has some light...
>
> We have part numbers in technical documentation to retrieve. For now we
> have a (long) regular expression to find those in a string. The part numbers
> have letters, digits and (redundant) whitespace. Furthermore authors often
> used a compressed notation for number ranges with dashes or slashes, like
> A123-56 or A123/4.
>
> When searching for part numbers users should be able to enter specific
> numbers like A126 (then the text "A123-56" should be found too) or wildcard
> searches like "A12?" or "A*". This part number search is a separate feature
> apart from regular full text search.
>
> As far as I see I have to
>
> - add an extra field for storing part numbers
>
> - create a Tokenizer which recognizes just the part numbers and skips all
> other text
>
> - create an Analyzer which expands ranges like A123-56 to A123, A124, ...,
> A156 and normalizes numbers by removing whitespace
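
On the range expansion itself, here is a rough sketch in plain Java,
independent of Lucene, of how a compressed notation could be turned into
individual part numbers. The names PartNumberRanges and expand are made up
for illustration. I am assuming that a part number is letters followed by
digits, that the digits after the dash or slash replace the same number of
trailing digits of the start number, and that a slash means the same
inclusive range as a dash. Your real notation may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
public class PartNumberRanges {

    // Letters, start digits, then optionally a dash or slash followed by
    // the digits that replace the tail of the start number.
    // Assumption: at most 9 digits, so the values fit into an int.
    private static final Pattern COMPRESSED =
        Pattern.compile("([A-Za-z]+)(\\d{1,9})(?:[-/](\\d{1,9}))?");

    /** Expands e.g. "A123-56" to A123, A124, ..., A156. */
    public static List<String> expand(String raw) {
        List<String> result = new ArrayList<>();

        // Normalize: drop all (redundant) whitespace first.
        String text = raw.replaceAll("\\s+", "");
        Matcher m = COMPRESSED.matcher(text);
        if (!m.matches()) {
            return result; // not a part number under the assumed format
        }

        String prefix = m.group(1);
        String startDigits = m.group(2);
        String tail = m.group(3);

        if (tail == null) {
            // Plain part number without a range.
            result.add(prefix + startDigits);
            return result;
        }
        if (tail.length() > startDigits.length()) {
            return result; // tail cannot replace more digits than exist
        }

        // "123" with tail "56" becomes the end number "156".
        String endDigits =
            startDigits.substring(0, startDigits.length() - tail.length()) + tail;
        int start = Integer.parseInt(startDigits);
        int end = Integer.parseInt(endDigits);

        // Keep leading zeros by padding to the width of the start number.
        // An end below the start yields an empty list.
        String format = "%0" + startDigits.length() + "d";
        for (int n = start; n <= end; n++) {
            result.add(prefix + String.format(format, n));
        }
        return result;
    }
}
```

With this, expand("A123-56") contains "A126", and expand("A 123 / 4")
yields A123 and A124. Inside Lucene this logic would probably live in a
TokenFilter that emits every expanded number with the offsets of the
original text, which is what highlighting relies on, but I have not
sketched that part here.
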
>
> With this analyzer I hope to get the highlighting to work too (e.g.
> "A123-56" highlighted when "A126" was the search term).
>
> Is this the right way? What could I use as starting point (I found
> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer which does much
> more than I need...)
>
> Thanks for all hints!
>
> Wulf
