Oh, okay. For the XML part we use Apache Digester and define rules that
match the correct elements. But I can't tell what the best way to proceed
is in your case, sorry. The steps you listed here sound reasonable to me.
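
For the untagged text, the recognition and normalization steps might look
roughly like the sketch below. It is plain Java with no Lucene classes, and
the pattern is a made-up, much simplified stand-in for your long regular
expression (one uppercase letter, digits, and an optional '-' or '/'
suffix); PartNumberScanner and Match are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartNumberScanner {
    // Assumed, simplified notation: letter, optional whitespace, digits,
    // optionally followed by '-' or '/' and more digits (compressed range).
    private static final Pattern PART =
            Pattern.compile("\\b[A-Z]\\s*\\d+(?:\\s*[-/]\\s*\\d+)?");

    /** A recognized part number: normalized text plus its position in the input. */
    static final class Match {
        final String text;
        final int start;
        final int end;

        Match(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Finds part numbers, skips all other text, and strips redundant whitespace. */
    static List<Match> scan(String input) {
        List<Match> out = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            String normalized = m.group().replaceAll("\\s+", "");
            // Offsets refer to the original text so highlighting can use them.
            out.add(new Match(normalized, m.start(), m.end()));
        }
        return out;
    }
}
```

Keeping the offsets of the original (not the normalized) text is what would
let a highlighter mark the part number exactly as the author wrote it.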

If you want to get search hits for a part number range and highlight
'A123-56' when searching for A124, you would need to create a new token for
each number in the range (A124 and so on). Each new token gets its own term
text, but all the other information (offsets, docId, ...) is copied from the
'A123-56' token (I think).
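
To make that concrete, here is a minimal sketch in plain Java. Token is a
hypothetical stand-in for whatever your analysis chain produces, not a
Lucene class; in Lucene itself this logic would presumably live in a custom
TokenFilter. I am assuming that a dash means the whole range and a slash
means just the two listed numbers, and that the digits after the separator
replace the trailing digits of the first number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartRangeExpander {
    /** Hypothetical token: term text plus the offsets used for highlighting. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Assumed notation: letters, first number, '-' or '/', replacement digits.
    private static final Pattern RANGE =
            Pattern.compile("([A-Z]+)(\\d{1,9})([-/])(\\d{1,9})");

    /** Expands "A123-56" into A123..A156; every new token keeps the source offsets. */
    static List<Token> expand(Token source) {
        List<Token> out = new ArrayList<>();
        Matcher m = RANGE.matcher(source.term);
        if (!m.matches() || m.group(4).length() > m.group(2).length()) {
            out.add(source); // not a compressed range, pass through unchanged
            return out;
        }
        String prefix = m.group(1);
        String first = m.group(2);
        String suffix = m.group(4);
        int from = Integer.parseInt(first);
        // The suffix replaces the trailing digits: "123" + "56" -> 156.
        int to = Integer.parseInt(
                first.substring(0, first.length() - suffix.length()) + suffix);
        if (to < from) {
            out.add(source); // malformed range, leave it alone
            return out;
        }
        boolean wholeRange = m.group(3).equals("-");
        for (int n = from; n <= to; n++) {
            if (wholeRange || n == from || n == to) {
                // Only the term text differs; offsets are copied from the source.
                String term = prefix + String.format("%0" + first.length() + "d", n);
                out.add(new Token(term, source.startOffset, source.endOffset));
            }
        }
        return out;
    }
}
```

Because all expanded tokens share the offsets of 'A123-56', a hit on any of
them (A124, A126, ...) would highlight the original text span.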


On Fri, Jan 28, 2011 at 1:45 PM, Wulf Berschin <bersc...@dosco.de> wrote:

> Hi Karolina,
>
> yes (of course!) We have an XML element for the part numbers, but up to now
> they are not all tagged, so we need regex matching as well...
>
> On 28.01.2011 at 13:31, Karolina Bernat wrote:
>
>> Hi Wulf,
>>
>> can I ask whether it is structured documentation (like XML or SGML) you're
>> dealing with? I ask because I also work with technical documentation and we
>> do exactly what you're asking for, but with XML data.
>>
>>
>> On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin<bersc...@dosco.de>  wrote:
>>
>>  Hi,
>>>
>>> I'm poking in the dark and hope someone has some light...
>>>
>>> We need to retrieve part numbers from technical documentation. For now we
>>> have a (long) regular expression to find those in a string. The part
>>> numbers have letters, digits and (redundant) whitespace. Furthermore,
>>> authors often used a compressed notation for number ranges with dashes or
>>> slashes, like A123-56 or A123/4.
>>>
>>> When searching for part numbers, users should be able to enter specific
>>> numbers like A126 (then the text "A123-56" should be found too) or
>>> wildcard searches like "A12?" or "A*". This part number search is a
>>> separate feature apart from regular full text search.
>>>
>>> As far as I can see, I have to
>>>
>>> - add an extra field for storing part numbers
>>>
>>> - create a Tokenizer which recognizes just the part numbers and skips all
>>> other text
>>>
>>> - create an Analyzer which expands ranges like A123-56 to A123, A124,
>>> ..., A156 and normalizes numbers by removing whitespace
>>>
>>> With this analyzer I hope to get the highlighting to work too (e.g.
>>> "A123-56" highlighted when "A126" was the search term).
>>>
>>> Is this the right way? What could I use as starting point (I found
>>> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer which does much
>>> more than I need...)
>>>
>>> Thanks for all hints!
>>>
>>> Wulf
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>
> --
>
> Best regards,
>
> Wulf Berschin
>
> --
>
> <!-- *****************************************************************
> * Wulf Berschin                            Telefon: +49 6221 1486 16 *
> * DOSCO Document Systems Consulting GmbH   Telefax: +49 6221 1486 19 *
> * Mannheimer Strasse 1                     E-Mail: bersc...@dosco.de *
> * 69115 Heidelberg, Germany                http://www.dosco.de       *
> * Handelsregister: Heidelberg HRB 335122                             *
> * Geschäftsführung: Robert Erfle                                     *
> ****************************************************************** -->
>
