Re: SPAM(5.0) Re: How to index part numbers

Karolina Bernat Fri, 28 Jan 2011 05:33:22 -0800

Oh, okay. For the XML part we use Apache Digester and define rules to
capture the correct elements. But I can't tell what the best way to
proceed is in your case, sorry. The steps you listed here sound reasonable
to me.


If you want to get search hits for a part number range and highlight
'A123-56' when searching for A124, you would need to create a new token for
each number in the range (A124, for example) and copy all the information
(like offset, docId ..) from the 'A123-56' token to each of the new tokens,
changing only the term text (I think..).
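
A minimal sketch of that idea, written in plain Python rather than against
the Lucene API, so it only illustrates the token bookkeeping. The pattern
PART_RE, the Token type and the function expand_part_numbers are hypothetical
names, and the sketch assumes that part numbers are uppercase letters followed
by digits and that both '-' and '/' mark an inclusive range whose suffix
replaces the last digits of the first number. The docId is left out because
the index would assign it.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str   # normalized part number, e.g. "A124"
    start: int  # start offset of the original text span
    end: int    # end offset (exclusive) of the original text span


# Hypothetical pattern: letters, digits, then an optional "-" or "/" suffix.
# Redundant whitespace between the pieces is tolerated and dropped.
PART_RE = re.compile(r"([A-Z]+)\s*(\d+)(?:\s*[-/]\s*(\d+))?")


def expand_part_numbers(text: str) -> List[Token]:
    """Emit one token per part number; range members share the range's offsets."""
    tokens = []
    for m in PART_RE.finditer(text):
        prefix, first, suffix = m.group(1), m.group(2), m.group(3)
        low = int(first)
        high = low
        if suffix is not None and len(suffix) <= len(first):
            # "123-56": the suffix replaces the last digits of the first number.
            high = int(first[: len(first) - len(suffix)] + suffix)
        # A reversed range (high < low) falls back to the first number only.
        for n in range(low, max(low, high) + 1):
            term = f"{prefix}{n:0{len(first)}d}"
            # Every expanded term keeps the offsets of the original text span.
            tokens.append(Token(term, m.start(), m.end()))
    return tokens
```

For the text "see A123-56 here" this yields the tokens A123 through A156, all
with offsets (4, 11), so a hit on A124 would point a highlighter at the
original 'A123-56'.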


On Fri, Jan 28, 2011 at 1:45 PM, Wulf Berschin <[email protected]> wrote:

> Hi Karolina,
>
> yes (of course!) We have an XML element for the part numbers, but up to now
> they are not all tagged, so we need regex matching as well...
>
> On 28.01.2011 13:31, Karolina Bernat wrote:
>
>> Hi Wulf,
>>
>> can I ask whether it is structured documentation (like XML or SGML) you're
>> dealing with? I ask because I also work with technical documentation and we
>> do exactly what you're asking for, but with XML data.
>>
>>
>> On Fri, Jan 28, 2011 at 1:05 PM, Wulf Berschin<[email protected]>  wrote:
>>
>>  Hi,
>>>
>>> I'm poking in the dark and hope someone has some light...
>>>
>>> We have part numbers in technical documentation to retrieve. For now we
>>> have a (long) regular expression to find those in a string. The part
>>> numbers have letters, digits and (redundant) whitespace. Furthermore,
>>> authors often used a compressed notation for number ranges with dashes
>>> or slashes, like A123-56 or A123/4.
>>>
>>> When searching for part numbers, users should be able to enter specific
>>> numbers like A126 (then the text "A123-56" should be found too) or
>>> wildcard searches like "A12?" or "A*". This part number search is a
>>> separate feature apart from regular full text search.
>>>
>>> As far as I can see, I have to
>>>
>>> - add an extra field for storing part numbers
>>>
>>> - create a Tokenizer which recognizes just the part numbers and skips all
>>> other text
>>>
>>> - create an Analyzer which expands ranges like A123-56 to A123, A124,
>>> ..., A156 and normalizes numbers by removing whitespace
>>>
>>> With this analyzer I hope to get the highlighting to work too (e.g.
>>> "A123-56" highlighted when "A126" was the search term).
>>>
>>> Is this the right way? What could I use as starting point (I found
>>> org.apache.lucene.analysis.miscellaneous.PatternAnalyzer which does much
>>> more than I need...)
>>>
>>> Thanks for all hints!
>>>
>>> Wulf
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>>
>
> --
>
> Kind regards,
>
> Wulf Berschin
>
> --
>
> <!-- *****************************************************************
> * Wulf Berschin                            Telefon: +49 6221 1486 16 *
> * DOSCO Document Systems Consulting GmbH   Telefax: +49 6221 1486 19 *
> * Mannheimer Strasse 1                     E-Mail: [email protected] *
> * 69115 Heidelberg, Germany                http://www.dosco.de       *
> * Handelsregister: Heidelberg HRB 335122                             *
> * Geschäftsführung: Robert Erfle                                     *
> ****************************************************************** -->
>
