Mariella Di Giacomo wrote:
Hi ALL,
We are trying to index scientific articles written in English, but whose authors' names can be spelled in any language (depending on the author's nationality).
E.g. Schäffer
In the XML document that we provide to Lucene the author name is written in the following way (using HTML ENTITIES)
Sch&auml;ffer
So in practice that is the name that would be given to a Lucene analyzer/filter
Is there an existing analyzer that would take that name (Sch&auml;ffer, or any other name that contains entities) so that
the Lucene index could be searched (once the field has been indexed) for the real version of the name, which is
Schäffer
and the English-spelled version of the name, which is
Schaffer
Thanks a lot in advance for your help,
If I understand the question then I think there are 2 ways of doing it.
[1] Write a custom analyzer that uses Token.setPositionIncrement(0) to put alternate spellings at the same position in the token stream. This way phrase matches work right (so the queries "Jonathan Schaffer" and "Jonathan Schäffer" will both match the same phrase in the doc).
[2] Do not use a special analyzer - instead do query expansion, so if they search for "Schaffer" then the generated query is (Schaffer Schäffer).
I've used both techniques before - I use #1 w/ a "JavadocAnalyzer" on searchmorph.com so that if you search for "hash" you'll see matches for "HashMap", as "HashMap" is tokenized into 3 tokens at the same location ('hash', 'map', 'hashmap'). Writing this kind of analyzer can be a bit of a hassle, and the position increment of 0 might affect highlighting code or other (say, summarizing) code that uses the Analyzer.
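To make #1 a bit more concrete, here is a rough sketch of the idea in plain Java, deliberately not tied to any particular Lucene version. SketchToken and AlternateSpellingSketch are made-up names for illustration only; in a real analyzer the same logic would live in a TokenFilter. It also assumes that stripping diacritics (via java.text.Normalizer) is an acceptable way to get the "English" spelling, which gives Schaffer for Schäffer but may not be what you want for every language.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene token: the term text plus its
// position increment relative to the previous token.
final class SketchToken {
    final String text;
    final int positionIncrement;

    SketchToken(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class AlternateSpellingSketch {

    // Decompose accented characters (NFD) and drop the combining marks,
    // so "Schäffer" becomes "Schaffer". Assumption: plain accent stripping
    // is the desired "English" form.
    static String foldAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    // Split on whitespace; emit each word with increment 1 and, when the
    // folded form differs, emit it too with increment 0 so that both
    // spellings occupy the same position.
    static List<SketchToken> tokenize(String text) {
        List<SketchToken> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            out.add(new SketchToken(word, 1));
            String folded = foldAccents(word);
            if (!folded.equals(word)) {
                out.add(new SketchToken(folded, 0));
            }
        }
        return out;
    }
}
```

Because the folded spelling shares a position with the original, a phrase query written with either spelling lines up with the preceding "Jonathan" token.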
For an example of #2 see my Wordnet/Synonym query expansion example in the Lucene sandbox. You prebuild an index of synonyms (or in your case maybe just rules are fine). Then you need query expansion code that takes "Schaffer" and expands it to something like "Schaffer Schäffer^0.9" (if you want to assume the user probably spells the name right). Simple enough to code; the only hassle then is if you want to use the standard QueryParser...
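A minimal sketch of what the expansion step in #2 might look like, assuming you already have a lookup of known variants. This is not the sandbox code: QueryExpansionSketch, the variants map and its lowercase keys are all hypothetical, and the output is just a query string in the usual Lucene syntax (parentheses for grouping, ^ for boost) that you would still have to parse or turn into a BooleanQuery yourself.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class QueryExpansionSketch {

    // For each whitespace-separated term, look up alternate spellings
    // (keyed by the lowercased term). Terms with variants become a
    // parenthesized group; each variant carries the given boost.
    static String expand(String userQuery,
                         Map<String, List<String>> variants,
                         double boost) {
        StringBuilder sb = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            List<String> alts = variants.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.emptyList());
            if (alts.isEmpty()) {
                sb.append(term);
                continue;
            }
            sb.append('(').append(term);
            for (String alt : alts) {
                sb.append(' ').append(alt).append('^').append(boost);
            }
            sb.append(')');
        }
        return sb.toString();
    }
}
```

With a map from "schaffer" to ["Schäffer"] and a boost of 0.9, the query Schaffer comes out as (Schaffer Schäffer^0.9). Terms without an entry pass through unchanged, so the index itself needs no special analyzer.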
thx, Dave
Mariella