Hi Mark,

Thanks for providing this original approach to synonyms. I read through your code and think it could be extended to handle the word stemming problem as well.
Here is my thought.

1) Before indexing, create a Map<String, ArrayList<String>> stemmedWordMap whose key is the stemmed word.

2) At indexing time, we still index each word as it is, but we also stem the word (using PorterStemmer) and insert into or update the stemmedWordMap with the mapping stemmedWord => word. For example, "lighting" and "lighted" will both be stored in the ArrayList under the key "light".

3) At query time, when someone searches for "lighting", we stem the word to "light" and look up the stemmedWordMap for the other words that share this stem. In this case, we find "lighted". Then we perform the search using the synonym search.

This way, we can combine synonyms and stemmed words. The nice part is that we only need to store the index with the original words, which saves disk space as well as indexing time.

However, I do have the following concerns:

1) As documents could be removed from the index, the stemmedWordMap needs to be kept up to date somehow. Could this be done by periodically rebuilding the stemmedWordMap?

2) Typically, people would like to see their exact match first. So the synonym search could be enhanced to take advantage of position-level boosting (payload for position), so that the search result for "lighting" ranks documents containing "lighting" higher than documents containing "lighted".

3) I am still not sure if this is the best approach in general. Does it make sense to keep two indexes, one with the original words indexed and the other with all words stemmed? Searches would then be run against both indexes.

4) How does Google perform this type of search? I guess the web search engines have a different approach. There may be no need for a stemmer at all. First, the set of web documents is huge, so searching for "lighting" will bring up enough results; who cares about bringing back results with "lighted"?
Second, the anchor texts that point to a web page of interest would contain all the variants (synonyms and stemmed words), so they don't need to worry about search results being incomplete. For example, search for "rectangular" in Google (http://www.google.com/search?hl=en&q=rectangular&btnG=Search) and the Wikipedia page comes up first. The page itself only contains "Rectangle"; however, if you click on the Cached link, you will see that "rectangular" appears in the anchor text that points to this page.

My ultimate question: if I want to build a search engine, as a general rule, what's the best way to do it? Mark, could you shed some light?

Thanks,

Jian

On 3/18/07, Mark Harwood (JIRA) <[EMAIL PROTECTED]> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]

Mark Harwood updated LUCENE-835:
--------------------------------

    Attachment: TestSynonymIndexReader.java

> An IndexReader with run-time support for synonyms
> -------------------------------------------------
>
>                 Key: LUCENE-835
>                 URL: https://issues.apache.org/jira/browse/LUCENE-835
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Mark Harwood
>         Assigned To: Mark Harwood
>         Attachments: Synonym.java, SynonymIndexReader.java, SynonymSet.java, TestSynonymIndexReader.java
>
>
> These classes provide support for enabling the use of synonyms for terms in an existing index.
> While Analyzers can be used at Query-parse time or Index-time to inject synonyms these are not always satisfactory means of providing support for synonyms:
> * Index-time injection of synonyms is less flexible because changing the lists of synonyms requires an index rebuild.
> * Query-parse-time injection is awkward because special support is required in the parser/query logic to recognise and cater for the tokens that appear in the same position. Additionally, any statistical analysis of the index content via TermEnum/TermDocs etc does not consider the synonyms unless specific code is added.
> What is perhaps more useful is a transparent wrapper for the IndexReader that provides a synonym-ized view of the index without requiring specialised support in the calling code. All of the TermEnum/TermDocs interfaces remain the same but behind the scenes synonyms are being considered/applied silently.
> The classes supplied here provide this "virtual" view of the index and all queries or other code that examines this index using the special reader benefit from this view without requiring specialized code. A Junit test illustrates this code in action.

--
This message is automatically generated by JIRA.
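
For illustration, the stemmedWordMap idea from the reply at the top of this message could be sketched roughly as follows. This is only a sketch under stated assumptions: the class name StemmedWordMapSketch and the methods addIndexedWord and variantsFor are hypothetical, a sorted Set is used in place of the ArrayList so that duplicates are dropped, and the stem() method is a toy suffix stripper standing in for a real stemmer such as PorterStemmer. It does not touch the Lucene index or the SynonymIndexReader classes at all; it only shows the stem-to-words bookkeeping.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class StemmedWordMapSketch {
    // stem -> every surface form seen at indexing time
    private final Map<String, Set<String>> stemmedWordMap = new HashMap<>();

    // ASSUMPTION: toy stand-in for a real stemmer such as PorterStemmer.
    // It strips a few common English suffixes and nothing more.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Index time: the word itself would be indexed unchanged elsewhere;
    // here we only record the mapping stem -> word.
    void addIndexedWord(String word) {
        stemmedWordMap
            .computeIfAbsent(stem(word), k -> new TreeSet<>())
            .add(word.toLowerCase());
    }

    // Query time: return the other indexed forms that share the query word's stem.
    Set<String> variantsFor(String queryWord) {
        Set<String> variants = new TreeSet<>(
            stemmedWordMap.getOrDefault(stem(queryWord), Collections.emptySet()));
        variants.remove(queryWord.toLowerCase());
        return variants;
    }

    public static void main(String[] args) {
        StemmedWordMapSketch map = new StemmedWordMapSketch();
        map.addIndexedWord("lighting");
        map.addIndexedWord("lighted");
        System.out.println(map.variantsFor("lighting")); // prints [lighted]
    }
}
```

The variants returned at query time would then be handed to the synonym search as if they were synonyms of the query word. Keeping the map in step with deletions from the index (concern 1 in the reply) is not addressed here.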