Hello all, I am trying to write a simple autosuggest feature. While looking at some autosuggest code I came across this post: http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene Since then I have been stuck on some strange words in the index, trying to understand how they are generated. Here's the Analyzer:
public class AutoCompleteAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
        result = new EdgeNGramTokenFilter(result, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
        return result;
    }
}

And this is the relevant method that does the indexing; it is called as reindexOn("title"):

private void reindexOn(String keyword) throws CorruptIndexException, IOException {
    log.info("indexing on " + keyword);
    Analyzer analyzer = new AutoCompleteAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory, config);
    analyticalWriter.commit(); // needed to create the initial index

    IndexReader indexReader = IndexReader.open(productsIndexDirectory);
    Map<String, Integer> wordsMap = new HashMap<String, Integer>();
    LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
    BytesRefIterator iter = dict.getWordsIterator();
    BytesRef ref = null;
    while ((ref = iter.next()) != null) {
        String word = new String(ref.bytes);
        int len = word.length();
        if (len < 3) {
            continue;
        }
        if (wordsMap.containsKey(word)) {
            String msg = "Word " + word + " Already Exists";
            throw new IllegalStateException(msg);
        }
        wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
    }

    for (String word : wordsMap.keySet()) {
        Document doc = new Document();
        Field field = new Field(SOURCE_WORD_FIELD, word, Field.Store.YES, Field.Index.NOT_ANALYZED);
        doc.add(field);
        field = new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES, Field.Index.ANALYZED);
        doc.add(field);
        String count = Integer.toString(wordsMap.get(word));
        field = new Field(COUNT_FIELD, count, Field.Store.NO, Field.Index.NOT_ANALYZED); // count
        doc.add(field);
        analyticalWriter.addDocument(doc);
    }

    analyticalWriter.commit();
    analyticalWriter.close();
    indexReader.close();
}

The field names are:

private static final String GRAMMED_WORDS_FIELD
= "words";
private static final String SOURCE_WORD_FIELD = "sourceWord";
private static final String COUNT_FIELD = "count";

And now, my unit test setup:

@BeforeClass
public static void setUp() throws CorruptIndexException, IOException {
    String idxFileName = "myIndexDirectory";
    Indexer indexer = new Indexer(idxFileName);
    indexer.addDoc("Apache Lucene in Action");
    indexer.addDoc("Lord of the Rings");
    indexer.addDoc("Apache Solr in Action");
    indexer.addDoc("apples and Oranges");
    indexer.addDoc("apple iphone");
    indexer.reindexKeywords();
    search = new SearchEngine(idxFileName);
}

The strange part: looking into the index I found sourceWords such as "lordne", "applee", and "solres". I understand that the edge n-gram produces the leading parts of each word, e.g. for "lord": l, lo, lor, lord. All of those go into one field, but where do "lordne" and "solres" come from? I checked the docs for this and looked into Jira, but didn't find relevant info. Is there something I am missing? I understand there could be easier ways to build this functionality (http://wiki.apache.org/lucene-java/SpellChecker), but I would like to resolve this issue and understand whether I am doing something wrong. Thank you in advance.
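To make my understanding concrete, here is a self-contained sketch in plain Java (no Lucene; the class and method names are mine) of the two pieces involved: what I expect a FRONT-side edge n-gram with min 1 / max 20 to emit for a token, and how decoding a reused byte buffer without an offset/length could splice a short term together with leftovers from a longer one, producing exactly strings like "lordne":

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SuggestSketch {

    // My mental model of EdgeNGramTokenFilter(Side.FRONT, minGram, maxGram):
    // every prefix of the token from minGram up to maxGram characters long.
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Writes `current` over the start of a buffer that still holds `previous`,
    // then decodes the WHOLE buffer -- the way new String(ref.bytes) would
    // behave if the byte array were reused between terms and ref.offset /
    // ref.length were ignored.
    static String decodeWholeBuffer(String previous, String current) {
        byte[] buffer = previous.getBytes(StandardCharsets.UTF_8);
        byte[] cur = current.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(cur, 0, buffer, 0, cur.length);
        return new String(buffer, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("lord", 1, 20));       // [l, lo, lor, lord]
        System.out.println(decodeWholeBuffer("lucene", "lord"));  // lordne
        System.out.println(decodeWholeBuffer("lucene", "apple")); // applee
        System.out.println(decodeWholeBuffer("apples", "solr"));  // solres
    }
}
```

If that is indeed the mechanism, decoding with new String(ref.bytes, ref.offset, ref.length, "UTF-8") (or BytesRef.utf8ToString()) instead of new String(ref.bytes) should yield the intact terms -- but this is only my guess at what is happening, not something I have confirmed against the Lucene internals.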