I found the main issue.
I was using BytesRef without its offset and length. This fixed the problem:
String word = new String(ref.bytes, ref.offset, ref.length);
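
This also explains the strange sourceWords mentioned in my original
mail below (lordne, applee, solres): the words iterator reuses a
single byte[] across terms, so a shorter term overwrites only the
first bytes of the buffer, and new String(ref.bytes) picks up stale
bytes left over from the previous, longer term. Terms are enumerated
in sorted order, so for example "iphone" is read just before "lord".
A minimal standalone sketch of the effect (no Lucene needed; the
buffer handling simulates what the iterator does internally):

    import java.nio.charset.StandardCharsets;

    public class SharedBufferDemo {
        public static void main(String[] args) {
            // The previous term "iphone" filled the shared buffer with 6 bytes.
            byte[] buffer = "iphone".getBytes(StandardCharsets.UTF_8);

            // The next term "lord" overwrites only the first 4 bytes; the
            // trailing "ne" is stale, but it is still in the array.
            byte[] lord = "lord".getBytes(StandardCharsets.UTF_8);
            System.arraycopy(lord, 0, buffer, 0, lord.length);

            System.out.println(new String(buffer, StandardCharsets.UTF_8));
            // prints "lordne" -- the bug
            System.out.println(new String(buffer, 0, lord.length,
                    StandardCharsets.UTF_8));
            // prints "lord"   -- the fix
        }
    }

As far as I can tell from the javadocs, BytesRef.utf8ToString() does
the same slicing and additionally decodes the bytes as UTF-8 (my fix
above uses the platform default charset), so it is probably the safer
call for non-ASCII terms.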
Thank you.
On Fri, Jun 22, 2012 at 6:26 PM, Mansour Al Akeel
<[email protected]> wrote:
> Hello all,
>
> I am trying to write a simple autosuggest functionality. I was
> looking at some autosuggest code, and came across this post:
> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
> I have been stuck on some strange words, trying to see how they
> are generated. Here's the Analyzer:
>
> public class AutoCompleteAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         TokenStream result = new StandardTokenizer(Version.LUCENE_36, reader);
>         result = new EdgeNGramTokenFilter(result,
>                 EdgeNGramTokenFilter.Side.FRONT, 1, 20);
>         return result;
>     }
> }
>
> And this is the relevant method that does the indexing. It is
> called as reindexOn("title");
>
> private void reindexOn(String keyword) throws CorruptIndexException,
>         IOException {
>     log.info("indexing on " + keyword);
>     Analyzer analyzer = new AutoCompleteAnalyzer();
>     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
>             analyzer);
>     IndexWriter analyticalWriter = new IndexWriter(suggestIndexDirectory,
>             config);
>     analyticalWriter.commit(); // needed to create the initial index
>     IndexReader indexReader = IndexReader.open(productsIndexDirectory);
>     Map<String, Integer> wordsMap = new HashMap<String, Integer>();
>     LuceneDictionary dict = new LuceneDictionary(indexReader, keyword);
>     BytesRefIterator iter = dict.getWordsIterator();
>     BytesRef ref = null;
>     while ((ref = iter.next()) != null) {
>         String word = new String(ref.bytes);
>         int len = word.length();
>         if (len < 3) {
>             continue;
>         }
>         if (wordsMap.containsKey(word)) {
>             String msg = "Word " + word + " Already Exists";
>             throw new IllegalStateException(msg);
>         }
>         wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
>     }
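
The new String(ref.bytes) line in this loop is where the mangled
words came from. For reference, the corrected loop using
utf8ToString() as described above:

    while ((ref = iter.next()) != null) {
        String word = ref.utf8ToString(); // decodes only bytes[offset, offset + length)
        if (word.length() < 3) {
            continue;
        }
        if (wordsMap.containsKey(word)) {
            throw new IllegalStateException("Word " + word + " Already Exists");
        }
        wordsMap.put(word, indexReader.docFreq(new Term(keyword, word)));
    }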
>
>     for (String word : wordsMap.keySet()) {
>         Document doc = new Document();
>         Field field = new Field(SOURCE_WORD_FIELD, word,
>                 Field.Store.YES, Field.Index.NOT_ANALYZED);
>         doc.add(field);
>         field = new Field(GRAMMED_WORDS_FIELD, word,
>                 Field.Store.YES, Field.Index.ANALYZED);
>         doc.add(field);
>         String count = Integer.toString(wordsMap.get(word));
>         field = new Field(COUNT_FIELD, count,
>                 Field.Store.NO, Field.Index.NOT_ANALYZED); // count
>         doc.add(field);
>         analyticalWriter.addDocument(doc);
>     }
>     analyticalWriter.commit();
>     analyticalWriter.close();
>     indexReader.close();
> }
>
> private static final String GRAMMED_WORDS_FIELD = "words";
> private static final String SOURCE_WORD_FIELD = "sourceWord";
> private static final String COUNT_FIELD = "count";
>
> And now, my unit test:
>
> @BeforeClass
> public static void setUp() throws CorruptIndexException, IOException {
>     String idxFileName = "myIndexDirectory";
>     Indexer indexer = new Indexer(idxFileName);
>     indexer.addDoc("Apache Lucene in Action");
>     indexer.addDoc("Lord of the Rings");
>     indexer.addDoc("Apache Solr in Action");
>     indexer.addDoc("apples and Oranges");
>     indexer.addDoc("apple iphone");
>     indexer.reindexKeywords();
>     search = new SearchEngine(idxFileName);
> }
>
> The strange part is that, looking inside the index, I found
> sourceWords like (lordne, applee, solres). I understand that the
> ngram filter will produce prefixes of each word. Ex:
>
> l
> lo
> lor
> lord
>
> All of these go into one field, but what about "lordne" and
> "solres"? I checked the docs for this, and looked into Jira, but
> didn't find anything relevant.
> Is there something I am missing?
>
> I understand there could be easier ways to create this functionality
> (http://wiki.apache.org/lucene-java/SpellChecker), but I would like
> to resolve this issue, and to understand whether I am doing
> something wrong.
>
> Thank you in advance.