I was trying to apply both org.apache.solr.analysis.WordDelimiterFilter and org.apache.lucene.analysis.ngram.NGramTokenFilter.
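For concreteness, the way a filter chain works, one filter wrapping another stream decorator-style and possibly emitting several tokens per input token, can be sketched in plain Java like this (illustrative only; these are not the actual Lucene/Solr classes, and the real WordDelimiterFilter also tracks offsets and position increments):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Minimal model of a token stream: next() returns the next term,
// or null at end of stream (like the old Lucene next() contract).
interface TokStream {
    String next();
}

// A "tokenizer": just replays a fixed list of terms.
class ListTokenizer implements TokStream {
    private final Iterator<String> it;
    ListTokenizer(List<String> terms) { this.it = terms.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
}

// A "filter" wraps another stream and splits each incoming token on '-',
// emitting the parts one by one -- roughly what WordDelimiterFilter
// does when it splits "wi-fi" into "wi" and "fi".
class DelimiterSplitFilter implements TokStream {
    private final TokStream input;
    private final Deque<String> pending = new ArrayDeque<>();
    DelimiterSplitFilter(TokStream input) { this.input = input; }
    public String next() {
        while (pending.isEmpty()) {
            String t = input.next();
            if (t == null) return null;        // wrapped stream exhausted
            for (String part : t.split("-"))
                if (!part.isEmpty()) pending.add(part);
        }
        return pending.poll();
    }
}
```

Chaining a second filter is just another layer of wrapping, which is exactly why the flat stream loses track of which output tokens came from the same input token.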
Can I achieve this with Lucene's TokenStream?

While thinking about TokenFilters, I came to the idea that a TokenStream should have a structured representation, much like what we do with an XML SAX reader: XML is a serialized character stream, yet it also carries a structure.

Here is an example of what happens in the TokenFilter process. Suppose we have one RAW term:

----------------------
[RAW]
 term
----------------------

It will be tokenized to:

----------------------
[TOKENIZED1]
         termB1
 termA <        > termC
         termB2
----------------------

and the next token filter may tokenize that to:

----------------------
[TOKENIZED2]
         termB1-1 - termB1-2
 termA <                     > termC
         termB2
----------------------

and the next token filter may tokenize that to:

----------------------
[TOKENIZED3]
          termB1-1-1
         <           > - termB1-2
          termB1-1-2
 termA <                          > termC
         termB2
----------------------

Then, what should we do when indexing and querying with such a TokenStream? I read the code and see that the current Lucene implementation can handle a TOKENIZED1 query (in org.apache.lucene.queryParser.QueryParser#getFieldQuery), but it can't handle TOKENIZED2 or TOKENIZED3. ... is this right?

One solution may be to use Token.type to describe the structure, like this:

<token type="and">
  <token type="word" value="termA"/>
  <token type="or">
    <token type="and">
      <token type="or">
        <token type="word" value="termB1-1-1"/>
        <token type="word" value="termB1-1-2"/>
      </token>
      <token type="word" value="termB1-2"/>
    </token>
    <token type="word" value="termB2"/>
  </token>
  <token type="word" value="termC"/>
</token>

Another solution may be to add an internal flag table to Token.flag or TokenStream to describe the TokenStream structure.

Does anybody have suggestions?

----
Hiroaki Kawai

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
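The nested structure proposed above could be modeled with a small tree type whose inner nodes are "and"/"or" and whose leaves are terms. This is an illustrative sketch in plain Java, not an existing Lucene class:

```java
import java.util.List;
import java.util.stream.Collectors;

// A structured token: either a "word" leaf carrying a term,
// or an "and"/"or" node over child tokens. Mirrors the XML sketch above.
class TokenNode {
    final String type;           // "and", "or", or "word"
    final String value;          // term text for "word" nodes, else null
    final List<TokenNode> kids;

    TokenNode(String type, String value, List<TokenNode> kids) {
        this.type = type; this.value = value; this.kids = kids;
    }
    static TokenNode word(String v)        { return new TokenNode("word", v, List.of()); }
    static TokenNode and(TokenNode... k)   { return new TokenNode("and", null, List.of(k)); }
    static TokenNode or(TokenNode... k)    { return new TokenNode("or",  null, List.of(k)); }

    // Render in a query-like syntax so the structure is visible.
    public String toString() {
        if (type.equals("word")) return value;
        String op = type.equals("and") ? " AND " : " OR ";
        return kids.stream().map(TokenNode::toString)
                   .collect(Collectors.joining(op, "(", ")"));
    }
}
```

The TOKENIZED3 picture would then be built as `TokenNode.and(word("termA"), or(and(or(word("termB1-1-1"), word("termB1-1-2")), word("termB1-2")), word("termB2")), word("termC"))`, and a query builder could walk the tree instead of guessing the structure from a flat token sequence.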