Hi,

I don't know the size of your dataset, but couldn't you index into two fields with a PerFieldAnalyzer, tokenizing with StandardAnalyzer for one field and WhitespaceAnalyzer for the other? Then use a multi-field query; there is a query parser for that, I just don't remember the name right now. Something like the sketch below.
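(A minimal sketch of that two-field setup, against the Lucene 2.x-era APIs current at the time. The multi-field parser is presumably MultiFieldQueryParser; the field names "content" and "content_ws" and the index path are invented for illustration.)

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    public class TwoFieldIndexing {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer for the default field; WhitespaceAnalyzer for a
            // second field, where punctuation stays attached to the terms.
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("content_ws", new WhitespaceAnalyzer());

            IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true);
            Document doc = new Document();
            String text = "foo, bar";
            // The same text is indexed twice, once per analyzer.
            doc.add(new Field("content", text,
                              Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("content_ws", text,
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Search both fields at once: "foo," can match via the
            // whitespace field, plain "foo" via the standard field.
            Query q = new MultiFieldQueryParser(
                new String[] { "content", "content_ws" }, analyzer)
                .parse("foo,");
            System.out.println(q);
        }
    }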
Patrick

On 10/1/07, John Byrne <[EMAIL PROTECTED]> wrote:
> WhitespaceAnalyzer does preserve those symbols, but not as tokens. It
> simply leaves them attached to the original term.
>
> As an example of what I'm talking about, consider a document that
> contains (without the quotes) "foo, ".
>
> Now, using WhitespaceAnalyzer, I could only find that document by
> searching for "foo,". Using StandardAnalyzer, or any analyzer that
> removes punctuation, I could only find it by searching for "foo".
>
> I want an analyzer that will allow me to find it if I build a phrase
> query with the term "foo" followed immediately by ",". After all, the
> comma may be relevant to the search, but it is definitely not part of
> the word.
>
> Extending StandardAnalyzer is what I had in mind, but I don't know
> where to start. I also wonder why no one seems to have done it before;
> it makes me suspect that there's some reason I haven't seen yet that
> makes it impossible or impractical.
>
> Karl Wettin wrote:
> >
> > On 1 Oct 2007, at 15:33, John Byrne wrote:
> >
> >> Has anyone written an analyzer that preserves punctuation and
> >> symbols ("£", "$", "%", etc.) as tokens?
> >
> > WhitespaceAnalyzer?
> >
> > You could also extend the lexical rules of StandardAnalyzer.
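(For the phrase-query approach John describes, one possible starting point is a tokenizer that emits each punctuation or symbol character as its own single-character token, rather than extending StandardAnalyzer's grammar. A rough, unoptimized sketch against the Lucene 2.x Token/next() API of the time; the class name PunctuationTokenizer and the field name "content" are invented for illustration.)

    import java.io.IOException;
    import java.io.Reader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    /** Emits words and single punctuation/symbol chars as separate tokens. */
    public class PunctuationTokenizer extends Tokenizer {
        // A run of word characters, or any single character that is
        // neither a word character nor whitespace (",", "£", "$", "%", ...).
        private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");
        private Matcher matcher;

        public PunctuationTokenizer(Reader reader) {
            super(reader);
        }

        public Token next() throws IOException {
            if (matcher == null) {
                // Read the whole field value up front; fine for a sketch,
                // wasteful for very large fields.
                StringBuffer sb = new StringBuffer();
                char[] buf = new char[1024];
                for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                    sb.append(buf, 0, n);
                }
                matcher = TOKEN.matcher(sb.toString());
            }
            return matcher.find()
                ? new Token(matcher.group(), matcher.start(), matcher.end())
                : null;
        }
    }

With "foo, " analyzed this way, the index holds the tokens [foo][,], so a phrase query for "foo" followed immediately by "," matches exactly the case John describes:

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("content", "foo"));
    pq.add(new Term("content", ","));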