I still need all the normal benefits of the StandardAnalyzer as far as punctuation and everything else goes, with just this one special exception. Since I was on a limited schedule, I ended up escaping these cases myself so that they get tokenized. It's certainly not the best solution, but it works as far as I can tell. If hacking the StandardAnalyzer grammar is fairly straightforward, then I will try to look into doing it that way, since I prefer to do things the right way if possible : ) Thanks, Grant!
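For anyone finding this thread later, the escaping approach can be sketched roughly as below. This is only an illustration, not the exact code in use: the class name PunctuationEscaper and the placeholder format (xxpunct + hex character code + xx) are made up for the example, and it assumes the analyzer keeps a lowercase alphanumeric string like xxpunct26xx as a single token, which should be verified against the analyzer actually in use. It uses only java.util.regex and does not touch any Lucene API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites punctuation that stands alone (whitespace or
// a string boundary on both sides) into an alphanumeric placeholder before
// indexing/searching, and restores it when displaying results.
final class PunctuationEscaper {
    // Assumed placeholder format; pick something that cannot occur naturally.
    private static final String PREFIX = "xxpunct";
    private static final String SUFFIX = "xx";

    // One ASCII punctuation character with no non-whitespace neighbour.
    private static final Pattern STANDALONE =
            Pattern.compile("(?<!\\S)(\\p{Punct})(?!\\S)");

    // A placeholder produced by escape(): prefix, two hex digits, suffix.
    private static final Pattern ESCAPED =
            Pattern.compile("(?<!\\S)" + PREFIX + "([0-9a-f]{2})" + SUFFIX + "(?!\\S)");

    static String escape(String text) {
        Matcher m = STANDALONE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Encode the character as hex so the token is purely alphanumeric.
            String hex = Integer.toHexString(m.group(1).charAt(0));
            m.appendReplacement(out, PREFIX + hex + SUFFIX);
        }
        m.appendTail(out);
        return out.toString();
    }

    static String unescape(String text) {
        Matcher m = ESCAPED.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and '\' in the output.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With this sketch, escape("AT & T") gives "AT xxpunct26xx T", while punctuation inside a word (as in "R&D") is left alone so the analyzer's usual handling still applies. The same escape call would have to run on query text as well as on indexed text for the two to match.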
- Greg

On Wed, Dec 10, 2008 at 8:52 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Let's take a quick step back and see if it helps.  Why do you feel you need
> the StandardAnalyzer to solve your problem?  What else are you gaining from
> it?  Would you be better served by a WhitespaceTokenizer?
>
> That being said, hacking up the grammar isn't as bad as you might think.
> There are actually two examples of the "grammar" in Lucene, one is the
> StdTokenizer and the other is the WikipediaTokenizer.  They are similar, but
> maybe by looking at two examples it might also help.
>
> On Dec 9, 2008, at 10:14 AM, Greg Shackles wrote:
>
>> Hey everyone,
>>
>> I'm running into a problem where some punctuation that I would actually
>> want to keep gets thrown out because they don't get tokenized.  By far the
>> most common case for this is ampersand, but it does happen with others as
>> well.  My concern isn't even so much in that I need to be able to enforce
>> that punctuation in the search, but more that I need to know it was there
>> when I get the results.  I am attaching important word data to the payload
>> of each token, so if a "word" was just an ampersand, it disappears.  I took
>> a quick look at the StandardAnalyzer classes and it looks like it would be
>> a pain to try and modify that directly (I don't have much experience in
>> grammar/parsers).  A couple options come to mind, but I wanted to make sure
>> there wasn't a better, more elegant solution before I did something that
>> felt a little hacky:
>>
>> 1) Add a couple fields to the payload saying whether the previous/next
>> word is a single punctuation mark, and which it is.  Then the search can
>> insert the punctuation in the results.  The downside to this would be
>> losing the metadata that would have gone into the payload for that
>> punctuation mark.
>>
>> 2) Do some sort of string replacement logic during indexing and searching
>> to change it into something that will get made into a token, but should
>> not appear naturally on its own in the text.  I usually shy away from
>> solutions like this, but sometimes they prove useful.
>>
>> Has anyone done anything like this?  I don't want to lose most of
>> StandardAnalyzer's punctuation logic, but mainly I want to tokenize
>> punctuation if it appears by itself (surrounded by whitespace).  Thanks!
>>
>> - Greg
>
> --------------------------
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ