Well, the size wouldn't be a problem, we could afford the extra field.
But it would seem to complicate the search quite a lot. I'd have to run
the search terms through both analyzers. It would be much simpler if the
characters were indexed as separate tokens.
Patrick Turcotte wrote:
Hi,
Don't know the size of your dataset. But, couldn't you index in 2
fields, with PerFieldAnalyzer, tokenizing with Standard for 1 field,
and WhiteSpace for the other.
Then use multiple field query (there is a query parser for that, just
don't remember the name right now).
Patrick
On 10/1/07, John Byrne <[EMAIL PROTECTED]> wrote:
Whitespace analyzer does preserve those symbols, but not as tokens. It
simply leaves them attached to the original term.
As an example of what I'm talking about, consider a document that
contains (without the quotes) "foo, ".
Now, using WhitespaceAnalyzer, I could only get that document by
searching for "foo,". Using StandardAnalyzer or any analyzer that
removes punctuation, I could only find it by searching for "foo".
I want an analyzer that will allow me to find it if I build a phrase
query with the term "foo" followed immediately by ",". After all, the
comma may be relevant to the search, but is definitely not part of the
word.
Extending StandardAnalyer is what I had in mind, but I don't know where
to start. I also wonder why no-one seems to have done it before- it
makes me suspect that there's some reason I haven't seen yet that makes
it impossible ot impractical.
Karl Wettin wrote:
1 okt 2007 kl. 15.33 skrev John Byrne:
Has anyone written an analyzer that preserves puncuation and
synmbols ("£", "$", "%" etc.) as tokens?
WhitespaceAnalyzer?
You could also extend the lexical rules of StandardAnalyzer.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]