Hey Chris,
There is just such an analyzer, called the PerFieldAnalyzerWrapper. The
trick is that the Analyzer is always passed the Field name when it is asked
for the TokenStream, so the wrapper can delegate to a different Analyzer
for each Field.
-Grant
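For reference, here is a minimal sketch of the dispatch-by-field-name idea in plain Java. It does not use Lucene's classes: `FieldAnalyzers`, the field names `keywords` and `keywords_display`, and the toy "stemmer" are all hypothetical stand-ins. In Lucene of this vintage the real calls are, as far as I know, `new PerFieldAnalyzerWrapper(defaultAnalyzer)` followed by `addAnalyzer(fieldName, analyzer)`, with the wrapper then handed to the IndexWriter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Illustration only: mimics the idea behind PerFieldAnalyzerWrapper
// (choose an analyzer by field name) without depending on Lucene.
final class FieldAnalyzers {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    FieldAnalyzers(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register an analyzer that overrides the default for one field.
    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    // The field name is always supplied, so the choice can be made per field.
    List<String> tokens(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    // Lowercase and split on non-letters; stands in for an unstemmed analyzer.
    static List<String> plain(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Toy "stemmer" (hypothetical, NOT Snowball): only strips a trailing "s".
    static List<String> toyStem(String text) {
        List<String> out = new ArrayList<>();
        for (String t : plain(text)) {
            boolean strip = t.length() > 3 && t.endsWith("s");
            out.add(strip ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stem everything by default, but leave the display field unstemmed.
        FieldAnalyzers analyzers = new FieldAnalyzers(FieldAnalyzers::toyStem);
        analyzers.addAnalyzer("keywords_display", FieldAnalyzers::plain);
        System.out.println(analyzers.tokens("keywords", "Party pictures"));
        System.out.println(analyzers.tokens("keywords_display", "Party pictures"));
    }
}
```

The point of the design is that one IndexWriter can still be created with a single Analyzer object; that object simply routes each field to the analyzer registered for it.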
Chris Brown wrote:
Bear with me, I might be missing something.... My documents get
indexed ( writer.addDocument(doc) ) with one IndexWriter created using
one Analyzer (the SnowballAnalyzer). So unless you can somehow use a
different Analyzer per field I don't see how the second field will
help. If I get the TermFreqVector for a field for a document that was
indexed using the SnowballAnalyzer, isn't it always going to return
stemmed words?
To confirm your assumption, I suppose I am trying to display the
values of the indexed field. It doesn't matter to me whether I count
"party" and "parties" as separate words or not but I cannot display
"parti" to a user as it's not a word.
I'm thinking I need a separate index with the field created using the
StandardAnalyzer unless there's some other trick with mixing Analyzers
I'm unaware of.
Thanks again for your help,
Chris
----- Original Message -----
From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Wednesday, January 11, 2006 8:32 AM
Subject: Re: top n words within a results set?
I believe the usual solution is to have a separate field on the same
document for display purposes (I am assuming you are trying to
display the values of the indexed field) that is not stemmed. The
tradeoff is in disk space, of course.
Chris Brown wrote:
Okay, I've taken Grant's advice and aggregated the TermFreqVectors for
each term in the applicable field. It works quite well; there's just one
glitch.
Some words like "party" and "picture" appear as "parti" and "pictur". I am
using the SnowballAnalyzer, and I suspect that's what's changing the words.
Short of maintaining a second index using a different analyzer, does anyone
have any ideas?
----- Original Message -----
From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Monday, January 09, 2006 12:34 PM
Subject: Re: top n words within a results set?
You could use term vectors to accomplish this. Get your hits for
the website, then load the term vector for the field containing the
keywords and add up the frequencies.
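A rough sketch of the adding-up step, in plain Java so it runs without Lucene. The class `TopTerms` and its methods are hypothetical names. The assumption is that, for each hit, the two arrays would come from the term vector, i.e. `IndexReader.getTermFreqVector(docId, field)` and then `getTerms()` and `getTermFrequencies()` on the result, and that the field was indexed with term vectors stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: sums per-document term frequencies across a set of
// hits and reports the most frequent terms.
final class TopTerms {
    private final Map<String, Integer> totals = new HashMap<>();

    // Add one document's term vector: terms[i] occurs freqs[i] times.
    void add(String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            totals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    // Total frequency seen so far for a term (0 if never seen).
    int total(String term) {
        return totals.getOrDefault(term, 0);
    }

    // The n most frequent terms, highest total first; ties sorted alphabetically.
    List<String> top(int n) {
        List<String> terms = new ArrayList<>(totals.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(totals.get(b), totals.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        TopTerms top = new TopTerms();
        // Each call stands in for one hit's term vector (made-up data).
        top.add(new String[] {"beach", "parti", "pictur"}, new int[] {1, 2, 3});
        top.add(new String[] {"parti", "sunset"}, new int[] {4, 1});
        System.out.println(top.top(2));
    }
}
```

Note that the terms come back exactly as they were indexed, so a stemming analyzer on that field means stemmed terms in the totals.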
Chris Brown wrote:
Hello,
Is it possible to retrieve the top 'n' most frequently appearing words
within a set of search results? I've seen the High Frequency Terms code
in the sandbox but it works across the whole index.
To put this question into context: We're developing a website that
hosts a user's photo website. Searches can be specific to a
particular user's website or be performed globally across one,
many or all websites. I've accomplished this with a field in the
index called website. What I'd like to do is give each user the
top ten words that appear on their website.
Thanks,
Chris Brown
http://www.orangepics.com/
--
-------------------------------------------------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
337 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]