Re: top n words within a results set?

Grant Ingersoll Wed, 11 Jan 2006 08:08:35 -0800

Hey Chris,

There is just such an analyzer, called the PerFieldAnalyzerWrapper. The trick is the Analyzer always passes in the Field name when it gets the TokenStream,


-Grant
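
For readers following along, here is a minimal sketch of the per-field idea in plain Java. It is not Lucene code: PerFieldDispatch, plain, and toyStem are hypothetical names, the field names "keywords" and "display" are made up, and toyStem is only a crude stand-in for the SnowballAnalyzer. In Lucene itself the equivalent wiring would presumably be a PerFieldAnalyzerWrapper built around a default analyzer, with addAnalyzer(fieldName, analyzer) called for the field that needs different treatment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Toy model of per-field analysis: picks a tokenizing function by field name. */
final class PerFieldDispatch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a field-specific analyzer; other fields use the default. */
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    /** The field name decides which analyzer tokenizes the text. */
    List<String> tokens(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Lowercases and splits on non-letters; stands in for an unstemmed analyzer. */
    static List<String> plain(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Toy stemmer (NOT Snowball): rewrites a trailing "ies" or "y" to "i". */
    static List<String> toyStem(String text) {
        List<String> out = new ArrayList<>();
        for (String t : plain(text)) {
            if (t.endsWith("ies")) {
                out.add(t.substring(0, t.length() - 3) + "i");
            } else if (t.endsWith("y")) {
                out.add(t.substring(0, t.length() - 1) + "i");
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

With plain as the default and toyStem registered for "keywords", the same text yields stemmed tokens for the searchable field and unstemmed tokens for any other field, which is what makes a single index with one stemmed field and one display field workable.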

Chris Brown wrote:

Bear with me, I might be missing something.... My documents get indexed ( writer.addDocument(doc) ) with one IndexWriter created using one Analyzer (the SnowballAnalyzer). So unless you can somehow use a different Analyzer per field I don't see how the second field will help. If I get the TermFreqVector for a field for a document that was indexed using the SnowballAnalyzer, isn't it always going to return stemmed words?
To confirm your assumption, I suppose I am trying to display the values of the indexed field. It doesn't matter to me whether I count "party" and "parties" as separate words or not but I cannot display "parti" to a user as it's not a word.
I'm thinking I need a separate index with the field created using the StandardAnalyzer unless there's some other trick with mixing Analyzers I'm unaware of.
Thanks again for your help,
Chris

----- Original Message ----- From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, January 11, 2006 8:32 AM
Subject: Re: top n words within a results set?
I believe the usual solution is to have a separate field on the same document for display purposes (I am assuming you are trying to display the values of the indexed field) that is not stemmed. The tradeoff is in disk space, of course.
Chris Brown wrote:
Okay, I've taken Grant's advice and aggregated the TermFreqVectors for each term in the applicable field. It works quite well, there's just one glitch.
Some words like "party" and "picture" appear as "parti" and "pictur". I am using the SnowballAnalyzer, I suspect that's what's changing the words.
Short of maintaining a second index using a different analyzer, does anyone have any ideas?

----- Original Message ----- From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, January 09, 2006 12:34 PM
Subject: Re: top n words within a results set?
You could use term vectors to accomplish this. Get your hits for the website, then load the term vector for the field containing the keywords and add up the frequencies.
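
The summing step described here can be sketched in plain Java, leaving the Lucene calls out. This is only an illustration under assumptions: TopTerms, accumulate, and topN are hypothetical names, and each String[]/int[] pair stands in for the parallel arrays a term vector exposes for one hit (in Lucene, presumably the getTerms() and getTermFrequencies() results of the TermFreqVector loaded for that document and field).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sums per-document term frequencies and returns the n most frequent terms. */
final class TopTerms {

    /** Adds one document's parallel term/frequency arrays into the running totals. */
    static void accumulate(Map<String, Integer> totals, String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            totals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    /** Returns up to n entries, highest total first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> totals, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

The caller would run accumulate once per hit in the website-restricted result set and then call topN(totals, 10). Sorting every distinct term is simple but not the cheapest option; for very large vocabularies a bounded priority queue would avoid the full sort.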
Chris Brown wrote:
Hello,
Is it possible to retrieve the top 'n' most often appearing words within a search criteria? I've seen the High Frequency Terms code in the sandbox but it works across the whole index.
To put this question into context: We're developing a website that hosts a user's photo website. Searches can be specific to a particular user's website or be performed globally across one, many or all websites. I've accomplished this with a field in the index called website. What I'd like to do is give each user the top ten words that appear on their website.
Thanks,
Chris Brown

http://www.orangepics.com/
--
-------------------------------------------------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
337 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

