I believe that if you add an identical document twice, a search will return it twice. If you don't want duplicate results, I think you will need to keep a HashSet of the terms you have already indexed, and not add the document if the lowercase values are equal (or something along those lines).
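
Roughly, the bookkeeping could look like the sketch below. It is only an illustration of the idea: CaseInsensitiveTermSet and addIfNew are made-up names, not Lucene classes, and it assumes you call it yourself for each term before deciding whether to add anything to the index.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: remembers which terms have
// already been seen, comparing them case-insensitively.
final class CaseInsensitiveTermSet {
    // Stores the lowercased form of every term seen so far.
    private final Set<String> seen = new HashSet<>();

    // Returns true if the term is new (ignoring case), false if the same
    // term in any casing was added before.
    boolean addIfNew(String term) {
        return seen.add(term.toLowerCase(Locale.ROOT));
    }

    // Number of distinct terms, ignoring case.
    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        CaseInsensitiveTermSet terms = new CaseInsensitiveTermSet();
        for (String t : new String[] {"term", "Term", "TERM", "index"}) {
            if (terms.addIfNew(t)) {
                System.out.println("new term: " + t);
            } else {
                System.out.println("skipped duplicate: " + t);
            }
        }
    }
}
```

Lowercasing with Locale.ROOT keeps the comparison independent of the machine's default locale, which is an assumption on my part about what you want here.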
Dan

-----Original Message-----
From: Thomas Krämer [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 11, 2003 3:01 PM
To: Lucene Users List
Subject: build a case insensitive index

Hello Lucene Users

i need a document term matrix to initialize a neural network, that i want to use to integrate user feedback in the retrieval process. until now, i am using a slightly modified class of the IndexHTML example.

how can i create an index of all the terms in a collection without "term" and "Term" being indexed twice?

in the example, a standard analyzer is used, and in the documentation it says: Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter. So, why do i get double entries for terms in upper- and lower case writing?

Regards.
Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
