Amin,

Are you calling close() and optimize() after every addDocument()?
I would suggest something like this:

    try {
        while (...) {                    // e.g. looping through a data reader
            indexWriter.addDocument(document);
        }
    } finally {
        commitAndOptimise(indexWriter);
    }

HTH
Shashi

----- Original Message -----
From: Amin Mohammed-Coleman <ami...@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, January 3, 2009 4:02:52 AM
Subject: Re: Search Problem

Hi again!

I think I may have found the problem, but I was wondering if you could verify. I have the following for my indexer:

    public void add(Document document) {
        IndexWriter indexWriter = IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
        try {
            indexWriter.addDocument(document);
            LOGGER.debug("Added Document:" + document + " to index");
            commitAndOptimise(indexWriter);
        } catch (CorruptIndexException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

commitAndOptimise(indexWriter) looks like this:

    private void commitAndOptimise(IndexWriter indexWriter) throws CorruptIndexException, IOException {
        LOGGER.debug("Committing document and closing index writer");
        indexWriter.optimize();
        indexWriter.commit();
        indexWriter.close();
    }

It seems that if I comment out optimize() then the Overview tab in Luke for the rtf document looks like:

    5  id       1234
    3  body     document
    3  body     body
    1  body     test
    1  body     rtf
    1  name     rtfDocumentToIndex.rtf
    1  body     new
    1  path     rtfDocumentToIndex.rtf
    1  summary  This is a
    1  type     RTF_INDEXER
    1  body     content

This is more like what I expected, although "Amin Mohammed-Coleman" still hasn't been stored in the index. Should I not be using indexWriter.optimize()?

I tried using the search function in Luke and got the following results:

    body:test      ---> returns result
    body:document  ---> no result
    body:content   ---> no result
    body:rtf       ---> returns result

Thanks again... sorry to be sending so many emails about this. I am in the process of designing and developing a prototype of a document and domain indexing/searching component and I would like to demo it to the rest of my team.

Cheers
Amin

On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
>
> For your test document, I'd just print out what you're indexing
> and the field it's going into, *for each field*. That is, every time you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
>
>     MetaDataEnum.BODY.getDescription()
>
> may be returning some nonsense. Or
>
>     bodyText.trim()
>
> isn't doing what you expect.
>
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people, and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty quickly.
>
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
>
> Best
> Erick
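(A quick sketch of the per-field printing Erick suggests, dropped next to the document.add() calls. MetaDataEnum, bodyText and document are the names used in this thread; the exact placement of the printlns is only illustrative:)

    // print what is about to go into the field, before document.add(field)
    System.out.println("adding [" + MetaDataEnum.BODY.getDescription() + "] = [" + bodyText.trim() + "]");

    // or dump every field already added (Lucene 2.4: getFields() returns a List of Fieldable)
    for (Object o : document.getFields()) {
        org.apache.lucene.document.Fieldable f = (org.apache.lucene.document.Fieldable) o;
        System.out.println(f.name() + " -> [" + f.stringValue() + "]");
    }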
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:
>
>> Hi Erick
>>
>> Thanks for your reply.
>>
>> I have used Luke to inspect the document and I am somewhat confused.
>> For example, when I view the index using the Overview tab of Luke I get the
>> following:
>>
>>     1  body     test
>>     1  id       1234
>>     1  name     rtfDocumentToIndex.rtf
>>     1  path     rtfDocumentToIndex.rtf
>>     1  summary  This is a
>>     1  type     RTF_INDEXER
>>     1  body     rtf
>>
>> However, when I view the document in the Document tab I get the full text
>> that was extracted from the rtf document (field: body), which is:
>>
>>     This is a test rtf document that will be indexed.
>>     Amin Mohammed-Coleman
>>
>> I am using the StandardAnalyzer, therefore I wouldn't expect the words
>> "document", "indexed", "Amin Mohammed-Coleman" to be removed.
>>
>> I have referenced the Lucene in Action book and I can't see what I may be
>> doing wrong. I would be happy to provide a testcase should it be required.
>> When adding the body field to the document I am doing:
>>
>>     Document document = new Document();
>>     Field field = new Field(FieldNameEnum.BODY.getDescription(), bodyText.trim(),
>>             Field.Store.YES, Field.Index.ANALYZED);
>>     document.add(field);
>>
>> When I run the search code the string "test" is the only word that returns
>> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
>> "indexed").
>>
>> Thanks again for your help and advice.
>>
>> Cheers
>> Amin
>>
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>
>>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>>
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I have tried this and it doesn't work. I don't understand why using "amin"
>>>> instead of "Amin" would work; is it not case insensitive?
>>>>
>>>> I tried "test" for field "body" and this works. Other terms don't work,
>>>> for example:
>>>>
>>>>     "document"
>>>>     "indexed"
>>>>
>>>> These are tokens that were extracted when creating the Lucene document.
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>
>>>>> Basically Lucene stores analyzed tokens, and looks up matches based on
>>>>> the tokens. "Amin" after StandardAnalyzer is "amin", so you need to use
>>>>> new Term("body", "amin"), instead of new Term("body", "Amin"), to search.
>>>>>
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per request) got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>
>>>>>> Cheers
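(A minimal sketch of the point Chris and Erick are making: let the query text go through the same analyzer used at index time instead of hand-building the TermQuery. Assumes Lucene 2.4, StandardAnalyzer, and a Searcher named "searcher" over the same index; "Amin" then reaches the index as the token "amin":)

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("Amin");                  // analyzed to "amin"; parse() throws ParseException
    TopDocs topDocs = searcher.search(query, 10);
    System.out.println("hits: " + topDocs.totalHits);

    // or, if you stay with TermQuery, lower-case the term yourself:
    TermQuery termQuery = new TermQuery(new Term("body", "amin"));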
>>>>>>
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>
>>>>>>> --
>>>>>>> Chris Lu
>>>>>>> -------------------------
>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>> site: http://www.dbsight.net
>>>>>>> demo: http://search.dbsight.com
>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>> DBSight customer, a shopping comparison site, (anonymous per request) got
>>>>>>> 2.6 Million Euro funding!
>>>>>>>
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <ami...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I have created an RTFHandler which takes an RTF file and creates a Lucene
>>>>>>>> Document which is indexed. The RTFHandler looks something like this:
>>>>>>>>
>>>>>>>>     if (bodyText != null) {
>>>>>>>>         Document document = new Document();
>>>>>>>>         Field field = new Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>>>                 Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>         document.add(field);
>>>>>>>>     }
>>>>>>>>
>>>>>>>> I am using the Java built-in RTF text extraction. When I run my test to
>>>>>>>> verify that the document contains the text I expect, this works fine. I get
>>>>>>>> the following when I print the document:
>>>>>>>>
>>>>>>>>     Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>>>>     document that will be indexed.
>>>>>>>>     Amin Mohammed-Coleman>
>>>>>>>>     stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>     stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>     stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>     stored/uncompressed,indexed<summary:This is a>
>>>>>>>>
>>>>>>>> The problem is that when I use the following to search, I get no result:
>>>>>>>>
>>>>>>>>     MultiSearcher multiSearcher = new MultiSearcher(new Searchable[] {rtfIndexSearcher});
>>>>>>>>     Term t = new Term("body", "Amin");
>>>>>>>>     TermQuery termQuery = new TermQuery(t);
>>>>>>>>     TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>     System.out.println(topDocs.totalHits);
>>>>>>>>     multiSearcher.close();
>>>>>>>>
>>>>>>>> rtfIndexSearcher is configured with the directory that holds the rtf
>>>>>>>> documents. I have used Luke to look at the document, and what I am finding
>>>>>>>> in the Overview tab is the following for the document:
>>>>>>>>
>>>>>>>>     1  body     test
>>>>>>>>     1  id       1234
>>>>>>>>     1  name     rtfDocumentToIndex.rtf
>>>>>>>>     1  path     rtfDocumentToIndex.rtf
>>>>>>>>     1  summary  This is a
>>>>>>>>     1  type     RTF_INDEXER
>>>>>>>>     1  body     rtf
>>>>>>>>
>>>>>>>> However, on the Document tab I am getting (in the body field):
>>>>>>>>
>>>>>>>>     This is a test rtf document that will be indexed.
>>>>>>>>     Amin Mohammed-Coleman
>>>>>>>>
>>>>>>>> I would expect to get a hit using "Amin" or even "document". I am not
>>>>>>>> sure whether the line:
>>>>>>>>
>>>>>>>>     TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>
>>>>>>>> is incorrect, as I am not too sure of the meaning of "Finds the top n hits
>>>>>>>> for query." for search(Query query, int n) according to the Javadocs.
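(On search(Query query, int n): n only limits how many of the top-scoring hits come back in topDocs.scoreDocs; topDocs.totalHits still reports how many documents matched in total. A small sketch, reusing the multiSearcher and termQuery from the snippet above:)

    TopDocs topDocs = multiSearcher.search(termQuery, 10);
    System.out.println("total matches: " + topDocs.totalHits);   // count of all matching documents
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {                // at most 10 entries here
        Document hit = multiSearcher.doc(scoreDoc.doc);
        System.out.println(scoreDoc.score + "  " + hit.get("path"));
    }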
>>>>>>>>
>>>>>>>> I would be grateful if someone may be able to advise on what I may be
>>>>>>>> doing wrong. I am using Lucene 2.4.0.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Amin
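(Since a testcase was offered earlier in the thread, here is a minimal, self-contained sketch that isolates the analyzer/term-case question. It assumes Lucene 2.4; the class name and the RAMDirectory setup are illustrative rather than code from the thread:)

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class BodyFieldSearchTest {
        public static void main(String[] args) throws Exception {
            RAMDirectory directory = new RAMDirectory();
            Analyzer analyzer = new StandardAnalyzer();

            // index one document with an analyzed, stored "body" field
            IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
            Document document = new Document();
            document.add(new Field("body",
                    "This is a test rtf document that will be indexed. Amin Mohammed-Coleman",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(document);
            writer.commit();
            writer.close();

            // query with the same analyzer so "Amin" is searched as the token "amin"
            IndexSearcher searcher = new IndexSearcher(directory);
            Query query = new QueryParser("body", analyzer).parse("Amin");
            TopDocs topDocs = searcher.search(query, 10);
            System.out.println("hits for 'Amin': " + topDocs.totalHits);   // expect 1
            searcher.close();
        }
    }

If this prints 1 but the real indexer still only finds "test" and "rtf", the difference is most likely in what actually reaches document.add() in the real code, which is what the per-field printing earlier in the thread should reveal.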