Amin,
Are you calling close() and optimize() after every addDocument()?
I would suggest something like this:
try
{
    while (reader.next()) // e.g. looping through a data reader
    {
        indexWriter.addDocument(document);
    }
}
finally
{
    commitAndOptimise(indexWriter);
}
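Spelled out a bit more, the shape of the pattern is below. StubWriter is just a hypothetical stand-in so the snippet is self-contained; in your code it would be the real IndexWriter and your data reader:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the add-many-then-close-once pattern. StubWriter is a
// hypothetical stand-in for IndexWriter, not the real Lucene API.
public class BatchIndexing {
    static class StubWriter {
        List<String> docs = new ArrayList<String>();
        boolean closed = false;
        void addDocument(String doc) { docs.add(doc); }
        void commitAndClose() { closed = true; } // commit + optimize + close, once
    }

    static StubWriter indexAll(List<String> documents) {
        StubWriter writer = new StubWriter();
        try {
            for (String doc : documents) {   // e.g. looping through a data reader
                writer.addDocument(doc);
            }
        } finally {
            writer.commitAndClose();         // runs exactly once, even on error
        }
        return writer;
    }

    public static void main(String[] args) {
        StubWriter w = indexAll(Arrays.asList("doc1", "doc2", "doc3"));
        System.out.println(w.docs.size() + " docs indexed, closed=" + w.closed);
        // prints: 3 docs indexed, closed=true
    }
}
```

The point is that the expensive commit/optimize/close work happens once per batch rather than once per document, and the finally block guarantees it runs even if an add throws.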
HTH
Shashi
----- Original Message ----
From: Amin Mohammed-Coleman <[email protected]>
To: [email protected]
Sent: Saturday, January 3, 2009 4:02:52 AM
Subject: Re: Search Problem
Hi again!
I think I may have found the problem but I was wondering if you could verify:
I have the following for my indexer:
public void add(Document document) {
IndexWriter indexWriter =
IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
try {
indexWriter.addDocument(document);
LOGGER.debug("Added Document:" + document + " to index");
commitAndOptimise(indexWriter);
} catch (CorruptIndexException e) {
throw new IllegalStateException(e);
} catch (IOException e) {
throw new IllegalStateException(e);
}
}
The commitAndOptimise(indexWriter) method looks like this:
private void commitAndOptimise(IndexWriter indexWriter) throws
CorruptIndexException,IOException {
LOGGER.debug("Committing document and closing index writer");
indexWriter.optimize();
indexWriter.commit();
indexWriter.close();
}
It seems that if I comment out optimize() then the overview tab in Luke for
the rtf document looks like:
5 id 1234
3 body document
3 body body
1 body test
1 body rtf
1 name rtfDocumentToIndex.rtf
1 body new
1 path rtfDocumentToIndex.rtf
1 summary This is a
1 type RTF_INDEXER
1 body content
This is more like what I expected, although "Amin Mohammed-Coleman" hasn't been
stored in the index. Should I not be using indexWriter.optimize()?
I tried using the search function in Luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ----> returns result
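As a sanity check, here is a rough sketch of the tokens I'd expect StandardAnalyzer to produce from the body text. This is a simplified approximation (lowercase, split on non-alphanumerics, drop English stop words), not the real tokenizer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough approximation of StandardAnalyzer's behaviour for plain words:
// split on non-letters/digits, lowercase, drop common English stop words.
// (Illustrative sketch only -- the real analyzer is more sophisticated.)
public class AnalyzerSketch {
    static final List<String> STOP_WORDS =
        Arrays.asList("a", "an", "and", "are", "as", "at", "be", "but", "by",
                      "for", "if", "in", "into", "is", "it", "no", "not", "of",
                      "on", "or", "such", "that", "the", "their", "then",
                      "there", "these", "they", "this", "to", "was", "will",
                      "with");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            String t = raw.toLowerCase();
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String body = "This is a test rtf document that will be indexed.\n"
                    + "Amin Mohammed-Coleman";
        System.out.println(tokenize(body));
        // expected: [test, rtf, document, indexed, amin, mohammed, coleman]
    }
}
```

On that basis I'd expect "document" and "amin" to match just like "test" and "rtf" do.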
Thanks again...sorry to be sending so many emails about this. I am in the
process of designing and developing a prototype of a document and domain
indexing/searching component and I would like to demo to the rest of my team.
Cheers
Amin
On 3 Jan 2009, at 01:23, Erick Erickson wrote:
> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
>
> For your test document, I'd just print out what you're indexing
> and the field it's going into. *for each field*. that is, every time you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
>
> MetaDataEnum.BODY.getDescription()
>
> may be returning some nonsense. Or
> bodyText.trim()
>
> isn't doing what you expect.
>
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty quickly.
>
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <[email protected]>wrote:
>
>> Hi Erick
>>
>> Thanks for your reply.
>>
>> I have used Luke to inspect the document and I am somewhat confused. For
>> example when I view the index using the overview tab of Luke I get the
>> following:
>>
>> 1 body test
>> 1 id 1234
>> 1 name rtfDocumentToIndex.rtf
>> 1 path rtfDocumentToIndex.rtf
>> 1 summary This is a
>> 1 type RTF_INDEXER
>> 1 body rtf
>>
>>
>> However when I view the document in the Document tab I get the full text
>> that was extracted from the rtf document (field: body), which is:
>>
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>>
>> I am using the StandardAnalyzer, therefore I wouldn't expect the words
>> "document", "indexed", or "Amin Mohammed-Coleman" to be removed.
>>
>> I have referenced the Lucene In Action book and I can't see what I may be
>> doing wrong. I would be happy to provide a testcase should it be required.
>> When adding the body field to the document I am doing:
>>
>> Document document = new Document();
>> Field field = new Field(FieldNameEnum.BODY.getDescription(),
>> bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>> document.add(field);
>>
>>
>>
>> When I run the search code the string "test" is the only word that returns
>> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
>> "indexed").
>>
>> Thanks again for your help and advice.
>>
>>
>> Cheers
>> Amin
>>
>>
>>
>>
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>
>>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>>
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> I have tried this and it doesn't work. I don't understand why using
>>>> "amin" instead of "Amin" would work; is it not case-insensitive?
>>>>
>>>> I tried "test" for field "body" and this works. Other terms don't work,
>>>> for example:
>>>>
>>>> "document"
>>>> "indexed"
>>>>
>>>> These are tokens that were extracted when creating the Lucene document.
>>>>
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>>
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>
>>>>> Basically Lucene stores analyzed tokens, and looks up matches based
>>>>> on the tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>>>> Term("body",
>>>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>>>
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site (anonymous per request),
>>>>> got 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>
>>>>>>> --
>>>>>>> Chris Lu
>>>>>>>
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I have created an RTFHandler which takes an RTF file and creates a
>>>>>>>>> Lucene Document which is indexed. The RTFHandler looks something
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>> if (bodyText != null) {
>>>>>>>>> Document document = new Document();
>>>>>>>>> Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>> bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>> document.add(field);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I am using Java's built-in RTF text extraction. When I run my test to
>>>>>>>>> verify that the document contains the text I expect, this works fine.
>>>>>>>>> I get the following when I print the document:
>>>>>>>>>
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test
>>>>>>>>> rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem is that when I use the following to search, I get no result:
>>>>>>>>>
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new Searchable[]
>>>>>>>>> {rtfIndexSearcher});
>>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>>> multiSearcher.close();
>>>>>>>>>
>>>>>>>>> rtfIndexSearcher is configured with the directory that holds rtf
>>>>>>>>> documents. I have used Luke to look at the document and what I am
>>>>>>>>> finding
>>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>>
>>>>>>>>> 1 body test
>>>>>>>>> 1 id 1234
>>>>>>>>> 1 name rtfDocumentToIndex.rtf
>>>>>>>>> 1 path rtfDocumentToIndex.rtf
>>>>>>>>> 1 summary This is a
>>>>>>>>> 1 type RTF_INDEXER
>>>>>>>>> 1 body rtf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would expect to get a hit using "Amin" or even "document". I am not
>>>>>>>>> sure whether the line:
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>
>>>>>>>>> is incorrect, as I am not sure of the meaning of "Finds the top n hits
>>>>>>>>> for query." for search(Query query, int n) in the Javadocs.
>>>>>>>>>
>>>>>>>>> I would be grateful if someone could advise on what I may be doing
>>>>>>>>> wrong. I am using Lucene 2.4.0.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>>