Uwe, thanks for your comments. Below is the code I used in this case.
Could you please let me know where and how I have to insert the UNLIMITED
field length? I have put my best guess right after the listing.
Thanks again!
Manjula
code--
// Imports filled in (Lucene 2.x package layout):
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Iterator;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class LuceneDemo {

    public static final String FILES_TO_INDEX_DIRECTORY = "filesToIndex";
    public static final String INDEX_DIRECTORY = "indexDirectory";
    public static final String FIELD_PATH = "path";
    public static final String FIELD_CONTENTS = "contents";

    public static void main(String[] args) throws Exception {
        createIndex();
        //searchIndex("rice AND milk");
        searchIndex("metaphysics");
        //searchIndex("banana");
        //searchIndex("foo");
    }

    // Indexes every file in FILES_TO_INDEX_DIRECTORY: the path is stored
    // untokenized, the contents are tokenized from a Reader (not stored).
    public static void createIndex() throws CorruptIndexException,
            LockObtainFailedException, IOException {
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        boolean recreateIndexIfExists = true;
        // Three-argument constructor; is this where the MaxFieldLength
        // argument goes? (See my guess after the listing.)
        IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
                recreateIndexIfExists);
        File dir = new File(FILES_TO_INDEX_DIRECTORY);
        File[] files = dir.listFiles();
        for (File file : files) {
            Document document = new Document();
            //contents#setOmitNorms(true);
            String path = file.getCanonicalPath();
            document.add(new Field(FIELD_PATH, path, Field.Store.YES,
                    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            Reader reader = new FileReader(file);
            document.add(new Field(FIELD_CONTENTS, reader));
            indexWriter.addDocument(document);
        }
        indexWriter.optimize();
        indexWriter.close();
    }

    // Parses the query against the contents field, prints the number of
    // hits, each hit's score, the explain() breakdown for doc 0, and the
    // path of every matching document.
    public static void searchIndex(String searchString)
            throws IOException, ParseException {
        System.out.println("Searching for '" + searchString + "'");
        Directory directory = FSDirectory.getDirectory(INDEX_DIRECTORY);
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        SnowballAnalyzer analyzer = new SnowballAnalyzer("English",
                StopAnalyzer.ENGLISH_STOP_WORDS);
        QueryParser queryParser = new QueryParser(FIELD_CONTENTS, analyzer);
        Query query = queryParser.parse(searchString);
        Hits hits = indexSearcher.search(query);
        System.out.println("Number of hits: " + hits.length());
        TopDocs results = indexSearcher.search(query, 10);
        ScoreDoc[] hits1 = results.scoreDocs;
        for (ScoreDoc hit : hits1) {
            Document doc = indexSearcher.doc(hit.doc);
            //System.out.printf("%5.3f %s\n", hit.score, doc.get(FIELD_CONTENTS));
            System.out.println(hit.score);
            //Searcher.explain("rice", 0);
            //System.out.println(indexSearcher.explain(query, 0));
        }
        System.out.println(indexSearcher.explain(query, 0));
        //System.out.println(indexSearcher.explain(query, 1));
        //System.out.println(indexSearcher.explain(query, 2));
        //System.out.println(indexSearcher.explain(query, 3));
        Iterator<Hit> it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = it.next();
            Document document = hit.getDocument();
            String path = document.get(FIELD_PATH);
            System.out.println("Hit: " + path);
        }
    }
}
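From your hint, my (untested) guess is that the change goes in
createIndex(), at the point where the IndexWriter is created, using the
constructor that takes a MaxFieldLength argument:

    // Untested guess: index all terms instead of stopping at the
    // default limit (the first 10,000 terms of each document).
    IndexWriter indexWriter = new IndexWriter(INDEX_DIRECTORY, analyzer,
            recreateIndexIfExists, IndexWriter.MaxFieldLength.UNLIMITED);

If the three-argument constructor really defaults to
MaxFieldLength.LIMITED (10,000 terms per document), I think that would
also fit my numbers: 105 would be the occurrences of 'metaphysics' among
the first 10,000 indexed terms rather than all 125 in the 19,078-word
document, and fieldNorm would be 1/sqrt(10000) = 0.01, which the one-byte
norm encoding rounds down to 0.009765625, exactly what explain() printed.
Is that right?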
On Fri, Jul 9, 2010 at 1:06 PM, Uwe Schindler <[email protected]> wrote:
> Maybe you have MaxFieldLength.LIMITED instead of UNLIMITED? Then the number
> of terms per document is limited.
>
> The calculation precision is limited by the float norm encoding; also, if
> your analyzer removed stop words, the norm may not be what you expect.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
> > -----Original Message-----
> > From: manjula wijewickrema [mailto:[email protected]]
> > Sent: Friday, July 09, 2010 9:21 AM
> > To: [email protected]
> > Subject: scoring and index size
> >
> > Hi,
> >
> > I ran a single program to see how Lucene scores a single indexed
> > document. The explain() method gave me the following results.
> > *******************
> >
> > Searching for 'metaphysics'
> >
> > Number of hits: 1
> >
> > 0.030706111
> >
> > 0.030706111 = (MATCH) fieldWeight(contents:metaphys in 0), product of:
> >
> > 10.246951 = tf(termFreq(contents:metaphys)=105)
> >
> > 0.30685282 = idf(docFreq=1, maxDocs=1)
> >
> > 0.009765625 = fieldNorm(field=contents, doc=0)
> >
> > *****************
> >
> > But I encountered the following problems:
> >
> > 1) In this case, I did not change any boost values, so shouldn't
> > fieldNorm = 1/sqrt(terms in field)? (I noticed in the Lucene email
> > archive that the default boost value is 1.)
> >
> > 2) But even when I manually calculate fieldNorm (as 1/sqrt(terms in
> > field)), it only approximately matches the value given by the system.
> > Can this be due to encode/decode precision loss of the norm?
> >
> > 3) My indexed document consisted of 19,078 words in total, including
> > 125 occurrences of the word 'metaphysics' (i.e. my query; I input a
> > single-term query). But as you can see in the output above, the system
> > counts only 105 occurrences of 'metaphysics'. However, when I removed
> > part of the text from the indexed document, counted the occurrences of
> > 'metaphysics' myself, and checked against the system's results, the
> > system counted them correctly. Why this kind of behaviour? Is there a
> > size limit on indexed documents?
> >
> > Could somebody please help me solve these problems?
> >
> > Thanks!
> >
> > Manjula.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>