Are you perhaps exceeding the IndexWriter's maximum field length? See:

http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
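If that limit is the cause, it would explain the pattern in your results: as far as I know the default is 10,000 terms per field, and 472KB of text is well beyond that, so phrases near the start of the file are found and later ones are not. Calling writer.setMaxFieldLength(Integer.MAX_VALUE) before addDocument() should be enough to check. Below is a rough, Lucene-free sketch of the effect; it is not Lucene's implementation, and MaxFieldLengthDemo, indexedTerms and sampleText are hypothetical names made up for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MaxFieldLengthDemo {

    // Hypothetical helper: returns the set of terms that would be kept for a
    // field when only the first maxFieldLength tokens are indexed.
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        Set<String> terms = new HashSet<String>();
        int count = 0;
        // Lowercase and split on non-letters: only a rough approximation
        // of what SimpleAnalyzer does (assumption, ASCII letters only).
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            if (count >= maxFieldLength) {
                break; // everything past the limit is silently dropped
            }
            terms.add(token);
            count++;
        }
        return terms;
    }

    // Hypothetical helper: builds fillerCount filler words followed by lastWord.
    static String sampleText(int fillerCount, String lastWord) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fillerCount; i++) {
            sb.append("filler ");
        }
        sb.append(lastWord);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "evaluators" is token number 12,001, past a 10,000-term limit.
        String text = sampleText(12000, "evaluators");
        System.out.println("limit 10000:   "
                + indexedTerms(text, 10000).contains("evaluators"));
        System.out.println("limit removed: "
                + indexedTerms(text, Integer.MAX_VALUE).contains("evaluators"));
    }
}
```

With the limit in place the late term never reaches the index, so no query syntax will find it; with the limit removed it is kept.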


: Date: Fri, 09 Mar 2007 12:14:38 -0500
: From: "Walker, Keith 1" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Words not found, large file indexing
:
: I'm having problems with queries not returning a hit when a document
: does in fact have those terms.  (I'm not worried about the ranking, just
: whether or not it's a hit.)
:
: Is anything wrong with the query syntax? (see below)  Also, words in the
: document's index (not the Lucene index) seemed less likely to be
: recognized.   I'm also wondering if anyone's run into problems with
: large files, since the one I'm using is 161MB, but boils down to 472KB
: as text.  The smaller file had no problems.
:
: Thanks for any advice,
: Keith
:
: Here are some of my test results on 2 different documents, with the test
: code below.
: query                                                   location of words in document (src: Acrobat)  Test 2
: http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)
: +content:("Research-based")                             310 instances                                 positive
: +content:("Organize Information Clearly")               4 instances                                   positive
:
: +content:("partitioning")                               3 instances                                   negative
: +content:("distinguishing required")                    1 instance in index                           negative
:
: +content:("evaluators")                                 14 instances                                  negative
: +content:("distinguishing required" AND "evaluators")   (see above)                                   negative
: +content:("partitioning" AND "evaluators")              (see above)                                   negative
:
:
: automatic_format_identification.pdf (566KB, 53KB as text)  v. 1 (not the latest)
: +content:("tentative")                                  several instances                             positive
: +content:("tentative hits")                             several instances                             positive
: +content:("tentative" AND "hits")                       several instances                             positive
:
: +content:("tentative hits" AND "identification")        several instances                             positive
:
:
: public static void testLuceneIndexing()
:         throws EraException, IOException, ParseException {
:     File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
:     String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
:     File file = new File(filename);
:     if (indexDir.exists()) {
:         deleteDirectory(indexDir);
:     }
:     IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
:     Document doc = new Document();
:     doc.add(Field.Text("content", new FileReader(file)));
:     doc.add(Field.Keyword("filename", file.getCanonicalPath()));
:     System.out.println("before addDocument()");
:     long start = System.currentTimeMillis();
:     writer.addDocument(doc);
:     System.out.println("# docs indexed: " + writer.docCount());
:     writer.optimize();
:     writer.close();
:     System.out.println("Done indexing.  Duration(ms): "
:             + (System.currentTimeMillis() - start));
:
:     IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());
:
:     Query luceneQuery = null;
:
:     luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body",
:             new SimpleAnalyzer());
:     System.out.println("Query= " + luceneQuery.toString("body"));
:
:     Hits hits = search.search(luceneQuery);
:     int resultLength = hits.length();
:     System.out.println("hit result = " + resultLength);
: }
:
:



-Hoss

