I'm having problems with queries not returning a hit when a document does in fact have those terms. (I'm not worried about the ranking, just whether or not it's a hit.)
Is anything wrong with the query syntax? (see below) Also, words in the document's index (not the Lucene index) seemed less likely to be recognized.

I'm also wondering if anyone's run into problems with large files, since the one I'm using is 161MB, but boils down to 472KB as text. The smaller file had no problems.

Thanks for any advice,
Keith

Here are some of my test results on 2 different documents, with the test code below.

http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)

query                                                   location of words in document (src: Acrobat)   Test 2
+content:("Research-based")                             310 instances                                  positive
+content:("Organize Information Clearly")               4 instances                                    positive
+content:("partitioning")                               3 instances                                    negative
+content:("distinguishing required")                    1 instance in index                            negative
+content:("evaluators")                                 14 instances                                   negative
+content:("distinguishing required" AND "evaluators")   (see above)                                    negative
+content:("partitioning" AND "evaluators")              (see above)                                    negative

automatic_format_identification.pdf (566KB, 53KB as text) v. 1 (not the latest)

query                                                   location of words in document (src: Acrobat)   Test 2
+content:("tentative")                                  several instances                              positive
+content:("tentative hits")                             several instances                              positive
+content:("tentative" AND "hits")                       several instances                              positive
+content:("tentative hits" AND "identification")        several instances                              positive

    public static void testLuceneIndexing()
            throws EraException, IOException, ParseException {
        File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
        String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
        File file = new File(filename);

        // Start from an empty index directory on every run.
        if (indexDir.exists()) {
            deleteDirectory(indexDir);
        }

        // Index the extracted text as a single document.
        IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("content", new FileReader(file)));
        doc.add(Field.Keyword("filename", file.getCanonicalPath()));
        System.out.println("before addDocument()");
        long start = System.currentTimeMillis();
        writer.addDocument(doc);
        System.out.println("# docs indexed: " + writer.docCount());
        writer.optimize();
        writer.close();
        System.out.println("Done indexing. Duration(ms): "
                + (System.currentTimeMillis() - start));

        // Search the new index with one of the test queries.
        IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());
        Query luceneQuery = QueryParser.parse(
                "+content:(\"Research-based\")", "body", new SimpleAnalyzer());
        System.out.println("Query= " + luceneQuery.toString("body"));
        Hits hits = search.search(luceneQuery);
        int resultLength = hits.length();
        System.out.println("hit result = " + resultLength);
    }
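
Since the misses seem to cluster on words from the back of the large file (the book's own index), one thing worth checking is how far into the text each term first appears. Lucene's IndexWriter of this era has a maxFieldLength setting (default 10,000 terms per field), so terms that only occur past that point may never reach the index. Below is a rough sketch for checking this, not a definitive diagnosis. It does not call Lucene at all; the class name TermPositionCheck and its methods are made up for illustration, and the tokenizer only approximates SimpleAnalyzer by splitting on non-letters and lowercasing. It handles single-word terms only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: reports where a term first
// appears in the extracted text, counted in tokens.
class TermPositionCheck {

    // Approximation of SimpleAnalyzer: runs of letters, lowercased.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Zero-based token position of the first occurrence, or -1 if absent.
    static int firstPosition(List<String> tokens, String term) {
        return tokens.indexOf(term.toLowerCase());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java TermPositionCheck <text file> <term> [<term> ...]");
            return;
        }
        // Reads with the platform default charset, as FileReader does.
        String text = new String(Files.readAllBytes(Paths.get(args[0])));
        List<String> tokens = tokenize(text);
        System.out.println("total tokens: " + tokens.size());
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " -> first token position "
                    + firstPosition(tokens, args[i]));
        }
    }
}
```

Run against hhs.txt with terms such as partitioning and evaluators. If every term that misses first appears beyond position 10,000 while the terms that hit first appear before it, that would suggest the per-field cap rather than the query syntax, and raising writer.maxFieldLength before addDocument() would be the thing to try.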