are you perhaps exceeding this (the default limit is 10,000 terms per field)... http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)
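
To show why a per-field term limit would produce exactly these symptoms (words near the start of a big document hit, words near the end don't), here is a rough standalone sketch. It is not Lucene code: the class TruncatedIndexDemo and the helper indexTerms are made up for illustration, and the tokenizing only approximates what a letter-based, lowercasing analyzer does. If this is what's happening, the fix on the Lucene side would presumably be to raise the limit via the setMaxFieldLength(int) method linked above before calling addDocument(), assuming your Lucene version has that method.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Standalone illustration only, NOT Lucene code. It mimics the effect of a
// per-field term limit by keeping just the first maxFieldLength tokens.
class TruncatedIndexDemo {

    // Hypothetical helper: lowercase the text, split on anything that is not
    // an ASCII letter, and keep at most maxFieldLength tokens.
    static Set<String> indexTerms(String text, int maxFieldLength) {
        Set<String> terms = new HashSet<>();
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            if (count >= maxFieldLength) {
                break; // everything past the limit is silently dropped
            }
            terms.add(token);
            count++;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Build a document whose interesting word is token number 10,001.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("filler ");
        }
        sb.append("evaluators");
        String text = sb.toString();

        // With a 10,000-term limit the late word never reaches the "index".
        System.out.println(indexTerms(text, 10000).contains("evaluators"));             // prints false
        // With the limit effectively removed, the word is found.
        System.out.println(indexTerms(text, Integer.MAX_VALUE).contains("evaluators")); // prints true
    }
}
```

A quick way to check this against the real index would be to search for a word that only occurs in the first few pages and one that only occurs in the last few.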

: Date: Fri, 09 Mar 2007 12:14:38 -0500
: From: "Walker, Keith 1" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Words not found, large file indexing
:
: I'm having problems with queries not returning a hit when a document
: does in fact have those terms. (I'm not worried about the ranking, just
: whether or not it's a hit.)
:
: Is anything wrong with the query syntax? (see below) Also, words in the
: document's index (not the Lucene index) seemed less likely to be
: recognized. I'm also wondering if anyone's run into problems with
: large files, since the one I'm using is 161MB, but boils down to 472KB
: as text. The smaller file had no problems.
:
: Thanks for any advice,
: Keith
:
: Here are some of my test results on 2 different documents, with the test
: code below.
:
: query                                                  location of words in document (src: Acrobat)  Test 2
:
: http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472 as extracted text)
: +content:("Research-based")                            310 instances                                 positive
: +content:("Organize Information Clearly")              4 instances                                   positive
: +content:("partitioning")                              3 instances                                   negative
: +content:("distinguishing required")                   1 instance in index                           negative
: +content:("evaluators")                                14 instances                                  negative
: +content:("distinguishing required" AND "evaluators")  (see above)                                   negative
: +content:("partitioning" AND "evaluators")             (see above)                                   negative
:
: automatic_format_identification.pdf (566KB, 53KB as text) v. 1 (not the latest)
: +content:("tentative")                                 several instances                             positive
: +content:("tentative hits")                            several instances                             positive
: +content:("tentative" AND "hits")                      several instances                             positive
: +content:("tentative hits" AND "identification")       several instances                             positive
:
:
: public static void testLuceneIndexing() throws EraException,
:         IOException, ParseException {
:     File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
:     String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
:     File file = new File(filename);
:     if (indexDir.exists()){
:         deleteDirectory(indexDir);
:     }
:     IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
:     Document doc = new Document();
:     doc.add(Field.Text("content", new FileReader(file)));
:     doc.add(Field.Keyword("filename", file.getCanonicalPath()));
:     System.out.println("before addDocument()");
:     long start = System.currentTimeMillis();
:     writer.addDocument(doc);
:     System.out.println("# docs indexed: " + writer.docCount());
:     writer.optimize();
:     writer.close();
:     System.out.println("Done indexing. Duration(ms): " +
:             (System.currentTimeMillis() - start));
:
:     IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());
:
:     Query luceneQuery = null;
:
:     luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body",
:             new SimpleAnalyzer());
:     System.out.println("Query= " + luceneQuery.toString("body"));
:
:     Hits hits = search.search(luceneQuery);
:     int resultLength = hits.length();
:     System.out.println("hit result = " + resultLength);
: }

-Hoss