Words not found, large file indexing

Walker, Keith 1 Fri, 09 Mar 2007 09:15:14 -0800

I'm having problems with queries not returning a hit when the document
does in fact contain the query terms.  (I'm not worried about the
ranking, just whether or not it's a hit.)


Is anything wrong with the query syntax (see below)?  Also, words that
appear in the document's own index (the book's index, not the Lucene
index) seem less likely to be found.  I'm also wondering whether anyone
has run into problems with large files: the one I'm using is 161MB as a
PDF, but boils down to 472KB as text.  The smaller file had no problems.

Thanks for any advice,
Keith

Here are some of my test results on 2 different documents, with the test
code below.
Word locations were taken from Acrobat.  The last column is the result
of Test 2.

http://usability.gov/pdfs/guidelines_book.pdf (161MB, 472KB as extracted text)

query                                                  location in document  Test 2
+content:("Research-based")                            310 instances         positive
+content:("Organize Information Clearly")              4 instances           positive
+content:("partitioning")                              3 instances           negative
+content:("distinguishing required")                   1 instance in index   negative
+content:("evaluators")                                14 instances          negative
+content:("distinguishing required" AND "evaluators")  (see above)           negative
+content:("partitioning" AND "evaluators")             (see above)           negative

automatic_format_identification.pdf (566KB, 53KB as text), v. 1 (not the latest)

query                                                  location in document  Test 2
+content:("tentative")                                 several instances     positive
+content:("tentative hits")                            several instances     positive
+content:("tentative" AND "hits")                      several instances     positive
+content:("tentative hits" AND "identification")       several instances     positive


public static void testLuceneIndexing()
        throws EraException, IOException, ParseException {
    File indexDir = new File("D:/kcw/test_data/gate_test/huge_files/index");
    String filename = "D:/kcw/test_data/gate_test/huge_files/hhs.txt";
    File file = new File(filename);
    if (indexDir.exists()) {
        deleteDirectory(indexDir);
    }
    IndexWriter writer = new IndexWriter(indexDir, new SimpleAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("content", new FileReader(file)));
    doc.add(Field.Keyword("filename", file.getCanonicalPath()));
    System.out.println("before addDocument()");
    long start = System.currentTimeMillis();
    writer.addDocument(doc);
    System.out.println("# docs indexed: " + writer.docCount());
    writer.optimize();
    writer.close();
    System.out.println("Done indexing.  Duration(ms): "
            + (System.currentTimeMillis() - start));

    IndexSearcher search = new IndexSearcher(indexDir.getCanonicalPath());

    Query luceneQuery = null;

    luceneQuery = QueryParser.parse("+content:(\"Research-based\")", "body",
            new SimpleAnalyzer());
    System.out.println("Query= " + luceneQuery.toString("body"));

    Hits hits = search.search(luceneQuery);
    int resultLength = hits.length();
    System.out.println("hit result = " + resultLength);
}
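
One way to narrow this down without touching Lucene is to check where in
the extracted text each missing word first occurs.  The sketch below is
a hypothetical helper (the class TokenPositions and its methods are not
part of Lucene or of the test code above).  It splits the text into
lowercased runs of letters, which only approximates what SimpleAnalyzer
does, and reports the token position of the first occurrence of each
word.  If the words that miss all first appear beyond some position
while the words that hit appear before it, that would point at the field
being cut off during indexing rather than at the query syntax.  As far
as I know, IndexWriter in Lucene releases of this period caps the number
of terms indexed per field (maxFieldLength, 10,000 by default), so that
would be the first number to compare against; treat that as an
assumption to verify against the version in use.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Lucene.
class TokenPositions {

    // Splits text into lowercased runs of letters. This approximates
    // SimpleAnalyzer (letter tokenizer plus lowercasing); it is not
    // guaranteed to match Lucene token for token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Returns the 0-based token position of the first occurrence of a
    // single word (compared in lowercase), or -1 if it never occurs.
    static int firstPosition(List<String> tokens, String word) {
        return tokens.indexOf(word.toLowerCase());
    }

    // Usage: java TokenPositions.java <text file> <word> [<word> ...]
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: TokenPositions <text file> <word> [<word> ...]");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])));
        List<String> tokens = tokenize(text);
        System.out.println("total tokens: " + tokens.size());
        for (int i = 1; i < args.length; i++) {
            System.out.println(args[i] + " first at token "
                    + firstPosition(tokens, args[i]));
        }
    }
}
```

Run against hhs.txt with single words such as partitioning and
evaluators, this prints the total token count and the first position of
each word.  Phrases and hyphenated terms such as "Research-based" become
several tokens under this tokenization, so they would need to be checked
one word at a time.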
