Re: Lucene - PDFBox

田春峰 Wed, 25 May 2005 15:29:43 -0700

Hi,
   I agree with Ben Litchfield.
   Before feeding the extracted text into the Lucene indexer,
you should check the extracted text. For now, I use

java org.pdfbox.ExtractText <pdf-file>

to get the text out of the PDF.

[quote]   

"Ben Litchfield" <[EMAIL PROTECTED]>  

Can you run the following command line application on
the PDF to verify that the extracted text is correct?

java org.pdfbox.ExtractText <pdf-file>

Ben
[/quote]
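
Once the text has been extracted, the check can also be done mechanically. Below is a minimal sketch in plain Java (no Lucene or PDFBox calls), assuming the ExtractText output has been pasted or read into a string. The class name PhraseCheck, its methods, and the "scrambled" sample are hypothetical, and the tokenizer is only a rough stand-in for StandardAnalyzer. It tests whether the words of a phrase appear as consecutive tokens, which is what a quoted phrase search depends on, versus merely appearing somewhere in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseCheck {

    // Lowercase the text and split on anything that is not a letter or digit.
    // This only approximates an analyzer; it is NOT Lucene's StandardAnalyzer.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if every word of the phrase occurs somewhere in the text, in any order.
    public static boolean containsAllTerms(String text, String phrase) {
        return tokenize(text).containsAll(tokenize(phrase));
    }

    // True if the words of the phrase occur as consecutive tokens, in order.
    public static boolean containsPhrase(String text, String phrase) {
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(text), words) >= 0;
    }

    public static void main(String[] args) {
        String phrase = "has good taste";
        String clean = "Dave has good taste";
        // Hypothetical example of text that came out of extraction in the wrong order.
        String scrambled = "Dave good has taste";

        System.out.println("clean, all terms:     " + containsAllTerms(clean, phrase));
        System.out.println("clean, phrase:        " + containsPhrase(clean, phrase));
        System.out.println("scrambled, all terms: " + containsAllTerms(scrambled, phrase));
        System.out.println("scrambled, phrase:    " + containsPhrase(scrambled, phrase));
    }
}
```

If containsAllTerms is true but containsPhrase is false for the text extracted from your PDF, the words are present but not adjacent in the extracted text, which would be consistent with term searches hitting and phrase searches missing. Whether that is what happens with these particular PDFs is something only the ExtractText output can show.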





--- Thomas X Hoban <[EMAIL PROTECTED]> wrote:

>     
> 
> First, I am new to Lucene.
> 
> Is there anyone out there who has had trouble
> getting hits when running phrase queries against an
> index that contains content from PDF files?  For PDF
> documents, I create the document using
> LucenePDFDocument.getDocument(file) and then add it
> to the index.  For non-pdf documents, I create the
> document using FileDocument.Document(file).
> 
> For instance, I add documents with the following
> text:
> 
> pdf1.pdf -- "Dave has good taste"
> pdf2.pdf -- "Tom has good taste"
> word1.doc -- "Liz has bad taste"
> word2.doc -- "Troy has bad taste"
> 
> When I search content for the following strings:
> 
>     has good taste
>       get expected results with hits on pdf1.pdf,
> pdf2.pdf, word1.doc and word2.doc
> 
>     "has good taste"
>        get unexpected result: 0 hits
> 
>     "has bad taste"
>        get expected results with hits on word1.doc
> and word2.doc
>  
> It seems that searching for individual words works
> fine for both PDF and non-pdf files.  However,
> searching on a phrase (enclosed in quotes) works on
> non-pdf files but not on files parsed with the
> LucenePDFDocument class.
> 
> Can anyone offer advice?
> 
> Below is code for index creation.  It is the demo
> IndexFiles class provided with Lucene along with
> some changes...
> 
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Document;
> 
> import java.io.File;
> import java.io.FileNotFoundException;
> import java.io.IOException;
> import java.util.Date;
> 
> //import javax.activation.MimetypesFileTypeMap;
> 
> import org.pdfbox.searchengine.lucene.LucenePDFDocument;
> 
> 
> class IndexFiles {
>   public static void main(String[] args) throws IOException {
>     String usage = "java " + IndexFiles.class + " <root_directory>";
>     if (args.length == 0) {
>       System.err.println("Usage: " + usage);
>       System.exit(1);
>     }
> 
>     Date start = new Date();
>     try {
>       IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
>       indexDocs(writer, new File(args[0]));
> 
>       writer.optimize();
>       writer.close();
> 
>       Date end = new Date();
> 
>       System.out.print(end.getTime() - start.getTime());
>       System.out.println(" total milliseconds");
> 
>     } catch (IOException e) {
>       System.out.println(" caught a " + e.getClass() +
>        "\n with message: " + e.getMessage());
>     }
>   }
> 
>   public static void indexDocs(IndexWriter writer, File file)
>     throws IOException {
>     // do not try to index files that cannot be read
> 
>     if (file.canRead()) {
>       if (file.isDirectory()) {
>         String[] files = file.list();
>         // an IO error could occur
>         if (files != null) {
>           for (int i = 0; i < files.length; i++) {
>             indexDocs(writer, new File(file, files[i]));
>           }
>         }
>       } else {
>         System.out.println("adding " + file);
>         try {
> 
>           Document doc = null;
>           if (file.getName().indexOf(".pdf") >= 0)
>               // writer.addDocument(LucenePDFDocument.getDocument(file));
>               doc = LucenePDFDocument.getDocument(file);
>           else
>               doc = FileDocument.Document(file);
> 
>           Field field = null; 
>           if (file.getPath().indexOf("case1") >= 0)
>               field = new Field("caseid", "1", false, true, false);
>           else if (file.getPath().indexOf("case2") >= 0)
>               field = new Field("caseid", "2", false, true, false);
>           else if (file.getPath().indexOf("case3") >= 0)
>               field = new Field("caseid", "3", false, true, false);
>           else 
>               field = new Field("caseid", "0", false, true, false);
> 
>           doc.add(field);
> 
>           writer.addDocument(doc);
>         }
>         // at least on windows, some temporary files raise this exception with an "access denied" message
>         // checking if the file can be read doesn't help
>         catch (FileNotFoundException fnfe) {
>           ;
>         }
>       }
>     }
>   }
> }
> 
> 
> Here is the SearchFiles class with some minor
> modifications...
> 
> import java.io.IOException;
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.util.StringTokenizer;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.queryParser.ParseException;
> 
> class SearchFiles {
> 
>   private static Query getCaseQuery(String line, Analyzer analyzer)
>   throws ParseException {
>       BooleanQuery bq = new BooleanQuery();
>       StringTokenizer st = new StringTokenizer(line);
>       Query query = QueryParser.parse(line, "contents", analyzer);
>       String caseId = null;
>       while (st.hasMoreTokens()) {
>           caseId = st.nextToken();
>           System.out.println("build case query for " + caseId);
>           
>           query = QueryParser.parse(caseId, "caseid", analyzer);
>           bq.add(query, false, false);
>       }
> 
>       return bq;
>   }
>   public static void main(String[] args) {
> 
=== message truncated ===



Auto signature:
Please use the bot services:
MSN bot: [EMAIL PROTECTED]
QQ bot: 443803193
blog: http://blog.csdn.net/accesine960
多么乐 homepage: http://www.domolo.com
 









