Thanks for the reply, Erick. I would like to index this content permanently, keep a permanent copy of the index, and search it multiple times for different terms.
My problem is that I don't know how to retrieve the page number where the searched string was found, so any help with that would be great.

// I would start like this:

// This part of the code would create the index, right?
Document luceneDocument = LucenePDFDocument.getDocument( f );
IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(), true);
iwriter.addDocument(luceneDocument);
iwriter.close();

// And now for the search:
Directory fsDir = FSDirectory.getDirectory(index_dir, false);
IndexSearcher ind_search = new IndexSearcher(fsDir);

// I'm not sure whether "fieldname" should be the string that I'm searching for?
QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
Query query = parser.parse(q);
Hits hits = ind_search.search(query);

// And I'm stuck here. I don't know how to retrieve the page number.

Erick Erickson wrote:
>
> It depends (tm). Do you want to permanently index this content and search
> it multiple times, or is each search a one-off? If the latter, I'd look for
> packages specific to handling PDF files. Although Reader takes forever to
> search a document, so I suspect there's not much joy there.
>
> If you want to parse the file once and search it many times, then yes,
> Lucene can help a lot. You could conceivably do this in a memory index if
> you didn't want a permanent copy. In this scheme, you'd index the file
> before the first search, then use the in-memory index until you were done
> searching (assuming you wanted to search for different terms multiple
> times). You'd have to do some record-keeping to remember what the start and
> end offset of each page was so you could deal with the case that a phrase
> you search for started on one page and ended on another.....
>
> If this is off base, perhaps you could provide more details...
>
> Erick
>
> On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idrag...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> I have to search a single PDF document for a requested string, and if that
>> string is found, I need to return the page number where it was found.
>> The requested string can be anything in the PDF document.
>>
>> It is a big document (about 5000 pages), so I'm asking if that is possible
>> with Lucene.
>>
>> I'm using the pdfbox class and I found a way to do it (a substring search,
>> page by page), but it is too slow:
>>
>> PDDocument pddDocument=PDDocument.load(f);
>>
>> PDFTextStripper textStripper=new PDFTextStripper();
>> int lastpage = textStripper.getEndPage();
>> String page= null;
>> int found= 0;
>>
>> for(int i=1; i<lastpage ; i++){
>>     textStripper.setStartPage(i);
>>     textStripper.setEndPage(i);
>>
>>     page = textStripper.getText(pddDocument);
>>
>>     found = page .indexOf(searchtext);
>>
>>     if (found>0) {returnpage= i; break;}
>> }
>> ----------------
>>
>> Is there a way to speed up the search with Lucene? Can I use indexing to
>> solve this problem? Thanks.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>

--
View this message in context: http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
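
The record-keeping Erick describes (remembering where each page starts so that a hit can be mapped back to a page) can be sketched in plain Java without Lucene. This is only a rough sketch, assuming the text of every page has already been extracted once, for example with PDFTextStripper one page at a time. The names PageOffsetIndex, pageOf and findPages are made up for illustration and are not part of Lucene or PDFBox. The pages are joined into one string with a single space between them, which is also an assumption, so a phrase that starts on one page and ends on the next can still be found.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a Lucene or PDFBox class): maps text offsets to page numbers. */
final class PageOffsetIndex {
    private final String text;      // all pages joined into one string
    private final int[] pageStart;  // pageStart[i] = offset in text where page i+1 begins

    PageOffsetIndex(List<String> pages) {
        StringBuilder sb = new StringBuilder();
        pageStart = new int[pages.size()];
        for (int i = 0; i < pages.size(); i++) {
            pageStart[i] = sb.length();
            // Assumed separator: one space, so words on adjacent pages do not fuse together.
            sb.append(pages.get(i)).append(' ');
        }
        text = sb.toString();
    }

    /** Returns the 1-based page number that contains the given offset. */
    int pageOf(int offset) {
        int pos = Arrays.binarySearch(pageStart, offset);
        // Exact hit: the offset is the first character of page pos+1.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the 1-based number of the enclosing page.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Returns {startPage, endPage} of the first occurrence, or null if the phrase is absent. */
    int[] findPages(String phrase) {
        if (phrase.isEmpty()) {
            return null;
        }
        int at = text.indexOf(phrase);
        if (at < 0) {               // note: offset 0 is a valid hit
            return null;
        }
        return new int[] { pageOf(at), pageOf(at + phrase.length() - 1) };
    }

    public static void main(String[] args) {
        PageOffsetIndex idx =
                new PageOffsetIndex(Arrays.asList("alpha beta", "gamma delta", "epsilon"));
        System.out.println(Arrays.toString(idx.findPages("delta")));      // [2, 2]
        System.out.println(Arrays.toString(idx.findPages("beta gamma"))); // [1, 2]
        System.out.println(idx.findPages("zeta"));                        // null
    }
}
```

The page texts are extracted only once, and every later search is a single indexOf over the joined string plus a binary search over the page offsets. The check is `at < 0` rather than `found>0` because indexOf returns 0 for a match at the very beginning of the text, which the page-by-page loop quoted above would skip.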