It depends (tm). Do you want to permanently index this content and search it multiple times, or is each search a one-off? If the latter, I'd look for packages specific to handling PDF files, although since Reader takes forever to search a document, I suspect there's not much joy there.

If you want to parse the file once and search it many times, then yes, Lucene can help a lot. You could conceivably do this in an in-memory index if you didn't want a permanent copy. In this scheme, you'd index the file before the first search, then use the in-memory index until you were done searching (assuming you wanted to search for different terms multiple times). You'd have to do some record-keeping to remember the start and end offset of each page, so you could deal with the case where a phrase you search for starts on one page and ends on another.
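The page record-keeping might look something like the sketch below. It assumes the text of each page has already been extracted (for example with PDFTextStripper), and it uses a plain substring search over the concatenated text as a stand-in for a real Lucene index. The class name PageOffsetIndex and its methods are made up for illustration; they are not part of Lucene or PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: concatenates page texts once and remembers where each page starts.
class PageOffsetIndex {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> pageStarts = new ArrayList<>();

    // pageTexts.get(0) is assumed to be the text of page 1, and so on.
    PageOffsetIndex(List<String> pageTexts) {
        for (String page : pageTexts) {
            pageStarts.add(text.length()); // start offset of this page in the combined text
            text.append(page);
        }
    }

    // 1-based page on which the first match starts, or -1 if there is no match.
    int findStartPage(String query) {
        if (query.isEmpty()) return -1;
        int offset = text.indexOf(query);
        return offset < 0 ? -1 : pageOf(offset);
    }

    // 1-based page on which the first match ends, or -1 if there is no match.
    int findEndPage(String query) {
        if (query.isEmpty()) return -1;
        int offset = text.indexOf(query);
        return offset < 0 ? -1 : pageOf(offset + query.length() - 1);
    }

    // Binary search for the last page whose start offset is <= the given offset.
    private int pageOf(int offset) {
        int lo = 0;
        int hi = pageStarts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (pageStarts.get(mid) <= offset) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo + 1; // convert list index to 1-based page number
    }

    public static void main(String[] args) {
        PageOffsetIndex index = new PageOffsetIndex(
                Arrays.asList("the quick brown", " fox jumps", " over the lazy dog"));
        // "brown fox" starts on page 1 and ends on page 2.
        System.out.println(index.findStartPage("brown fox")); // prints 1
        System.out.println(index.findEndPage("brown fox"));   // prints 2
    }
}
```

Because the pages are joined into one string before searching, a phrase that straddles a page break is still found, and the two lookups tell you which pages it touches. With Lucene you would get match offsets from the index rather than from indexOf, but the offset-to-page mapping would work the same way.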
If this is off base, perhaps you could provide more details...

Erick

On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <idrag...@gmail.com> wrote:

> Hi,
>
> I have to search a single PDF document for a requested string and, if that
> string is found, I need to return the page number where it was found.
> The requested string can be anything in the PDF document.
>
> It is a big document (about 5000 pages), so I'm asking if that is possible
> with Lucene.
>
> I'm using the pdfbox classes and I found a way to do it (searching with
> indexOf page by page), but it is too slow:
>
> PDDocument pddDocument = PDDocument.load(f);
> PDFTextStripper textStripper = new PDFTextStripper();
> // getEndPage() defaults to Integer.MAX_VALUE, so ask the document instead
> int lastpage = pddDocument.getNumberOfPages();
> String page = null;
> int found = 0;
>
> for (int i = 1; i <= lastpage; i++) {  // pages are 1-based; include the last one
>     textStripper.setStartPage(i);
>     textStripper.setEndPage(i);
>
>     page = textStripper.getText(pddDocument);
>
>     found = page.indexOf(searchtext);
>
>     if (found >= 0) { returnpage = i; break; }  // index 0 is a valid match
> }
> ----------------
>
> Is there a way to speed up the search with Lucene? Can I use indexing to
> solve this problem? Thanks.
>
> --
> View this message in context:
> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.