Re: search trough single pdf document - return page number

Robert Muir Thu, 15 Oct 2009 07:57:54 -0700

If you just have a single PDF document (which seems to be the case from the
subject line) and you want to retrieve pages, consider splitting the PDF
into single pages.


PDFBox has some functionality for this (its Splitter utility class).

Then index each page as a separate Lucene document (so you will have about
5000 Lucene documents, one per page), storing the page number in a field.
That way a search returns page numbers directly.
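
To make the idea concrete, here is a rough sketch in plain Java rather than
Lucene itself, so it runs without any libraries. It treats each page as its
own tiny document and keeps a map from term to the page numbers containing
it, which is roughly what you get from Lucene when each page is a document
carrying a stored page-number field. The class PageIndexSketch and its
methods addPage and pagesContainingAll are made-up names for illustration,
and the tokenizer is a crude stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: a page-per-document inverted index.
// This is NOT Lucene; it just mimics "one document per page".
class PageIndexSketch {
    // term -> sorted set of 1-based page numbers that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index the text of a single page under its page number.
    void addPage(int pageNumber, String pageText) {
        for (String term : tokenize(pageText)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(pageNumber);
        }
    }

    // Return the pages that contain every term of the query (AND semantics).
    SortedSet<Integer> pagesContainingAll(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> pages = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(pages);   // copy so the index is not modified
            } else {
                result.retainAll(pages);         // intersect with the next term's pages
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        PageIndexSketch index = new PageIndexSketch();
        // In practice each string would be the text extracted from one PDF page.
        index.addPage(1, "Introduction to indexing");
        index.addPage(2, "Lucene indexing basics");
        index.addPage(3, "Searching with Lucene");
        System.out.println(index.pagesContainingAll("lucene"));
        System.out.println(index.pagesContainingAll("lucene indexing"));
    }
}
```

With real Lucene you would instead extract each page's text with
PDFTextStripper (setStartPage/setEndPage, as in your loop), add it as one
document with the page number in a stored field, and read that field back
from each hit.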

On Thu, Oct 15, 2009 at 10:43 AM, IvanDrago <[email protected]> wrote:

>
> Thanks for the reply Erick.
>
> I would like to permanently index this content and search it multiple
> times for different terms, so I would like a permanent copy.
>
> My problem is that I don't know how to retrieve the page number where the
> searched string was found, so if you could help with that, that would be
> great.
>
> // I would start like this:
> // This part of code would create the index, right?
> Document luceneDocument = LucenePDFDocument.getDocument( f );
> IndexWriter iwriter = new IndexWriter(index_dir, new StandardAnalyzer(),
> true);
> iwriter.addDocument(luceneDocument);
> iwriter.close();
>
> //and now for the search:
> Directory fsDir = FSDirectory.getDirectory(index_dir, false);
> IndexSearcher ind_search = new IndexSearcher(fsDir);
>
> // I'm not sure if "fieldname" would be the string that I'm searching for?
> QueryParser parser = new QueryParser("fieldname", new StandardAnalyzer());
> Query query = parser.parse(q);
>
> Hits hits = ind_search.search(query);
>
> // and I'm stuck here. I don't know how to retrieve the page number.
>
> Erick Erickson wrote:
> >
> > It depends (tm). Do you want to permanently index this content and search
> > it multiple times, or is each search a one-off? If the latter, I'd look
> > for packages specific to handling PDF files, although since Reader takes
> > forever to search a document, I suspect there's not much joy there.
> > If you want to parse the file once and search it many times, then yes,
> > Lucene can help a lot. You could conceivably do this in a memory index if
> > you didn't want a permanent copy. In this scheme, you'd index the file
> > before the first search, then use the in-memory index until you were done
> > searching (assuming you wanted to search for different terms multiple
> > times). You'd have to do some record-keeping to remember the start and
> > end offsets of each page so you could deal with the case where a phrase
> > you search for starts on one page and ends on another.
> >
> > If this is off base, perhaps you could provide more details...
> >
> > Erick
> >
> > On Thu, Oct 15, 2009 at 5:06 AM, IvanDrago <[email protected]> wrote:
> >
> >>
> >> Hi,
> >>
> >> I have to search a single PDF document for a requested string and, if
> >> that string is found, return the page number where it was found.
> >> The requested string can be anything in the PDF document.
> >>
> >> It is a big document (about 5000 pages), so I'm asking whether that is
> >> possible with Lucene.
> >>
> >> I'm using the PDFBox library and I found a way to do it (a substring
> >> search, page by page), but it is too slow:
> >>
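> >>        PDDocument pddDocument = PDDocument.load(f);
> >>        PDFTextStripper textStripper = new PDFTextStripper();
> >>        // getEndPage() is Integer.MAX_VALUE unless set, so ask the document
> >>        int lastpage = pddDocument.getNumberOfPages();
> >>        int returnpage = -1;   // stays -1 if the text is never found
> >>
> >>        // pages are 1-based; <= so the last page is searched too
> >>        for (int i = 1; i <= lastpage; i++) {
> >>            // restrict text extraction to page i only
> >>            textStripper.setStartPage(i);
> >>            textStripper.setEndPage(i);
> >>            String page = textStripper.getText(pddDocument);
> >>
> >>            // indexOf returns 0 for a match at the very start, so test >= 0
> >>            if (page.indexOf(searchtext) >= 0) { returnpage = i; break; }
> >>        }
> >>        pddDocument.close();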
> >> ----------------
> >>
> >> Is there a way to speed up the search with Lucene? Can I use indexing to
> >> solve this problem? Thanks.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25905217.html
> >> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/search-trough-single-pdf-document---return-page-number-tp25905217p25909908.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>


-- 
Robert Muir
[email protected]
