Hi all!

I have already submitted the highlight patch to Ben Litchfield for PDFBox, so the code is now in PDFBox (it should be released in version 0.7.1 soon). Here is a code snippet that uses the new feature:

    COSDocument cosDoc = null;
    PDDocument pdDocument = null;
    InputStream is = null;
    OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(xmlFile));
    try {
        // Fetch and parse the PDF
        is = new URL(anURL).openStream();
        PDFParser parser = new PDFParser(is);
        parser.parse();
        cosDoc = parser.getDocument();
        pdDocument = new PDDocument(cosDoc);
        // Write the XML highlight file for the searched words
        PDFHighlighter pdfHighlighter = new PDFHighlighter();
        pdfHighlighter.generateXMLHighlight(pdDocument, highlightStrings, osw);
    } catch (Exception e) {
        throw new CCRRuntimeException(e);
    } finally {
        // Any of these may still be null if an earlier step failed
        if (is != null) is.close();
        if (cosDoc != null) cosDoc.close();
        if (pdDocument != null) pdDocument.close();
        osw.close();
    }

This generates the XML file used to highlight the searched words in the PDF.
anURL contains the URL of the PDF to parse.
highlightStrings is the array containing the words to highlight.
xmlFile is the file to build.

I don't really have time to insert this code into Nutch, as I don't know where it should go, but I think with this it could easily be done by an "insider"!

Stephan Lagraulet

On Wed, March 23, 2005 18:15, John X said:
> On Wed, Mar 23, 2005 at 11:53:21AM +0100, Stephan Lagraulet wrote:
>> Hi!
>> We could do this for certain type of documents.
>> But for PDF files, I think we should use a new feature provided by PDFBox,
>> PdfHighlighter.
>> This is actually using an Acrobat feature described here :
>> http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
>>
>> When the user selects the link "View cache" or "View highlight", we could
>> generate the XML highlight file and use it to highlight the hits directly
>> inside the PDF.
>> That's even better than Google cache...
>> We could otherwise use Yahoo solution (launch the search engine inside Acrobat reader -
>> http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf / search parameters).
>>
>> I know these are only solutions for PDFs but that's the format I'm working
>> on right now and I think its use is widespread so it might be useful to implement these features.
>
> Could you provide a code snippet or better a patch?
> Thanks,
>
> John
>
>>
>> Stephan
>>
>>
>> On Wed, March 23, 2005 11:19, Andrzej Bialecki said:
>> > John X wrote:
>> >> Hi, All,
>> >>
>> >> Attached please find servlet Cached.java that serves raw Content of
>> >> any mime type. Current cached.jsp handles mime type text/* only.
>> >> If no objection, it is going to be committed in a few days.
>> >
>> > I think this would be quite useful.
>> >
>> > However, what I think is ultimately needed to match the features of
>> > other search engines is not the ability to return the cached non-html
>> > content (there might even be copyright issues with this function...),
>> > but an html rendering of non-html content, a la Google's "View as
>> > HTML" function.
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki
>> >  ___. ___ ___ ___ _ _   __________________________________
>> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> > http://www.sigram.com  Contact: info at sigram dot com
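
For the "View highlight" link discussed in the quoted thread above, here is a minimal sketch of how the link could be built once the XML highlight file exists. It assumes the `#xml=` and `#search=` URL fragments described in the two Adobe documents linked above. The class name HighlightLinks, its method names, and the example.com URLs are hypothetical and are not part of PDFBox or Nutch.

```java
// Hypothetical helper for illustration only; not part of PDFBox or Nutch.
class HighlightLinks {

    // Builds a link that asks Acrobat to load an XML highlight file
    // (for example one written by PDFHighlighter) for the given PDF.
    static String withHighlightFile(String pdfUrl, String highlightXmlUrl) {
        return pdfUrl + "#xml=" + highlightXmlUrl;
    }

    // Builds a link that asks Acrobat to run its own search for the words,
    // using the "search" open parameter. No escaping is done here; that is
    // left to the caller in this sketch.
    static String withSearch(String pdfUrl, String[] words) {
        return pdfUrl + "#search=\"" + String.join(" ", words) + "\"";
    }

    public static void main(String[] args) {
        String pdf = "http://example.com/doc.pdf";       // assumed PDF location
        String xml = "http://example.com/highlight.xml"; // assumed highlight file location
        System.out.println(withHighlightFile(pdf, xml));
        System.out.println(withSearch(pdf, new String[] {"nutch", "pdf"}));
    }
}
```

The first form needs the highlight file to be generated and served per query; the second needs nothing on the server side but leaves the matching to Acrobat's own search.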