[Nutch-dev] Re: servlet Cached.java

Stephan Lagraulet Thu, 24 Mar 2005 02:13:29 -0800

Hi all!
I already submitted the highlight patch to Ben Litchfield for PDFBox, so
the code is now in PDFBox (it should be released soon, in version
0.7.1).
Here is a code snippet that uses this new feature:


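        // Needs java.io.FileOutputStream, java.io.InputStream,
        // java.io.OutputStreamWriter and java.net.URL, plus the PDFBox classes
        // COSDocument, PDDocument, PDFParser and PDFHighlighter.
        COSDocument cosDoc = null;
        PDDocument pdDocument = null;
        InputStream is = null;
        OutputStreamWriter osw = null;

        try {
            // Open the output inside the try so a failure here is wrapped too
            osw = new OutputStreamWriter(new FileOutputStream(xmlFile));
            is = new URL(anURL).openStream();
            PDFParser parser = new PDFParser(is);
            parser.parse();
            cosDoc = parser.getDocument();
            pdDocument = new PDDocument(cosDoc);

            // Write the XML highlight file for the searched words
            PDFHighlighter pdfHighlighter = new PDFHighlighter();
            pdfHighlighter.generateXMLHighlight(
                    pdDocument,
                    highlightStrings,
                    osw);
        } catch (Exception e) {
            // CCRRuntimeException is our own unchecked wrapper exception
            throw new CCRRuntimeException(e);
        } finally {
            // Any of these can still be null if an earlier step failed
            if (is != null) is.close();
            if (cosDoc != null) cosDoc.close();
            if (pdDocument != null) pdDocument.close();
            if (osw != null) osw.close();
        }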
This generates the XML file used to highlight the searched words in
the PDF.
anURL contains the URL of the PDF to parse.
highlightStrings is the array containing the words to highlight.
xmlFile is the file to write.
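
Once the XML file exists, the viewer still has to be pointed at it. Below is
a minimal sketch, assuming the "#xml=" URL fragment described in Adobe's
Highlight File Format document; the class and method names are hypothetical
and are not part of PDFBox or Nutch, and the URLs are placeholders.

```java
// Hypothetical helper, not part of PDFBox or Nutch.
class PdfHighlightLink {

    // Builds a link that asks the Acrobat plug-in to open pdfUrl and apply
    // the highlight file served at highlightXmlUrl.
    // Assumption: the "#xml=" fragment syntax from Adobe's highlight file docs.
    static String highlightLink(String pdfUrl, String highlightXmlUrl) {
        // Drop any existing fragment so the result carries only one '#'
        int hash = pdfUrl.indexOf('#');
        String base = (hash >= 0) ? pdfUrl.substring(0, hash) : pdfUrl;
        return base + "#xml=" + highlightXmlUrl;
    }

    public static void main(String[] args) {
        // Placeholder URLs for illustration only
        System.out.println(highlightLink(
                "http://example.com/doc.pdf",
                "http://example.com/highlight/doc.xml"));
    }
}
```

A "View highlight" link in the results page could then use this string as its
href, with the XML served by whatever generates the highlight file.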

I don't really have time to integrate this code into Nutch, and I don't
really know where it belongs, but I think an "insider" could easily do
it with this!

Stephan Lagraulet

On Wed, March 23, 2005 18:15, John X said:
> On Wed, Mar 23, 2005 at 11:53:21AM +0100, Stephan Lagraulet wrote:
>> Hi!
>> We could do this for certain types of documents.
>> But for PDF files, I think we should use a new feature provided by PDFBox,
>> PdfHighlighter.
>> This is actually using an Acrobat feature described here :
>> http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf
>>
>> When the user selects the link "View cache" or "View highlight", we could
>> generate the XML highlight file and use it to highlight the hits directly
>> inside the PDF.
>> That's even better than Google cache...
>> Otherwise we could use Yahoo's solution (launch the search engine inside
>> Acrobat Reader, using the search parameters described in
>> http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf).
>>
>> I know these are only solutions for PDFs but that's the format I'm working
>> on right now and I think its use is widespread so it might be useful to
>> implement these features.
>
> Could you provide a code snippet or better a patch?
> Thanks,
>
> John
>
>>
>> Stephan
>>
>>
>> On Wed, March 23, 2005 11:19, Andrzej Bialecki said:
>> > John X wrote:
>> >> Hi, All,
>> >>
>> >> Attached please find servlet Cached.java that serves raw Content of
>> >> any mime type. Current cached.jsp handles mime type text/* only. If
>> >> no objection, it is going to be committed in a few days.
>> >
>> > I think this would be quite useful.
>> >
>> > However, what I think is ultimately needed to match the features of
>> > other search engines is not the ability to return the cached non-html
>> > content (there might even be copyright issues with this function...),
>> > but an html rendering of non-html content, a la Google's "View as
>> > HTML" function.
>> >
>> > --
>> > Best regards,
>> > Andrzej Bialecki
>> >   ___. ___ ___ ___ _ _   __________________________________
>> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> > http://www.sigram.com  Contact: info at sigram dot com
>> >
>> >
>>
>>
>>





