Great! — John
> On 8 May 2015, at 14:49, Tilman Hausherr <[email protected]> wrote:
>
> On 08.05.2015 at 23:47, John Hewson wrote:
>> Can’t we make PDFBox open the document with an empty password? What’s the
>> story for 2.0?
>
> In 2.0 it opens immediately. Same in 1.8 when using the loadNonSeq().
>
> Tilman
>
>>
>> -- John
>>
>>> On 8 May 2015, at 08:52, Tilman Hausherr <[email protected]> wrote:
>>>
>>> On 08.05.2015 at 17:51, Clemens Wyss DEV wrote:
>>>> Thx for the very fast answer.
>>>>> new StandardDecryptionMaterial( password );
>>>> I have no password. The pdf is a public user manual.
>>> Use an empty password :-)
>>>
>>> Tilman
>>>
>>>>> That is TIKA, isn't it?
>>>> True
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Tilman Hausherr [mailto:[email protected]]
>>>> Sent: Friday, 8 May 2015 17:44
>>>> To: [email protected]
>>>> Subject: Re: extracting text from an "encrypted" pdf
>>>>
>>>> On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
>>>>> When I try to extract an "encrypted" (which can be read in AcrobatReader)
>>>>> document with:
>>>>>
>>>>> pdfDocument = PDDocument.load( is );
>>>> add
>>>> if( document.isEncrypted() )
>>>> {
>>>>     StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
>>>>     document.openProtection( sdm );
>>>> }
>>>>
>>>> or use loadNonSeq()
>>>>
>>>>> PDFTextStripper pdfStripper = new PDFTextStripper();
>>>>> parsedText = pdfStripper.getText( pdfDocument );
>>>>>
>>>>> I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser -
>>>>> Document is encrypted" is logged.
>>>>>
>>>>> When, on the other hand, I do:
>>>>>
>>>>> ContentHandler handler = new BodyContentHandler( -1 );
>>>>> ParseContext context = new ParseContext();
>>>>> parser = new AutoDetectParser();
>>>>> context.set( Parser.class, parser );
>>>>> parser.parse( is, handler, metadata, context );
>>>>> parsedText = handler.toString();
>>>>>
>>>>> I get to see the text/content of the very pdf.
>>>>>
>>>>> 1) What is the preferred way to extract text from a
>>>>> pdf("-that-can-be-read-in-AcrobatReader")?
>>>> https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date
>>>>
>>>>> 2) Does the second approach possibly return "more than text"? Blobs?
>>>>> Binary data?
>>>> That is TIKA, isn't it?
>>>>
>>>> Tilman
>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
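
For reference, the advice in this thread can be put together as a minimal sketch. It assumes PDFBox 1.8.x on the classpath (where PDFTextStripper lives in org.apache.pdfbox.util); the class name EmptyPasswordExtract and the method name extractText are made up for illustration and are not part of PDFBox.

```java
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.util.PDFTextStripper;

// Hypothetical helper, assuming the PDFBox 1.8.x API.
public class EmptyPasswordExtract {

    // Extracts text from a PDF that Acrobat Reader opens without a prompt,
    // i.e. one that is encrypted with an empty user password.
    public static String extractText(InputStream in) throws Exception {
        PDDocument document = PDDocument.load(in);
        try {
            if (document.isEncrypted()) {
                // "Use an empty password": decrypt with "" before stripping.
                document.openProtection(new StandardDecryptionMaterial(""));
            }
            return new PDFTextStripper().getText(document);
        } finally {
            // Always release the document's resources.
            document.close();
        }
    }
}
```

A call such as EmptyPasswordExtract.extractText(is) would replace the load/getText pair from the original question. According to Tilman's reply, loadNonSeq() in 1.8 and the regular load in 2.0 open such documents directly, so the explicit openProtection step should only be needed with the plain 1.8 load().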
