Re: extracting text from an "encrypted" pdf

John Hewson Fri, 08 May 2015 14:54:07 -0700

Great!

— John


> On 8 May 2015, at 14:49, Tilman Hausherr <[email protected]> wrote:
> 
> On 08.05.2015 at 23:47, John Hewson wrote:
>> Can’t we make PDFBox open the document with an empty password? What’s the 
>> story for 2.0?
> 
> In 2.0 it opens immediately. The same holds in 1.8 when using loadNonSeq().
> 
> Tilman
> 
>> 
>> — John
>> 
>>> On 8 May 2015, at 08:52, Tilman Hausherr <[email protected]> wrote:
>>> 
>>> On 08.05.2015 at 17:51, Clemens Wyss DEV wrote:
>>>> Thx for the very fast answer.
>>>>> new StandardDecryptionMaterial( password );
>>>> I have no password. The pdf is a public user manual.
>>> Use an empty password :-)
>>> 
>>> Tilman
>>> 
>>>>> That is TIKA, isn't it?
>>>> True
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Tilman Hausherr [mailto:[email protected]]
>>>> Sent: Friday, 8 May 2015 17:44
>>>> To: [email protected]
>>>> Subject: Re: extracting text from an "encrypted" pdf
>>>> 
>>>> On 08.05.2015 at 17:36, Clemens Wyss DEV wrote:
>>>>> When I try to extract text from an "encrypted" document (one that can
>>>>> still be read in Acrobat Reader) with:
>>>>> 
>>>>> pdfDocument = PDDocument.load( is );
>>>> add
>>>> if( document.isEncrypted() )
>>>> {
>>>>   StandardDecryptionMaterial sdm = new StandardDecryptionMaterial( password );
>>>>   document.openProtection( sdm );
>>>> }
>>>> 
>>>> or use loadNonSeq()
>>>> 
>>>>> PDFTextStripper pdfStripper = new PDFTextStripper();
>>>>> parsedText = pdfStripper.getText( pdfDocument );
>>>>> 
>>>>> I get an empty string, and " o.apache.pdfbox.pdfparser.PDFParser - 
>>>>> Document is encrypted" is logged.
>>>>> 
>>>>> When, on the other hand, I do:
>>>>> 
>>>>> ContentHandler handler = new BodyContentHandler( -1 );
>>>>> ParseContext context = new ParseContext();
>>>>> parser = new AutoDetectParser();
>>>>> context.set( Parser.class, parser );
>>>>> parser.parse( is, handler, metadata, context );
>>>>> parsedText = handler.toString();
>>>>> 
>>>>> I do get the text content of that same pdf.
>>>>> 
>>>>> 1) What is the preferred way to extract text from a 
>>>>> pdf("-that-can-be-read-in-AcrobatReader")?
>>>> https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/ExtractText.java?view=markup&sortby=date
>>>> 
>>>>>   2) Does the second approach possibly return "more than text"? Blobs? 
>>>>> Binary data?
>>>> That is TIKA, isn't it?
>>>> 
>>>> Tilman
>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>> 
> 
> 
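The advice in this thread comes down to one decision: check whether the document is encrypted and, if no password is known, try the empty user password. Below is a minimal sketch of that decision in plain Java. It is not PDFBox code: ProtectedDocument, tryOpen, openForReading and FakeDocument are hypothetical names introduced only for illustration. In PDFBox 1.8 the two interface methods would presumably wrap document.isEncrypted() and document.openProtection( new StandardDecryptionMaterial( password ) ) as quoted above.

```java
/** Sketch: open an encrypted document, falling back to the empty user password. */
final class EmptyPasswordFallback {

    /**
     * Hypothetical stand-in for the library calls quoted in the thread.
     * tryOpen is assumed to return true when the password is accepted.
     */
    interface ProtectedDocument {
        boolean isEncrypted();
        boolean tryOpen(String password);
    }

    /**
     * Returns true if the document is readable afterwards: either it was not
     * encrypted at all, or the chosen password was accepted.
     */
    static boolean openForReading(ProtectedDocument document, String knownPassword) {
        if (!document.isEncrypted()) {
            return true; // nothing to decrypt
        }
        // No password known: try the empty user password, as suggested in the thread.
        String password = (knownPassword == null) ? "" : knownPassword;
        return document.tryOpen(password);
    }

    /** In-memory fake used only to exercise the logic; not a PDFBox class. */
    static final class FakeDocument implements ProtectedDocument {
        private final String userPassword; // null means "not encrypted"
        boolean opened = false;

        FakeDocument(String userPassword) {
            this.userPassword = userPassword;
        }

        @Override
        public boolean isEncrypted() {
            return userPassword != null;
        }

        @Override
        public boolean tryOpen(String password) {
            opened = userPassword != null && userPassword.equals(password);
            return opened;
        }
    }

    public static void main(String[] args) {
        // Encrypted with an empty user password, like the public manual in the thread.
        FakeDocument manual = new FakeDocument("");
        System.out.println("readable: " + openForReading(manual, null));
    }
}
```

A file that opens in Acrobat Reader without a password prompt is presumably in exactly this situation, which is why passing an empty string is enough here. If a real password is known, pass it as the second argument.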
