[jira] Commented: (PDFBOX-586) Text Extraction on Android

Eddie B (JIRA) Sat, 05 Mar 2011 08:57:16 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002998#comment-13002998
 ]


Eddie B commented on PDFBOX-586:
--------------------------------

I have modified the open source PDFRenderer code to do much the same as this 
code: text extraction, specifically on Android devices.
I have run into a limitation though, and PDFBox seems to share it: no support 
for encrypted documents.
PDFs encrypted with either AES or RC4 cannot be parsed. This appears to be a 
limitation of the ciphers that are available in the Android OS.
The encryption is added when password security is applied, for example to 
prevent editing (in Acrobat: File - Properties - Security).
Has anyone had any luck opening PDFs with AES or RC4 encryption on an Android 
device? I will try to post some small PDFs here for testing.
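
One way to check the cipher hypothesis on a given device is to ask the platform 
directly which transformations it can provide. The following is only a minimal 
sketch: the class name CipherProbe and the list of transformation names are 
assumptions chosen for illustration, and this is not code from PDFBox or 
PDFRenderer. A transformation reported as missing suggests, but does not prove, 
that decryption fails for that reason.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;

// Hypothetical helper: reports whether the running platform can supply a cipher.
public class CipherProbe {

    // Returns true if Cipher.getInstance succeeds for the given transformation.
    public static boolean isCipherAvailable(String transformation) {
        try {
            Cipher.getInstance(transformation);
            return true;
        } catch (GeneralSecurityException e) {
            // Covers NoSuchAlgorithmException and NoSuchPaddingException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Candidate names are assumptions; providers may register other aliases.
        String[] candidates = {
            "AES/CBC/PKCS5Padding", "AES/CBC/NoPadding", "RC4", "ARCFOUR"
        };
        for (String name : candidates) {
            System.out.println(name + " available: " + isCipherAvailable(name));
        }
    }
}
```

On Android the same method could be called from an Activity and the results 
written with android.util.Log instead of System.out.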

> Text Extraction on Android
> --------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, 
> EncryptedFileTest_AES.pdf, EncryptedFileTest_RC4.pdf, Eval.pdf, 
> PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, 
> PDFBOX586-internals.txt, TestPDFBox.zip, internals.pdf
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, 
> but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any 
> text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code looks like this:
>    PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>    int numberOfPages = document.getNumberOfPages();
>    resources = this.getResources();
>
>    // ANDROID code here to get file
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
>    resourceGlyphList = R.raw.glyphlist;
>    // PDFBOX property file
>    InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>    Properties properties = new Properties();
>    properties.load(rawResource);
>
>    PDFTextStripper stripper = new PDFTextStripper(properties);
>
>    stripper.setStartPage(pageNumber);   // 1 or any other page
>    stripper.setEndPage(pageNumber);     // same page as above
>    String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>    android.util.Log.d(TEST_PDFBOX,
>            "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), 
> stripper.getTextForRegion( "class1" ), or 
> stripper.writeText(doc, outputStream)?
> I want the text as a String, not as a newly created file.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
