Re: [jira] Commented: (PDFBOX-586) Text Extraction on Android

Adam Mon, 07 Mar 2011 09:07:50 -0800

Eddie,

I'm also on the bouncycastle mailing list and saw that they have a version 
that was made specifically for embedded devices.  A different build is 
required because core Java files are missing from the JVM (this was done 
intentionally to reduce overhead).  If you use that jar file instead of 
the normal bouncycastle one, it may resolve your issues.
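
To see whether a given bouncycastle jar is actually picked up at run time, 
something like the following rough sketch could help. It loads a JCE 
provider by class name through reflection and registers it. The class name 
org.bouncycastle.jce.provider.BouncyCastleProvider is the provider class in 
the regular jar; whether the embedded build ships the same class is an 
assumption you would need to verify. ProviderLoader and tryRegister are 
names made up for this example.

```java
import java.security.Provider;
import java.security.Security;

public class ProviderLoader {

    /**
     * Tries to load a JCE provider class by name and register it.
     * Returns true if the provider is installed afterwards, false if the
     * class is missing, is not a Provider, or cannot be instantiated.
     */
    public static boolean tryRegister(String providerClassName) {
        try {
            Class<?> cls = Class.forName(providerClassName);
            Provider provider = (Provider) cls.getDeclaredConstructor().newInstance();
            // addProvider returns -1 when a provider with this name is already installed
            Security.addProvider(provider);
            return Security.getProvider(provider.getName()) != null;
        } catch (ReflectiveOperationException | ClassCastException | SecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name; adjust it if the embedded build uses a different one.
        String bc = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        System.out.println("Provider registered: " + tryRegister(bc));
    }
}
```

If this prints false, the jar is either not on the classpath or does not 
contain that provider class.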


---- 
Thanks,
Adam



From:
"Eddie B (JIRA)" <[email protected]>
To:
[email protected]
Date:
03/05/2011 08:57
Subject:
[jira] Commented: (PDFBOX-586) Text Extraction on Android




    [ 
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002998#comment-13002998
 
] 

Eddie B commented on PDFBOX-586:
--------------------------------

I have modified the open source PDFRenderer code to do much the same as 
this code: text extraction, specifically on Android devices.
I have run into a limitation though, and PDFBox seems to have the same 
limitation: no support for encrypted documents.
PDFs encrypted with either AES or RC4 cannot be parsed.  It appears to be 
a limitation of the ciphers that are available in the Android OS.
The encryption is added when password security is applied, for example to 
prevent editing (in Acrobat: File - Properties - Security).
Has anyone had any luck opening PDFs with AES or RC4 encryption on an 
Android device? I will try to post some small PDFs here for testing.
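
One way to narrow this down is to ask the runtime which cipher 
transformations it can actually provide. The sketch below uses only the 
standard javax.crypto API. CipherProbe, isAvailable and probe are names made 
up for this example, and the transformation strings are guesses at what a 
PDF library would request, not something taken from the PDFBox or 
PDFRenderer sources.

```java
import java.security.GeneralSecurityException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.crypto.Cipher;

public class CipherProbe {

    /** Returns true if some installed provider supports the transformation. */
    public static boolean isAvailable(String transformation) {
        try {
            Cipher.getInstance(transformation);
            return true;
        } catch (GeneralSecurityException e) {
            // NoSuchAlgorithmException or NoSuchPaddingException
            return false;
        }
    }

    /** Probes each transformation and keeps the results in the order given. */
    public static Map<String, Boolean> probe(String... transformations) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String t : transformations) {
            result.put(t, isAvailable(t));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed names: RC4 is registered as "ARCFOUR" by some providers.
        System.out.println(probe(
                "AES/CBC/PKCS5Padding", "AES/CBC/NoPadding", "RC4", "ARCFOUR"));
    }
}
```

Running the same probe on a desktop JVM and on the device should show 
whether the ciphers are really missing or whether the failure is elsewhere.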

> Text Extraction on Android
> --------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, 
EncryptedFileTest_AES.pdf, EncryptedFileTest_RC4.pdf, Eval.pdf, 
PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, 
PDFBOX586-internals.txt, TestPDFBox.zip, internals.pdf
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 
0.7.4, but for the same file and the same page, PDFBox 1.1.0 doesn't 
retrieve any text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text 
extraction?
> My code is like this :
>    PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>    int numberOfPages = document.getNumberOfPages();
>    resources = this.getResources();
>
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);   // ANDROID code here to get file
>    resourceGlyphList = R.raw.glyphlist;
>    InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);   // PDFBOX property file
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>    Properties properties = new Properties();
>    properties.load(rawResource);
>
>    PDFTextStripper stripper = new PDFTextStripper(properties);
>
>    stripper.setStartPage(pageNumber);   // 1 or any other page
>    stripper.setEndPage(pageNumber);     // same page as above
>    String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>    android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream()   or 
stripper.getTextForRegion( "class1" )  or stripper.writeText(doc, 
outputStream)
> I want the text as a String, not as a newly created file....

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

 



