[jira] Commented: (PDFBOX-586) Text Extraction on Android

JIRA Fri, 25 Jun 2010 05:32:25 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882548#action_12882548
 ]


Andreas Lehmkühler commented on PDFBOX-586:
-------------------------------------------

Possibly unneeded jars:
ICU4J is only needed if you want to support RTL languages such as Arabic. The 
bouncycastle jars are needed for encryption/decryption.

resources:
A lot of the cmap files aren't used yet, and I'm sure some of them never will 
be. However, I think in most cases it's safe to delete the cmap files from 
removed_cmaps.jar; see PDFBOX-494 for further details.


> Text Extraction on Android
> --------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, Eval.pdf, internals.pdf, 
> PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, 
> PDFBOX586-internals.txt
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, 
> but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any 
> text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code is like this:
>    PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>    int numberOfPages = document.getNumberOfPages();
>    // ANDROID code here to get file
>    resources = this.getResources();
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
>    resourceGlyphList = R.raw.glyphlist;
>    // PDFBOX property file
>    InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);
>    android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>    Properties properties = new Properties();
>    properties.load(rawResource);
>
>    PDFTextStripper stripper = new PDFTextStripper(properties);
>
>    stripper.setStartPage(pageNumber);  // 1 or any other page
>    stripper.setEndPage(pageNumber);    // same page as above
>    String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>    android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), 
> stripper.getTextForRegion( "class1" ), or stripper.writeText(doc, 
> outputStream)?
> I want the text as a String, not as a newly created file.
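
On getting the text as a String rather than a file: the getText(document) call in the quoted snippet already returns a String. If a Writer-based call such as writeText is tried instead, the output can probably still be kept in memory by passing a java.io.StringWriter. The sketch below illustrates that idea with the standard library only. TextSource, captureAsString and the sample text are hypothetical stand-ins for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class TextCapture {

    /** Hypothetical stand-in for any API that emits its text to a Writer. */
    interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    /** Collects everything the source writes into an in-memory String. */
    static String captureAsString(TextSource source) {
        StringWriter buffer = new StringWriter();
        try {
            source.writeTo(buffer);
        } catch (IOException e) {
            // StringWriter itself does not fail; this only wraps errors from the source.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumption: the lambda plays the role of a Writer-based extraction call.
        String text = captureAsString(out -> out.write("Page : 1, sample text"));
        System.out.println(text);
    }
}
```

No file is created at any point; the String is built entirely in memory, which on a device with limited storage access may be preferable to writing a temporary file and reading it back.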

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
