[jira] Commented: (PDFBOX-586) Text Extraction on Android

Bernard (JIRA) Mon, 02 Aug 2010 01:03:44 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894485#action_12894485
 ]


Bernard commented on PDFBOX-586:
--------------------------------

The Android executable is big: more than 6 MB! For an embedded app, and taking 
into account that the app may do other things than just extract text, it 
would be good to reduce the binary file size.

Some optimisation can be done in the .java sources: remove all code not 
related to text extraction (colors, drawing, etc.). It is a slow process and 
requires a lot of non-regression tests.

It seems that files such as:
ar_eg.xml
adobe_cns1_3
kscms_uhc_h
unicns_utf8_h
V
uniks_utf8_v
...
are not used (I didn't have to change any code in the PDFBox sources, and 
extraction still works). Can they be removed?


Note: I didn't try all languages/fonts. I only tested 20 PDF files 
(European, Russian, Vietnamese).
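
To see which bundled files actually account for the size, the entries of the 
packaged archive can be listed by size. The following is only a minimal 
sketch, run on the desktop rather than on the device: the class name 
ArchiveSizes and its methods are made up for illustration, it uses only 
java.util.zip, and it reports uncompressed sizes, which will differ from the 
compressed size inside the package.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of PDFBox: lists archive entries by size.
public class ArchiveSizes {

    // Reads every entry of a zip/jar/apk stream and counts its uncompressed bytes.
    static Map<String, Long> entrySizes(InputStream archive) throws IOException {
        Map<String, Long> sizes = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                }
                if (!entry.isDirectory()) {
                    sizes.put(entry.getName(), total);
                }
            }
        }
        return sizes;
    }

    // Returns the entries sorted so that the largest one comes first.
    static List<Map.Entry<String, Long>> largestFirst(Map<String, Long> sizes) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(sizes.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the archive to inspect.
        try (InputStream in = new FileInputStream(args[0])) {
            for (Map.Entry<String, Long> e : largestFirst(entrySizes(in))) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }
}
```

Whether a large entry is safe to drop still has to be checked against test 
documents; the listing only shows where the bytes are.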

> Text Extraction on Android
> --------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>             Fix For: 1.1.0
>
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, Eval.pdf, internals.pdf, 
> PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, 
> PDFBOX586-internals.txt, TestPDFBox.zip
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, 
> but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any 
> text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code is like this:
>     PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>     int numberOfPages = document.getNumberOfPages();
>
>     // Android code here to get the resource files
>     resources = this.getResources();
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
>     resourceGlyphList = R.raw.glyphlist;
>     // PDFBox property file
>     InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>     Properties properties = new Properties();
>     properties.load(rawResource);
>
>     PDFTextStripper stripper = new PDFTextStripper(properties);
>
>     stripper.setStartPage(pageNumber);  // 1 or any other page
>     stripper.setEndPage(pageNumber);    // same page as above
>     String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>     android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), 
> stripper.getTextForRegion( "class1" ), or stripper.writeText(doc, 
> outputStream)?
> I want the text as a String, not as a newly created file.
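
On getting a String rather than a file: if a method writes its output to a 
java.io.Writer, passing a StringWriter keeps the text in memory. The sketch 
below only illustrates that pattern. writePageText is a made-up stand-in for a 
call such as stripper.writeText(doc, writer), and it assumes that call accepts 
a Writer; no PDFBox class is used here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Illustration only: collects Writer output into a String.
public class InMemoryText {

    // Hypothetical stand-in for a library call that writes text to a Writer.
    static void writePageText(Writer out) throws IOException {
        out.write("Page : 1\n");
        out.write("extracted text");
    }

    // Captures everything written to the Writer and returns it as a String.
    static String collect() throws IOException {
        StringWriter buffer = new StringWriter();
        writePageText(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect());
    }
}
```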

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
