[jira] Commented: (PDFBOX-586) Text Extraction on Android

JIRA Tue, 01 Jun 2010 11:16:02 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874133#action_12874133
 ]


Andreas Lehmkühler commented on PDFBOX-586:
-------------------------------------------

It seems that this discussion is leading toward some sort of modularization of PDFBox.

Probably we should think about different modules like:

- core (as main module)
- textextraction
- rendering
- search
- examples
- ...

@Bernard
I'm using PDFBox in another project and I've also minimized the dependencies 
by deleting unneeded code. As a good start you should

- delete org.apache.pdfbox.examples.*
- delete all classes used for testing
- delete org.apache.pdfbox.ant.* (no more runtime dependency on ant)

If you don't need Lucene integration:
- delete org.apache.pdfbox.searchengine.* (no more runtime dependency on lucene)

I've also removed the dependency on the encryption jars, but I can't remember 
exactly which classes have to be modified. All I know is that it wasn't that 
hard.

If you don't need rendering:
- delete org.apache.pdfbox.util.operator.pagedrawer.*
- remove all classes from org.apache.pdfbox.util.operator.* which aren't needed 
for text extraction (see PDFTextStripper.properties)
- delete org.apache.pdfbox.pdfviewer.* (because of the missing class PageDrawer 
you have to remove convertToImage and print from PDPage)
- remove org.apache.pdfbox.PDFDebugger
- remove org.apache.pdfbox.PDFReader
- remove org.apache.pdfbox.PDFToImage
- ....
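
To find out which operator classes text extraction still needs, one option is 
to read the class names out of PDFTextStripper.properties. The following is 
only a sketch: it assumes the file is a plain java.util.Properties file whose 
values are operator class names. The class name NeededOperators and the 
sample entries in main are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox.
final class NeededOperators {

    // Collects the distinct, non-empty property values, assumed here to be
    // the class names of the operators that text extraction uses.
    static Set<String> neededClasses(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        Set<String> classes = new TreeSet<String>();
        for (String key : props.stringPropertyNames()) {
            String value = props.getProperty(key).trim();
            if (!value.isEmpty()) {
                classes.add(value);
            }
        }
        return classes;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample entries; in practice, open the real
        // PDFTextStripper.properties file instead.
        String sample = "BT=example.operator.BeginText\n"
                      + "ET=example.operator.EndText\n";
        for (String name : neededClasses(new StringReader(sample))) {
            System.out.println(name);
        }
    }
}
```

Classes under org.apache.pdfbox.util.operator.* that do not show up in that 
set are candidates for removal, provided none of the remaining classes 
reference them.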

Once you have removed all of that, it should be easier to move on and identify 
more classes which are no longer used after deleting all the stuff above.

> Text Extraction on Android
> --------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, Eval.pdf, internals.pdf, 
> PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, 
> PDFBOX586-internals.txt
>
>
> Hi,
> I have noticed that I can extract text from some PDF files in PDFBox 0.7.4, 
> but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any 
> text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code is like this :
>     PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>     int numberOfPages = document.getNumberOfPages();
>     resources = this.getResources();
>
>     // ANDROID code here to get file
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
>     resourceGlyphList = R.raw.glyphlist;
>     // PDFBOX property file
>     InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>     Properties properties = new Properties();
>     properties.load(rawResource);
>
>     PDFTextStripper stripper = new PDFTextStripper(properties);
>
>     stripper.setStartPage(pageNumber);    // 1 or any other page
>     stripper.setEndPage(pageNumber);      // same page as above
>     String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>     android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), 
> stripper.getTextForRegion( "class1" ), or stripper.writeText(doc, 
> outputStream)?
> I want the text as a String, not as a newly created file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
