[
https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12873972#action_12873972
]
Bernard edited comment on PDFBOX-586 at 6/1/10 8:26 AM:
--------------------------------------------------------
Hi,
I have just tried the .jar on 3 'bad' PDFs and it works fine. I wonder why the
sources (from the .zip) didn't...
As I don't have the IBM (encryption) lib (could you add a link to it on the
download page?), I had to comment all of that out.
But I don't care about encrypted PDFs for now.
I have also commented out the Bi-Di text handling: no need for it yet, and I
don't have the lib source.
As I run on Android, I had to comment out 20% of the Font/AWT-related stuff. I
need the characters; I don't care about viewing the PDF page.
After all those changes, my PDFs were successfully opened by PDFBox 0.7.3, but
some PDFs didn't work with PDFBox 1.1.0.
I will continue investigating... (and not working on my app :-( ):
http://bsegonnes.free.fr/multireader/en_multireader.html
> Text Extraction Regression ?
> ----------------------------
>
> Key: PDFBOX-586
> URL: https://issues.apache.org/jira/browse/PDFBOX-586
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.1.0
> Environment: Windows XP + Eclipse + PDFBox sources
> Reporter: Bernard
> Attachments: ASEB-Camping_Car_ou_Bateau.pdf, Eval.pdf, internals.pdf,
> PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt,
> PDFBOX586-internals.txt
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 0.7.4,
> but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any
> text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code is like this:
> PDDocument document = PDDocument.load("/sdcard/internals.pdf");
> int numberOfPages = document.getNumberOfPages();
> resources = this.getResources();
>
> android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources); // Android code here to get file
> resourceGlyphList = R.raw.glyphlist;
> InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper); // PDFBox property file
> android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
> Properties properties = new Properties();
> properties.load(rawResource);
>
> PDFTextStripper stripper = new PDFTextStripper(properties);
>
> stripper.setStartPage(pageNumber); // 1 or any other page
> stripper.setEndPage(pageNumber); // same page as above
> String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
> android.util.Log.d(TEST_PDFBOX, "readerPDF() stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), or
> stripper.getTextForRegion("class1"), or stripper.writeText(doc,
> outputStream)?
> I want the text as a String, not as a newly created file...
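
On the last point (text as a String rather than a file): an API that writes to a java.io.Writer does not force a file, because the Writer can be an in-memory java.io.StringWriter. The sketch below only illustrates that pattern with the standard library. The names WriterCapture, TextSource and capture are hypothetical helpers invented for this example, not part of PDFBox, and the lambda in main is a stand-in for a call such as stripper.writeText(document, out), assuming that method accepts a Writer as its second argument.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class WriterCapture {

    /** Hypothetical stand-in for any call that emits text to a Writer. */
    @FunctionalInterface
    public interface TextSource {
        void writeTo(Writer out) throws IOException;
    }

    /** Runs the source against an in-memory buffer and returns what it wrote. */
    public static String capture(TextSource source) {
        StringWriter buffer = new StringWriter();
        try {
            source.writeTo(buffer);
        } catch (IOException e) {
            // StringWriter itself never fails; this only covers the source.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Assumption: with PDFBox on the classpath, the lambda body would be
        // the writeText call; a plain write stands in for it here.
        String text = capture(out -> out.write("Page : 1\nsome extracted text"));
        System.out.println(text);
    }
}
```

No file is created at any point: the text only ever lives in the StringWriter's buffer. Note that the code quoted above already gets a String from stripper.getText(document), so this is an alternative route to the same result, not a fix for the extraction difference between versions.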