[iText-questions] getPageContent Bug ?

sjf Wed, 21 Jun 2006 01:38:49 -0700

> > I downloaded the latest iText and iTextSharp and found a bug. If I burst
> > a PDF file into pages and merge them into one PDF file again using
> > pdfsam (http://sourceforge.net/projects/pdfsam), getPageContent will
> > not return the correct content of the remerged PDF file, while
> > ExtractText from PDFBox (http://www.pdfbox.org) can
> > extract all the text correctly from the same PDF file.
>
> It's not a bug.
> You are mixing two different concepts.
> 1. you DO get the correct content of the remerged PDF,
> but it's different from the content of the original PDF.
> In the merged PDF the content is added as a PDF Form XObject.
> 2. The text extracted with PDFBox is the text that is in the
> Form XObject. PDFBox parses the page content and discovers
> that the real content is in a different object. It gets
> that object to retrieve the text.

Are there any examples of how to get the text in the Form XObject?
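
To make the question concrete, here is a rough sketch of the lookup the
quoted reply describes: the page content only contains a "/Name Do"
operator, and the real text sits in the stream that name refers to. The
sketch deliberately uses no iText calls. The class FormXObjectSketch, its
methods, and the Map standing in for the page's /Resources /XObject
dictionary are all hypothetical, and content streams are treated as
already-decoded strings. A real parser would tokenize the stream rather
than use a regex (which can also match inside string literals), and each
form has its own resources, which this sketch flattens into one map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or iTextSharp.
final class FormXObjectSketch {

    // Matches the "/Name Do" operator that paints an XObject.
    private static final Pattern DO_OPERATOR =
            Pattern.compile("/([^\\s/\\[\\]<>()]+)\\s+Do\\b");

    // Returns the names of the XObjects invoked by a content stream, in order.
    static List<String> invokedXObjects(String content) {
        List<String> names = new ArrayList<>();
        Matcher m = DO_OPERATOR.matcher(content);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Returns the content followed by the content of every form it invokes,
    // following nested forms up to maxDepth levels (guards against cycles).
    static String expand(String content, Map<String, String> forms, int maxDepth) {
        StringBuilder out = new StringBuilder(content);
        if (maxDepth <= 0) {
            return out.toString();
        }
        for (String name : invokedXObjects(content)) {
            String form = forms.get(name);
            if (form != null) { // names that are not known forms (e.g. images) are skipped
                out.append('\n').append(expand(form, forms, maxDepth - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed stand-in for the /XObject resource dictionary.
        Map<String, String> forms = new HashMap<>();
        forms.put("Fm0", "BT /F1 12 Tf (Hello) Tj ET");
        // A merged page whose only real content is the form invocation.
        String page = "q 1 0 0 1 0 0 cm /Fm0 Do Q";
        System.out.println(expand(page, forms, 5));
    }
}
```

With a real PDF, the map would have to be filled from the page's resource
dictionary and the stream bytes decoded first. Which iText calls do that
is exactly what I am asking about.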

A New Bug:

When I call GetPageContent in iTextSharp, I get an exception:

System.IO.EndOfStreamException: Trying to read content after the end of the stream
iTextSharp.text.pdf.RandomAccessFileOrArray.ReadFully(Byte[] b, Int32 off, Int32 len)
iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(PRStream stream,RandomAccessFileOrArray file)
iTextSharp.text.pdf.PdfReader.GetStreamBytes(PRStream stream,RandomAccessFileOrArray file)
iTextSharp.text.pdf.PdfReader.GetPageContent(Int32 pageNum,RandomAccessFileOrArray file)

But there IS a picture (and nothing else) on the page, and GetImportedPage works fine.

Thanks,
sjf


_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
