> > I downloaded the latest iText and iTextSharp and found a bug. If I burst
> > a PDF file into pages and merge them into one PDF file again using
> > pdfsam (http://sourceforge.net/projects/pdfsam), getPageContent will
> > not return the correct content of the remerged PDF file, while
> > ExtractText from PDFBox (http://www.pdfbox.org) can
> > extract all the text correctly from the same PDF file.
>
> It's not a bug.
> You are mixing two different concepts.
> 1. you DO get the correct content of the remerged PDF,
> but it's different from the content of the original PDF.
> In the merged PDF the content is added as a PDF Form XObject.
> 2. The text extracted with PDFBox is the text that is in the
> Form XObject. PDFBox parses the page content and discovers
> that the real content is in a different object. It gets
> that object to retrieve the text.
 
Are there any examples of how to get the text in the Form XObject?
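
To make the quoted explanation concrete, here is a minimal sketch of the lookup it describes: the page content only contains a "/Name Do" operator, and the text sits in the content stream of the Form XObject with that name. This is a toy model in plain Java, not iText or PDFBox API. The class name FormXObjectTextSketch, the method names, and the map standing in for the page's XObject resources are all assumptions made for illustration. It only handles the Tj operator with simple literal strings, and a real PDF would also need stream decoding, TJ arrays, escapes and font encodings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical toy model: content streams are plain strings, and the
// page's XObject resources are a name -> content stream map.
class FormXObjectTextSketch {

    // Returns the operands of all Tj operators, following "/Name Do"
    // into the named XObject's content stream.
    static List<String> extractText(String content, Map<String, String> xobjects) {
        List<String> out = new ArrayList<>();
        collect(content, xobjects, out, 0);
        return out;
    }

    private static void collect(String content, Map<String, String> xobjects,
                                List<String> out, int depth) {
        if (depth > 16) {
            return; // guard against XObjects that reference each other
        }
        List<String> tokens = tokenize(content);
        for (int i = 1; i < tokens.size(); i++) {
            String operand = tokens.get(i - 1);
            String op = tokens.get(i);
            if (op.equals("Tj") && operand.length() >= 2
                    && operand.startsWith("(") && operand.endsWith(")")) {
                // Text shown directly in this content stream.
                out.add(operand.substring(1, operand.length() - 1));
            } else if (op.equals("Do") && operand.startsWith("/")) {
                // The real content lives in another object: look it up and recurse.
                String nested = xobjects.get(operand.substring(1));
                if (nested != null) {
                    collect(nested, xobjects, out, depth + 1);
                }
            }
        }
    }

    // Splits on whitespace but keeps "(literal strings)" as one token.
    // Nested parentheses and escapes are not handled in this sketch.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '(') {
                int end = s.indexOf(')', i);
                i = (end < 0 ? s.length() : end + 1);
            } else {
                while (i < s.length() && !Character.isWhitespace(s.charAt(i))
                        && s.charAt(i) != '(') {
                    i++;
                }
            }
            tokens.add(s.substring(start, i));
        }
        return tokens;
    }
}
```

With a page content of "q 1 0 0 1 0 0 cm /Xi0 Do Q" and an entry Xi0 whose stream contains "(some text) Tj", extractText returns that text, even though the page content itself has no text operators. That would explain why getPageContent alone shows no text for the remerged file.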
 
A New Bug:
 
When I call GetPageContent in iTextSharp, I get an exception:
 
System.IO.EndOfStreamException: Trying to read content after the end of the stream
iTextSharp.text.pdf.RandomAccessFileOrArray.ReadFully(Byte[] b, Int32 off, Int32 len)
iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw(PRStream stream,RandomAccessFileOrArray file)
iTextSharp.text.pdf.PdfReader.GetStreamBytes(PRStream stream,RandomAccessFileOrArray file)
iTextSharp.text.pdf.PdfReader.GetPageContent(Int32 pageNum,RandomAccessFileOrArray file)
 
But there IS a picture (and nothing else) on the page, and GetImportedPage works fine.
 
Thanks,
sjf
 
 







_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
