PDFdev is a service provided by PDFzone.com | http://www.pdfzone.com
_____________________________________________________________
Leonard,

Thanks for responding. I do realize that not all PDFs use /ToUnicode, although the number that do is significant: approx. 600 out of 2000 (about 30%) in my randomly picked batch of files have it. My goal is to minimize the number of files WITH /ToUnicode that my application chokes on during text extraction.

The alleged irregularity I described in the original message was observed in 3 out of 600 PDFs with /ToUnicode (0.5%), and that is too significant a number for me to simply discard.

Would you happen to know the meaning of the word "def" in a CMap (see original message)? And why would it be placed inside a dictionary?

Thanks!
Peter

---------- Original Message ----------------------------------
From: Leonard Rosenthol <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Date: Thu, 25 Sep 2003 21:41:14 -0400

>At 7:10 PM -0400 9/25/03, Peter Persits wrote:
>>An application I am developing is capable of extracting text data
>>from almost every PDF document, and for this to happen I have to
>>parse a font's ToUnicode stream which contains a CMap.
>
> What do you do with the HUNDREDS OF THOUSANDS of PDFs that
>don't have a ToUnicode stream?
>
>Leonard
>--
>---------------------------------------------------------------------------
>Leonard Rosenthol <mailto:[EMAIL PROTECTED]>
>Chief Technical Officer <http://www.pdfsages.com>
>PDF Sages, Inc. 215-629-3700 (voice)
> 215-629-0789 (fax)

To change your subscription:
http://www.pdfzone.com/discussions/lists-pdfdev.html
