Hello Michael,

I am not sure I completely understand your problem, but if this is about
the mapping from logical to visual text order, the bug and patch at [1]
might interest you (and maybe also [2]). As I said, though, I am not
really sure whether you mean that the extracted text has encoding
problems or ordering problems.

Best regards, Adam.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=55977
[2] https://bugs.freedesktop.org/show_bug.cgi?id=2981

On 31.10.2012 18:28, Michael Younkin wrote:
> Hello,
>
> We have been doing some work using Poppler's pdftotext tool with
> the -html option to extract text with bounding box coordinates from
> PDF files. Later on we match up these pieces of text and
> coordinates with versions of the PDF files converted to images.
>
> We are working with multiple languages, but right now we are
> focusing on Arabic. We are having a couple of problems with the
> encodings of Arabic characters. Sometimes all of the Unicode code
> points will be in the wrong order, and other times some characters
> have their code points backwards and some do not.
>
> According to our Arabic speaking Annotators, when we render the
> images the text appears correct, but when text from the pdftotext
> tool is matched against these renderings, we encounter the problems
> I stated above.
>
> I imagine that these issues are related to how PDF files are
> encoded and little to do with how pdftotext is extracting the text
> from PDF files. Does anyone have any suggestions for dealing with
> these issues? We can resolve some of them manually pretty quickly,
> but sometimes when the code points are in a seemingly random order
> all we can do is retype them, which is very time consuming as we
> are hoping to process hundreds of PDF file pages.
>
> Could someone also point me to where in the poppler code characters
> get extracted from the PDF file? We don't really know if it is just
> how we are using pdftotext that is causing the issues, if there is
> something that could be improved in the code, or if there is simply
> nothing that can be done. We have done some research and found that
> Apache's PDFBox can correct some of the issues we have been facing,
> but we are still investigating the code to see what they are doing
> to fix the problems.
>
> Thank you very much for your help!
>
> Michael Younkin
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
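
To make the "logical versus visual order" distinction from the reply concrete, here is a minimal Python sketch. It assumes the extracted text sometimes arrives in visual (left-to-right display) order, and it simply reverses each run of strong right-to-left characters to recover logical order. The function names are hypothetical. This is not what Poppler or PDFBox does, and it is not a full implementation of the Unicode Bidirectional Algorithm: it ignores digits, brackets and nested direction changes.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}
# Classes allowed inside a right-to-left run: whitespace and combining marks.
EMBEDDED_CLASSES = {"WS", "NSM"}


def is_rtl(ch):
    """Return True if ch is a strong right-to-left character."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def reverse_rtl_runs(text):
    """Reverse every maximal run of RTL characters (hypothetical helper).

    A run starts and ends on a strong RTL character and may contain
    whitespace and combining marks in between. Everything else is copied
    unchanged, so left-to-right words keep their order.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if not is_rtl(text[i]):
            out.append(text[i])
            i += 1
            continue
        j = i
        last_rtl = i
        while j < n and (
            is_rtl(text[j])
            or unicodedata.bidirectional(text[j]) in EMBEDDED_CLASSES
        ):
            if is_rtl(text[j]):
                last_rtl = j
            j += 1
        # Reverse only up to the last strong RTL character, so trailing
        # spaces before a left-to-right word stay where they are.
        out.append(text[i:last_rtl + 1][::-1])
        i = last_rtl + 1
    return "".join(out)


if __name__ == "__main__":
    logical = "\u0645\u0631\u062d\u0628\u0627"  # Arabic word in logical order
    visual = logical[::-1]                      # same word stored in visual order
    print(reverse_rtl_runs("PDF " + visual + " ok") == "PDF " + logical + " ok")
```

A heuristic like this only helps when a whole run is uniformly reversed. Where the code points come out in a seemingly random order, as described in the quoted message, the ordering information is probably not recoverable this way.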
