[CBLX] DOC - converting Arabic-language PDF documents: Gmail and Python as PDF converter | Ayman Hourieh's Blog

Aldo Fri, 15 Dec 2006 01:07:32 -0800

For what it's worth, I found this:

URL: http://aymanh.com/archives/2006/01/05/gmail-and-python-as-pdf-converter


Gmail and Python as PDF converter

Excerpt:

[...]
   Today I downloaded an Arabic PDF file I wanted to read, but neither
   xpdf nor gpdf (Linux PDF readers) could display its characters
   correctly. According to its properties page, the document was created
   with "Acrobat Distiller 6.0 (Windows)"; perhaps that was the reason.
   My first thought was to use Google, which can display cached PDF
   files as HTML. Unfortunately, Google hadn't cached the file, so I had
   to look for another way. I searched for an online PDF-to-HTML
   converter, but every result was down, not free, or useless for some
   other reason.

   After more thinking, I remembered that Gmail had recently gained the
   ability to display PDF attachments as HTML. I emailed the file to
   myself as an attachment and opened it as HTML, which partially solved
   the problem: I could see the Arabic characters, but they were in
   reverse order!

   Arabic is a right-to-left language. Its characters are usually stored
   in logical order, and logical order is converted to visual order when
   the text is displayed. My guess is that the characters were stored in
   logical order in the document, and Gmail didn't convert them to
   visual order before displaying them...
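
To make the logical-versus-visual distinction concrete, here is a minimal Python 3 sketch that is not part of the original post. It takes a short Arabic word stored in logical order, looks up each character's Unicode bidirectional class, and reverses the string the way a character-reversal hack would. The sample word and the names `describe`, `logical` and `visual_for_naive_ltr` are illustrative assumptions.

```python
import unicodedata

def describe(text):
    """Return (code point, bidi class) pairs in stored (logical) order."""
    return [("U+%04X" % ord(ch), unicodedata.bidirectional(ch)) for ch in text]

# The word "salam" stored in logical order: seen, lam, alef, meem.
logical = "\u0633\u0644\u0627\u0645"

# Arabic letters carry the bidi class "AL" (Arabic Letter), which tells a
# bidi-aware renderer to lay the run out from right to left.
classes = describe(logical)

# A renderer that ignores bidi classes draws the first stored character on
# the left, so the word looks reversed; reversing the string compensates.
visual_for_naive_ltr = logical[::-1]

if __name__ == "__main__":
    for code_point, bidi_class in classes:
        print(code_point, bidi_class)
    print([("U+%04X" % ord(ch)) for ch in visual_for_naive_ltr])
```

Whether the reversal seen in Gmail really came from a missing logical-to-visual conversion is the author's guess; the sketch only shows what the two orders look like.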

   At that point, reading the document had become a challenge for me to
   solve. Although I didn't have free time, I decided to hack together a
   quick and dirty Python script to reverse the characters. Fortunately,
   the document contained only Arabic characters, so I didn't have to
   deal with bidirectionality. After several minutes of writing and
   debugging the script, I got it to work correctly and produce readable
   text!

   Finally I realized that the PDF document wasn't worth the effort, but
   I had fun solving the problem!

   Here is the script if anyone is interested:

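   #!/usr/bin/env python
   import io

   # io.open decodes UTF-8 on both Python 2 and Python 3.
   infile = io.open('input.txt', encoding='utf-8')
   inlines = infile.readlines()
   infile.close()

   outlines = []
   for l in inlines:
       # Drop the line break before reversing so it stays at the end.
       tmp = list(l.rstrip(u'\n'))
       tmp.reverse()
       outlines.append(u''.join(tmp) + u'\n')

   outfile = io.open('output.txt', mode='w', encoding='utf-8')
   outfile.writelines(outlines)
   outfile.close()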

   By the way, Linux PDF readers usually display Arabic documents
   correctly (such as those created with OpenOffice), but it looks like
   this document wasn't created in a standard way.
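
The post sidesteps bidirectionality because the document contained only Arabic characters. As a rough illustration of what a mixed line would need, here is a Python 3 sketch that is not from the original post: it reverses the whole line, then flips left-to-right runs (Latin letters and European digits) back so they stay readable. The function name `reverse_rtl_line` and the run-detection rule are assumptions, and this is a deliberate simplification, not the Unicode Bidirectional Algorithm.

```python
import unicodedata

LTR_CLASSES = {"L", "EN"}        # left-to-right letters, European digits
RTL_CLASSES = {"R", "AL", "AN"}  # Hebrew/Arabic letters, Arabic-Indic digits

def is_ltr(ch):
    return unicodedata.bidirectional(ch) in LTR_CLASSES

def reverse_rtl_line(line):
    """Reverse a line (without its line break), keeping LTR runs readable.

    Simplifying assumptions: the base direction is right-to-left, and a
    left-to-right run is any stretch that starts and ends with a class L
    or EN character and contains no right-to-left character.
    """
    chars = list(line[::-1])
    i = 0
    while i < len(chars):
        if not is_ltr(chars[i]):
            i += 1
            continue
        # Extend the run over LTR characters and the neutral characters
        # (spaces, punctuation) that sit between them.
        end = i
        j = i + 1
        while j < len(chars) and unicodedata.bidirectional(chars[j]) not in RTL_CLASSES:
            if is_ltr(chars[j]):
                end = j
            j += 1
        # Flip the run back to its original left-to-right order.
        chars[i:end + 1] = reversed(chars[i:end + 1])
        i = end + 1
    return "".join(chars)
```

For a line containing only Arabic characters this behaves exactly like a plain reversal. Numbers embedded in Arabic text follow more detailed rules in the real algorithm, so a production tool should rely on a full bidi implementation instead.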

[...]


Aldo.

Submitted by Ayman on Thu, 2006/01/05 - 3:26pm.

WOW can't believe how clean and straightforward Python code can get! I definitely should learn this thing! Thanks for publishing the hack. :)

BTW, weren't you able to read the PDF document with Adobe Acrobat Reader for Linux?

reply »

The last time I tried Acrobat Reader for Linux, I found it too bloated for my liking. I don't have it installed, so I couldn't test.

Any idea if it even supports Arabic?

reply »

Well apparently yes, because I've been able to read many Arabic PDF documents created here on FreeBSD or on a Windows system with Acrobat Reader 5. I assume that you know that I'm running the native Linux version through the kernel-side Linux ABI support.

Though I remember that on one occasion KPDF, which is pretty lightweight, saved my day when Acrobat Reader complained about not being able to find an Arabic font familiar to Windows users. I guess it was Arabic Transparent.

And as for being bloated, I didn't notice any significant impact, although the executable is 8 MB and its total memory footprint with a couple of 1000-page documents open is near 30000K. :p

reply »

shocran ya Ayman!
it's amazing :)

reply »

Hi Ayman, maybe I am stupid, and I know jack shit about Python, but this thing doesn't play well!
how do you invoke it? I saved the code as cap.py
chmod +x cap.py

then what?
./cap.py some_arabic_pdf.pdf

reply »

Hi Mazen,

The script is only an ugly hack that reverses the characters in each line. It works with text files; I used Gmail to convert the PDF file to text before applying the script.

The filenames are hard-coded in the script. Here is a modified version that takes the input and output filenames on the command line:

#!/usr/bin/env python
import io
import sys

if len(sys.argv) != 3:
    # sys.exit() with a string prints it to stderr and exits with status 1.
    sys.exit('Usage: %s infile outfile' % sys.argv[0])

# io.open decodes UTF-8 on both Python 2 and Python 3.
infile = io.open(sys.argv[1], encoding='utf-8')
inlines = infile.readlines()
infile.close()

outlines = []
for line in inlines:
    # Drop the line break before reversing so it stays at the end.
    tmp = list(line.rstrip(u'\n'))
    tmp.reverse()
    outlines.append(u''.join(tmp) + u'\n')

outfile = io.open(sys.argv[2], mode='w', encoding='utf-8')
outfile.writelines(outlines)
outfile.close()

A bit more user-friendly, but still an ugly hack...

Hope this helps, but if you're looking for a long-term solution, I suggest you research PDF readers or converters other than the ones I tried.
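
Because the hack only suits lines made up entirely of Arabic characters, it may help to check a text file before reversing it. The following Python 3 sketch is not from the original thread; the function name `lines_needing_bidi` and the choice of flagged classes are assumptions. It reports which lines contain left-to-right letters or digits, which a plain character reversal would scramble.

```python
import unicodedata

# Bidi classes that a plain character reversal would scramble:
# left-to-right letters, European digits and Arabic-Indic digits.
UNSAFE_CLASSES = {"L", "EN", "AN"}

def lines_needing_bidi(lines):
    """Return 1-based numbers of lines containing left-to-right text or digits."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if any(unicodedata.bidirectional(ch) in UNSAFE_CLASSES for ch in line):
            flagged.append(number)
    return flagged
```

An empty result suggests the simple reversal is enough for that file; any flagged line would need proper bidirectional handling.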

reply »

Hi,
I have a few Arabic PDF files that I would like to convert to Microsoft Word (Windows). Is there PDF converter software that supports Arabic?

Thanks in advance
Damatre

reply »

Nice code! I saved it, maybe I will need to use it at some time! :)

Personally, I never create PDFs with Adobe Acrobat Distiller (I use a small PDF converter on Windows).

Can anyone suggest a PDF viewer for Linux that is less bloated than Adobe's?

John

reply »

xpdf is my favorite.

reply »

Ayman, can you please help me by simply explaining how to use xpdf, since you are familiar with it? When I tried to use it with Arabic PDFs, it showed unrecognized characters. I think it needs some add-ons that I don't know how to set up.

My Arabic PDFs are readable in Adobe Reader, but when I try to copy some text from them into a text file, I get unrecognized characters, and I don't know the reason for this either.

Thanks in advance

reply »

_______________________________________________
CarrefourBLinuX mailing list
    [email protected]
    http://lists.freearchive.org/mailman/listinfo/carrefourblinux
EDU sheets: http://blinuxwiki.pbwiki.com/FichesEdu
Bookmarks: http://fr.groups.yahoo.com/group/carrefourblinux/links/
Archives: http://lists.freearchive.org/pipermail//carrefourblinux
Old archives (Yahoo Groups):
    http://fr.groups.yahoo.com/group/carrefourblinux/messages
Search: http://lists.freearchive.org/cgi-bin/search.cgi
To subscribe by email:
    'mailto:[EMAIL PROTECTED]'
To unsubscribe by email:
    'mailto:[EMAIL PROTECTED]'
