For what it's worth, I found this: URL: http://aymanh.com/archives/2006/01/05/gmail-and-python-as-pdf-converter
Gmail and Python as PDF converter
Excerpt:
[...]
Today I downloaded an Arabic PDF file I wanted to read, but neither
xpdf nor gpdf (Linux PDF readers) could display its characters
correctly. According to its properties page, the document was created
with "Acrobat Distiller 6.0 (Windows)", which may have been the
reason. My first thought was to use Google, since it can display
cached PDF files as HTML, but unfortunately Google had not cached the
file, so I had to look for another way. I searched for an online
PDF-to-HTML converter, but every result was down, not free, or
useless for some other reason.
After more thinking, I remembered that Gmail recently gained the
ability to display PDF attachments as HTML. I emailed the attachment
to myself and opened it as HTML, and the problem was partially
solved: I could see Arabic characters, but they were in reverse order!
Arabic is a right-to-left language. Its characters are usually stored
in logical order, and logical order is converted to visual order when
text is displayed. My guess is that the characters were stored in
logical order in the document, and Gmail didn't convert them to visual
order before displaying...
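To make the "logical order" idea a bit more concrete, here is a minimal Python 3 sketch (not part of the original post) that asks the standard library for the Unicode bidirectional class of each character. The helper name bidi_classes and the sample strings are assumptions made for illustration only. A renderer uses these classes to decide which runs of text get laid out right to left.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Four Arabic letters (seen, lam, alef, meem) in logical (storage) order,
# written as escapes so the example does not depend on the editor's encoding.
logical = "\u0633\u0644\u0627\u0645"

print(bidi_classes(logical))    # ['AL', 'AL', 'AL', 'AL']
print(bidi_classes("abc 12"))   # ['L', 'L', 'L', 'WS', 'EN', 'EN']
```

'AL' marks Arabic letters, 'L' left-to-right letters, 'WS' whitespace and 'EN' European digits.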
At that point, reading the document had become a challenge for me to
solve. Although I didn't have free time, I decided to hack together a
quick and dirty Python script to reverse the characters. Fortunately,
the document contained only Arabic characters, so I didn't have to
deal with bi-directionality. After several minutes of writing and
debugging the script, I got it to work correctly and produce readable
text!
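For comparison, here is a rough Python 3 sketch of what a plain reversal would get wrong if a line mixed Arabic with digits or Latin words: those runs would be flipped too. The function below (reverse_keeping_ltr_runs is a made-up name, and the choice of classes in LTR_CLASSES is an assumption) reverses the whole line and then flips each left-to-right run back. It is only a crude approximation and nowhere near the full Unicode Bidirectional Algorithm.

```python
import unicodedata
from itertools import groupby

# Assumption: treat left-to-right letters and both kinds of digits as
# runs that must keep their internal order.
LTR_CLASSES = {"L", "EN", "AN"}

def is_ltr(ch):
    """True if the character belongs to a run that reads left to right."""
    return unicodedata.bidirectional(ch) in LTR_CLASSES

def reverse_keeping_ltr_runs(text):
    """Reverse text, then restore the internal order of each LTR run."""
    reversed_text = text[::-1]
    parts = []
    for ltr, run in groupby(reversed_text, key=is_ltr):
        chunk = "".join(run)
        # Runs of digits or Latin letters were flipped by the reversal,
        # so flip them back; everything else stays reversed.
        parts.append(chunk[::-1] if ltr else chunk)
    return "".join(parts)
```

Note that spaces break a run, so a multi-word Latin phrase would keep each word readable but end up with its words in the opposite order.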
Finally I realized that the PDF document wasn't worth the effort, but
I had fun solving the problem!
Here is the script if anyone is interested:
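#!/usr/bin/env python
# Python 2 script: reverses the characters of every line of input.txt.
import codecs

infile = open('input.txt', 'r')
inlines = infile.readlines()
infile.close()

outlines = []
for l in inlines:
    # Decode the UTF-8 bytes so that characters, not bytes, are reversed.
    # The line's trailing newline is reversed too, so it lands at the front.
    tmp = list(unicode(l, 'utf-8'))
    tmp.reverse()
    outlines.append(''.join(tmp))

outfile = codecs.open('output.txt', encoding='utf-8', mode='w+')
outfile.writelines(outlines)
outfile.close()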
By the way, Linux PDF readers usually display Arabic documents
correctly (such as those created with OpenOffice), but it looks like
this document wasn't created in a standard way.
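The script in the post is written for Python 2 (it relies on the unicode built-in). Here is a minimal sketch of the same idea for Python 3, assuming UTF-8 text files as in the post. The function names reverse_line and reverse_file are made up for this example, and unlike the original it keeps each line ending at the end of the line instead of reversing it along with the text.

```python
def reverse_line(line):
    """Reverse the characters of one line, leaving its line ending in place."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    return body[::-1] + ending

def reverse_file(in_path, out_path):
    """Write a copy of in_path to out_path with every line reversed."""
    with open(in_path, encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            outfile.write(reverse_line(line))
```

With the file names used in the post, this would be called as reverse_file("input.txt", "output.txt"). The reversal works per code point, so combining marks such as Arabic vowel signs would end up before their base letter instead of after it.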
[...]
Aldo.
Title: Gmail and Python as PDF converter | Ayman Hourieh's Blog
Tags: Coding | Internet | Linux | OpenSource
Ayman | Last time I tried Acrobat | Sat, 2006/01/07 - 8:43pm
Last time I tried Acrobat Reader for Linux, I found it too bloated for my liking. I don't have it installed, so I couldn't test.
Any idea if it even supports Arabic?
strontium90 (not verified) | Well apparently yes, because | Sun, 2006/01/08 - 12:04am
Well apparently yes, because I've been able to read many Arabic PDF documents created here on FreeBSD or on a Windows system with Acrobat Reader 5. I assume that you know that I'm running the native Linux version through the kernel-side Linux ABI support.
Though I remember that on one occasion KPDF, which is pretty lightweight, saved my day when Acrobat Reader complained about not being able to find an Arabic font familiar to Windows users. I guess it was Arabic Transparent.
And about being bloated, I didn't notice any significant impact, although the executable is 8 megs and its total memory footprint with a couple of 1000-page documents open is near 30000K. :p
Anonymous (not verified) | shocran ya Ayman! it's | Fri, 2006/01/06 - 2:01am
shocran ya Ayman!
it's amazing :)
Mazen (not verified) | Hi ayman, maybe I am stupid, | Tue, 2006/01/24 - 7:01am
Hi Ayman, maybe I am stupid, and I know jack shit about Python, but this thing doesn't play well!
how do you invoke it? I saved the code as cap.py
chmod +x cap.py
then what?
./cap.py some_arabic_pdf.pdf
Ayman | Hi Mazen, The script is only | Tue, 2006/01/24 - 6:17pm
Hi Mazen,
The script is only an ugly hack to reverse the characters in each line. It works with txt files; I used Gmail to convert the PDF file to text before applying the script.
As for filenames, they are hard-coded in the script. Here is a modified version that takes the input and output filenames on the command line:
#!/usr/bin/env python
# Python 2 script: reverses the characters of every line of a UTF-8 text file.
import sys
import codecs

# Expect exactly two arguments: the input file and the output file.
if len(sys.argv) != 3:
    print 'Usage: %s infile outfile' % sys.argv[0]
    sys.exit(-1)

infile = open(sys.argv[1], 'r')
inlines = infile.readlines()
infile.close()

outlines = []
for line in inlines:
    # Decode the UTF-8 bytes so that characters, not bytes, are reversed.
    tmp = list(unicode(line, 'utf-8'))
    tmp.reverse()
    outlines.append(''.join(tmp))

outfile = codecs.open(sys.argv[2], encoding='utf-8', mode='w+')
outfile.writelines(outlines)
outfile.close()
A bit more user-friendly, but still an ugly hack...
Hope this helps, but if you're looking for a long-term solution, I suggest you research PDF readers or converters other than the ones I tried.
Damatre (not verified) | Arabic PDF Converter | Thu, 2006/03/23 - 6:18am
Hi,
I have a few Arabic PDF files that I would like to convert to Microsoft Word (Windows). Is there any PDF converter software that supports Arabic?
Thanks in advance
Damatre
John (not verified) | Nice code! I saved it, maybe | Tue, 2006/08/01 - 11:02pm
Nice code! I saved it, maybe I will need to use it at some time! :)
Personally, I never create PDFs with Adobe Acrobat Distiller (I use a small PDF converter on Windows).
Can anyone suggest a PDF viewer for Linux that is less bloated than Adobe's?
John
Mike (not verified) | xpdf | Mon, 2006/10/23 - 3:48am
Ayman, can you please help me by simply explaining how to use xpdf, since you are familiar with it? When I tried to use it with Arabic PDFs, it produced unrecognized characters. I think it needs some additions that I don't know how to set up.
My Arabic PDFs are readable in Adobe Reader, but when I tried to copy some text from it into a text file, it gave unrecognized characters, and I don't know the reason for this either.
Thanks in advance
_______________________________________________
CarrefourBLinuX mailing list
[email protected]
http://lists.freearchive.org/mailman/listinfo/carrefourblinux
EDU sheets: http://blinuxwiki.pbwiki.com/FichesEdu
Bookmarks: http://fr.groups.yahoo.com/group/carrefourblinux/links/
Archives: http://lists.freearchive.org/pipermail//carrefourblinux
Old archives (Yahoo Groups):
http://fr.groups.yahoo.com/group/carrefourblinux/messages
Search: http://lists.freearchive.org/cgi-bin/search.cgi
To subscribe by email:
'mailto:[EMAIL PROTECTED]'
To unsubscribe by email:
'mailto:[EMAIL PROTECTED]'

strontium90 (not verified) | WOW can't believe how clean | Fri, 2006/01/06 - 12:10am
WOW can't believe how clean and straightforward Python code can get! I definitely should learn this thing! Thanks for publishing the hack. :)
BTW, weren't you able to read the PDF document with Adobe Acrobat Reader for Linux?