Roberto,

Could you kindly post exactly what you did to "upgrade libmagic1 file"? I have read some posts about changing the /etc/magic file using the content of msooxml. I tried to upgrade it in two ways:

    # Correct the mimetype with the registered ones:
    # http://technet.microsoft.com/en-us/library/cc179224.aspx
    >>>>&26 string word/ Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
    >>>>&26 string ppt/ Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
    >>>>&26 string xl/ Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    >>>>&26 default x Microsoft OOXML !:strength +10

and this way:

    # Correct the mimetype with the registered ones:
    # http://technet.microsoft.com/en-us/library/cc179224.aspx
    >>>>&26 string word/ Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
    >>>>&26 string ppt/ Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
    >>>>&26 string xl/ Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
    >>>>&26 default x Microsoft OOXML !:strength +10

Here is the output of the file command:

    $ file testfile.docx
    /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
    /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
    /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated

However, none of that worked. The docx files are still uploaded into Mayan as zip files. I am running Ubuntu 12.04 LTS.

I hope you can help me with this issue.

On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>
> During a recent installation of Mayan, wordprocessing documents (.docx)
> were being detected as zip/compressed files and OCR was failing on them.
> .docx are in fact compressed files containing several XML files.
> Upgrading the libmagic1 file allowed the 'file' command to detect the
> document as a "Microsoft Word 2007+" file and upon reuploading, Mayan was
> able to OCR the documents correctly. This could be one of the causes for
> the OCR failure being experienced in the thread. Check to see if the
> 'file' command correctly detects the document type.
>
> This is the current list of file MIME types Mayan will pass to LibreOffice
> for conversion to PDF if detected:
> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>
>
> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>
>> Hi Charles, Roberto and Steve,
>>
>> I'm loving this software. I'm actually planning to start a business
>> digitizing files for small businesses, and this software is the one
>> I like most.
>>
>> I'm having the same problem as you two: a simple error given by the
>> binaries in the OCR queue.
>>
>> Following Roberto's advice, I'm stuck at running unpaper on a PDF, with
>> the same error about "the magic %P"; unpaper doesn't handle PDF! So
>> Roberto may give us another way to check what is going on inside Mayan so
>> we can simulate it by hand.
>>
>> As far as I can see, there's no PDF output file from the "document as an
>> image" in the temporary folder, just a file called IHAKtmp which is
>> empty, so I guess the problem is at the first step, which should be the
>> LibreOffice JPG to PDF conversion. That may make sense since we are
>> all using the same version of unpaper and tesseract and we may not be
>> using the same LibreOffice.
>>
>> I'm in a hurry trying to figure out which is the best software for my
>> company, and I would happily make a donation once I have it working
>> locally.
>>
>>
>> Also, while trying to solve this issue I've come to these observations:
>>
>>
>> 1
>> --------------------------
>> Tesseract has to have its language training files in /usr/local/
>> in order to work
>>
>> like this:
>>
>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
>> configs           eng.cube.lm    eng.cube.size       eng.traineddata
>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
>>
>>
>> 2
>> --------------------------
>> Running tesseract on a .jpg straight from the scan gives EXTREMELY
>> better results than giving it a ppm "cleaned" by unpaper: in the
>> first case only 5 words in a page were mistaken, while with a cleaned ppm
>> tesseract gave only 3 comprehensible words in the whole page. No PDF
>> (JPG converted via LibreOffice) is accepted by tesseract, which gives:
>>
>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>> Error in pixReadStream: Unknown format: no pix returned
>> Error in pixRead: pix not read
>> Unsupported image type.
>>
>> 3
>> ----------------------------
>> Having a metadata tag indicating a language in Mayan and using it to
>> set the language flag of tesseract can improve results a lot! (50
>> words per page) If my project ends up using Mayan, I would try to
>> program this feature.
>>
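
As a rough illustration of observation 3 above, the sketch below builds a tesseract command line and adds the `-l` language option only when a document's language metadata is known. The function name, the metadata values, and the mapping table are hypothetical and are not part of Mayan. The codes `cat` and `eng` match the traineddata files listed in observation 1.

```python
# Hypothetical mapping from a document's language metadata value to a
# Tesseract language code. The metadata values are assumptions for
# illustration; the codes match the cat/eng traineddata files in the thread.
LANGUAGE_CODES = {
    "english": "eng",
    "catalan": "cat",
}


def build_tesseract_command(image_path, output_base, metadata_language=None):
    """Return the argv list for a tesseract run.

    The "-l <code>" option is appended only when the metadata value maps to
    a known language code; otherwise tesseract falls back to its default.
    """
    command = ["tesseract", image_path, output_base]
    key = (metadata_language or "").strip().lower()
    code = LANGUAGE_CODES.get(key)
    if code is not None:
        command += ["-l", code]
    return command


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(build_tesseract_command("scan.jpg", "scan_text", "Catalan"))
```

The returned list could then be handed to something like `subprocess.check_call`, assuming tesseract is installed and the matching traineddata file is present. Unknown or missing metadata leaves the command unchanged, so existing behaviour would not be affected.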
