Correction: Changing the /etc/magic file as I described fixes the upload of files created by MS Office. However, docx files created by LibreOffice are still recognized as zip files. I would love to know how /etc/magic has to be modified so that docx, xlsx, and pptx files created by LibreOffice are also properly recognized.
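In the meantime, here is a rough workaround sketch in Python that does not depend on /etc/magic at all. As far as I can tell, the msooxml rules quoted below test the directory name (word/, ppt/, xl/) at one fixed position in the archive, so a package whose members are written in a different order would fall through to plain zip. That is a guess on my part and I have not confirmed it against LibreOffice output. The function name `guess_ooxml_mime` is made up for illustration and is not part of Mayan or libmagic; it simply looks at every member name in the zip.

```python
import zipfile

# Directory prefix inside the archive -> registered MIME type
# (the same three types that appear in the msooxml rules).
OOXML_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess_ooxml_mime(source):
    """Return the OOXML MIME type for a path or file object, or None.

    Hypothetical helper: it checks all member names instead of relying
    on the order in which the members were written to the archive.
    """
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    # An OOXML package carries a [Content_Types].xml part.
    if "[Content_Types].xml" not in names:
        return None
    for prefix, mime in OOXML_TYPES.items():
        if any(name.startswith(prefix) for name in names):
            return mime
    return None
```

Calling `guess_ooxml_mime("testfile.docx")` should return the wordprocessingml type for a docx file regardless of member order, and None for an ordinary zip file.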
On Sunday, August 4, 2013 9:58:34 AM UTC-4, Alek Geldenberg wrote:
>
> Roberto,
>
> Could you kindly post exactly what you did to "upgrade libmagic1 file"?
> I have read some posts about changing the /etc/magic file using the
> content of msooxml. I tried to update it in two ways:
>
> # Correct the mimetype with the registered ones:
> # http://technet.microsoft.com/en-us/library/cc179224.aspx
> >>>>&26 string word/ Microsoft Word 2007+
> !:mime
> application/vnd.openxmlformats-officedocument.wordprocessingml.document
> >>>>&26 string ppt/ Microsoft PowerPoint 2007+
> !:mime
> application/vnd.openxmlformats-officedocument.presentationml.presentation
> >>>>&26 string xl/ Microsoft Excel 2007+
> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> >>>>&26 default x Microsoft OOXML
> !:strength +10
>
>
> and this way:
>
> # Correct the mimetype with the registered ones:
> # http://technet.microsoft.com/en-us/library/cc179224.aspx
> >>>>&26 string word/ Microsoft Word 2007+
> !:mime
> application/vnd.openxmlformats-officedocument.wordprocessingml.document
> >>>>&26 string ppt/ Microsoft PowerPoint 2007+
> !:mime
> application/vnd.openxmlformats-officedocument.presentationml.presentation
> >>>>&26 string xl/ Microsoft Excel 2007+
> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> >>>>&26 default x Microsoft OOXML !:strength
> +10
>
>
>
> Here is the output of the file command:
> $ file testfile.docx
>
> /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
> /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
> /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated
>
> However, neither of them worked. The docx files are still uploaded into
> Mayan as zip files.
>
> I am running Ubuntu 12.04 LTS.
>
>
> I hope you can help me with this issue.
>
>
> On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>>
>> During a recent installation of Mayan, word processing documents (.docx)
>> were being detected as zip/compressed files and OCR was failing on them.
>> .docx files are in fact compressed files containing several XML files.
>> Upgrading the libmagic1 file allowed the 'file' command to detect the
>> document as a "Microsoft Word 2007+" file and, upon reuploading, Mayan was
>> able to OCR the documents correctly. This could be one of the causes of
>> the OCR failure being experienced in the thread. Check to see if the
>> 'file' command correctly detects the document type.
>>
>> This is the current list of file MIME types Mayan will pass to
>> LibreOffice for conversion to PDF if detected:
>> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>>
>>
>> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>>
>>> Hi Charles, Roberto and Steve,
>>>
>>> I'm loving this software. I'm actually planning to start a file
>>> digitization business for small businesses, and this software is the
>>> one I like most.
>>>
>>> I'm having the same problem as you two: a simple error given by the
>>> binaries in the OCR queue.
>>>
>>> Following Roberto's advice, I'm stuck at running unpaper on a PDF with
>>> the same error about "the magic %P". unpaper doesn't handle PDF! So
>>> maybe Roberto can give us another way to check what is going on inside
>>> Mayan so we can simulate it by hand.
>>>
>>> As far as I can see, there's no PDF output file from the "document as
>>> an image" in the temporary folder, just an empty file called IHAKtmp,
>>> so I guess the problem is at the first step, which should be the
>>> LibreOffice jpg to PDF conversion. That may make sense, since we are
>>> all using the same version of unpaper and tesseract and we may not be
>>> using the same LibreOffice.
>>>
>>> I'm in a hurry trying to figure out which is the best software for my
>>> company, and I would happily make a donation once I have it working
>>> locally.
>>>
>>>
>>> Also, while trying to solve this issue I've come to these observations:
>>>
>>>
>>> 1
>>> --------------------------
>>> Tesseract has to have its language training files under /usr/local/
>>> in order to work
>>>
>>> like this:
>>>
>>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
>>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
>>> configs           eng.cube.lm    eng.cube.size       eng.traineddata
>>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
>>>
>>>
>>> 2
>>> --------------------------
>>> Making tesseract work directly on a .jpg from the scan gives EXTREMELY
>>> better results than giving it a ppm "cleaned" by unpaper. In the
>>> first case only 5 words on a page were mistaken, while with a cleaned
>>> ppm tesseract gave only 3 comprehensible words in the whole page.
>>> Tesseract does not accept a PDF (a jpg converted via LibreOffice) at
>>> all; it gives:
>>>
>>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>> Error in pixReadStream: Unknown format: no pix returned
>>> Error in pixRead: pix not read
>>> Unsupported image type.
>>>
>>> 3
>>> ----------------------------
>>> Having a metadata tag indicating a language in Mayan and using it to
>>> set the language flag of tesseract can improve results a lot! (50
>>> words per page) If my project ends up using Mayan, I will try to
>>> program this feature.
>>>
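On Lau's third observation, this is roughly how a language tag could be turned into tesseract's `-l` flag. It is only a sketch: the mapping, the tag values, and the function name `build_tesseract_command` are assumptions made up for illustration and are not Mayan's actual code. The codes "cat" and "eng" correspond to the traineddata files shown in the tessdata listing.

```python
# Assumed mapping from a document's language metadata value to a
# Tesseract language code; extend it to match the installed traineddata.
TESSERACT_LANGUAGES = {"catalan": "cat", "english": "eng"}


def build_tesseract_command(image_path, output_base, language_tag, default="eng"):
    """Return the argument list for a tesseract run with -l set.

    Hypothetical helper: unknown tags fall back to the default code.
    """
    code = TESSERACT_LANGUAGES.get(language_tag.strip().lower(), default)
    return ["tesseract", image_path, output_base, "-l", code]
```

The returned list can be handed to `subprocess.check_call`, for example `subprocess.check_call(build_tesseract_command("scan.jpg", "scan", "Catalan"))`, which would write the recognized text to scan.txt.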
