Hi Alek, I imagine Roberto meant upgrading libmagic1 proper, not actually modifying the /etc/magic config file, which to my understanding is specific to the file(1) command. Which version of libmagic1 are you using?
I seem to have MS Office documents being correctly detected with version 5.11-2.

On Sunday, 4 August 2013 16:03:55 UTC+2, Alek Geldenberg wrote:
>
> Correction:
>
> Changing the /etc/magic file as I described fixed uploading files made by MS
> Office. However, docx files created by Libre Office are still recognized
> as zip files. I would love to know how /etc/magic has to be modified so
> that docx, xlsx, pptx files created by Libre Office would also be properly
> recognized.
>
> On Sunday, August 4, 2013 9:58:34 AM UTC-4, Alek Geldenberg wrote:
>>
>> Roberto,
>>
>> Could you kindly post what exactly you did to "upgrade libmagic1
>> file"? I have read some posts about changing the /etc/magic file with the
>> content of msooxml. I tried to upgrade it in two ways:
>>
>> # Correct the mimetype with the registered ones:
>> # http://technet.microsoft.com/en-us/library/cc179224.aspx
>> >>>>&26 string word/ Microsoft Word 2007+
>> !:mime
>> application/vnd.openxmlformats-officedocument.wordprocessingml.document
>> >>>>&26 string ppt/ Microsoft PowerPoint 2007+
>> !:mime
>> application/vnd.openxmlformats-officedocument.presentationml.presentation
>> >>>>&26 string xl/ Microsoft Excel 2007+
>> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>> >>>>&26 default x Microsoft OOXML
>> !:strength +10
>>
>> and this way:
>>
>> # Correct the mimetype with the registered ones:
>> # http://technet.microsoft.com/en-us/library/cc179224.aspx
>> >>>>&26 string word/ Microsoft Word 2007+
>> !:mime
>> application/vnd.openxmlformats-officedocument.wordprocessingml.document
>> >>>>&26 string ppt/ Microsoft PowerPoint
>> 2007+ !:mime
>> application/vnd.openxmlformats-officedocument.presentationml.presentation
>> >>>>&26 string xl/ Microsoft Excel 2007+
>> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>> >>>>&26 default x Microsoft OOXML
>> !:strength +10
>>
>> Here is the output of the file command:
>> $ file testfile.docx
>>
>> /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime
>> application/vnd.openxmlformats-offi' truncated
>> /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime
>> application/vnd.openxmlformat' truncated
>> /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime
>> application/vnd.openxmlformats-off' truncated
>>
>> However, none of that worked. The docx files are still uploaded into
>> Mayan as zip files.
>>
>> I am running Ubuntu 12.04 LTS.
>>
>> I hope you can help me with this issue.
>>
>> On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>>>
>>> During a recent installation of Mayan, word processing documents (.docx)
>>> were being detected as zip/compressed files and OCR was failing on them.
>>> .docx files are in fact compressed files containing several XML files.
>>> Upgrading the libmagic1 file allowed the 'file' command to detect the
>>> document as a "Microsoft Word 2007+" file and upon reuploading, Mayan was
>>> able to OCR the documents correctly. This could be one of the causes for
>>> the OCR failure being experienced in the thread. Check to see if the
>>> 'file' command correctly detects the document type.
>>>
>>> This is the current list of file MIME types Mayan will pass to
>>> LibreOffice for conversion to PDF if detected:
>>> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>>>
>>> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>>>
>>>> Hi Charles, Roberto and Steve,
>>>>
>>>> I'm loving this software. I'm actually planning to start a file
>>>> digitalization business for small businesses, and this software is the
>>>> one I like most.
>>>>
>>>> I'm having the same problem as you two: a simple error given by the
>>>> binaries in the OCR queue.
>>>>
>>>> Following Roberto's advice, I'm stuck at running unpaper on a PDF with
>>>> the same error about "the magic %P". unpaper doesn't handle PDF! So
>>>> Roberto may give us another way to check what is going on inside Mayan
>>>> so we can simulate it by hand.
>>>>
>>>> As far as I can see there's no PDF output file from the "document as an
>>>> image" in the temporary folder, just a file called IHAKtmp which is
>>>> empty, so I guess the problem is at the first step, which should be the
>>>> LibreOffice jpg to pdf conversion. That may make sense, since we are
>>>> all using the same version of unpaper and tesseract and we may not be
>>>> using the same LibreOffice.
>>>>
>>>> I'm in a hurry trying to figure out which is the best software for my
>>>> company and I would happily make a donation once I have it working
>>>> locally.
>>>>
>>>> Also, while trying to solve this issue I've come to these observations:
>>>>
>>>> 1
>>>> --------------------------
>>>> Tesseract has to have its language training files under /usr/local/
>>>> in order to work,
>>>>
>>>> like this:
>>>>
>>>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
>>>> cat.traineddata eng.cube.fold eng.cube.params
>>>> eng.tesseract_cube.nn
>>>> configs eng.cube.lm eng.cube.size eng.traineddata
>>>> eng.cube.bigrams eng.cube.nn eng.cube.word-freq tessconfigs
>>>>
>>>> 2
>>>> --------------------------
>>>> Making tesseract work with a .jpg from the scan gives EXTREMELY
>>>> better results than giving it a ppm "cleaned" by unpaper. In the
>>>> first case only 5 words in a page were mistaken, while with a cleaned
>>>> ppm tesseract gave only 3 comprehensible words in the whole page. No PDF
>>>> (jpg converted via LibreOffice) is accepted by tesseract, which gives:
>>>>
>>>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
>>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>>> Error in pixReadStream: Unknown format: no pix returned
>>>> Error in pixRead: pix not read
>>>> Unsupported image type.
>>>>
>>>> 3
>>>> ----------------------------
>>>> Having a metadata tag indicating a language in Mayan and using it to
>>>> set the language flag of tesseract can improve results a lot! (50
>>>> words per page) If my project ends up using Mayan I would try to
>>>> program this feature.
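As a rough illustration of observation 3 in Lau's message, here is a minimal Python sketch of turning a language metadata value into tesseract's `-l` flag. The function name, the metadata values and the mapping are all hypothetical and are not taken from Mayan's code; only the `eng` and `cat` codes come from the tessdata listing in the thread.

```python
# Hypothetical mapping from a document's "language" metadata value to a
# Tesseract language code. The keys are assumptions, not Mayan identifiers;
# the codes match the eng/cat traineddata files listed in the thread.
LANGUAGE_CODES = {
    "english": "eng",
    "catalan": "cat",
}


def build_tesseract_command(image_path, output_base, metadata_language=None):
    """Return the argument list for a tesseract run.

    The -l flag is only appended when the metadata value maps to a known
    language code; otherwise tesseract falls back to its default language.
    """
    command = ["tesseract", image_path, output_base]
    code = LANGUAGE_CODES.get((metadata_language or "").strip().lower())
    if code is not None:
        command += ["-l", code]
    return command


if __name__ == "__main__":
    print(build_tesseract_command("scan.jpg", "tessed", "Catalan"))
```

The returned list could be handed to `subprocess.call`, but that part is left out here since it depends on how the OCR queue invokes the binaries.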

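On the Office documents being detected as zip files: if upgrading libmagic1 is not an option, one possible workaround is to look inside the archive instead of relying on the magic rules. The sketch below is only an assumption about how that could be done and is not how Mayan or file(1) works; `guess_ooxml_mime_type` is a made-up name, and the MIME types are the ones from the rules quoted in this thread. It checks every entry name rather than a fixed offset, which might also help with files written by Libre Office if their entries are ordered differently (I have not verified that this is the cause).

```python
import zipfile

# MIME types copied from the magic rules quoted in this thread, keyed by the
# top-level folder each OOXML document type stores its content in.
OOXML_MIME_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess_ooxml_mime_type(path):
    """Guess the MIME type of a possible OOXML file.

    Returns an OOXML MIME type when a known folder is found in the archive,
    "application/zip" for any other zip file, and None for non-zip files.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Check all entries, regardless of their order inside the archive.
    for prefix, mime_type in OOXML_MIME_TYPES.items():
        if any(name.startswith(prefix) for name in names):
            return mime_type
    return "application/zip"
```

Calling `guess_ooxml_mime_type("testfile.docx")` and comparing the result with what the 'file' command reports should show quickly whether the detection step is what is going wrong.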