[Mayan EDMS: 606] Re: OCR error on .doc files

Youri Lacan-Bartley Mon, 05 Aug 2013 02:49:24 -0700

Hi Alek,

I imagine Roberto meant upgrading libmagic1 itself, not modifying the 
/etc/magic config file, which to my understanding is specific to the 
file(1) command.
Which version of libmagic1 are you using?
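
If it helps, here is a minimal Python sketch for checking what file(1) 
reports for a document before uploading it. It is only an illustration and 
not what Mayan does internally: the helper names detect_mime and 
is_detected_as_ooxml are made up, and the MIME types are copied from the 
magic rules quoted below.

```python
import os
import shutil
import subprocess

# MIME types taken from the magic rules quoted later in this thread.
OOXML_MIME_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def detect_mime(path):
    """Return the MIME type reported by file(1), or None if it cannot be determined.

    Hypothetical helper: returns None when the path is not a regular file,
    when file(1) is not installed, or when the command exits with an error.
    """
    if not os.path.isfile(path) or shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def is_detected_as_ooxml(mime):
    """Return True if the given MIME type is one of the OOXML types above."""
    return mime in OOXML_MIME_TYPES
```

If is_detected_as_ooxml(detect_mime("testfile.docx")) comes back False, 
the file is presumably still being reported as a plain zip archive (or 
file(1) is missing).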


MS Office documents seem to be detected correctly for me with version 
5.11-2.
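
As for the LibreOffice-generated files: as far as I understand, the OOXML 
magic rules look at offsets near the start of the archive, so they may 
depend on the order in which the ZIP entries were written. I have not 
verified that this is what happens with LibreOffice. Purely as a sketch of 
an order-independent check, assuming that a [Content_Types].xml entry plus 
a word/, ppt/ or xl/ folder is enough to identify the type 
(guess_ooxml_mime and build_sample_docx are hypothetical helpers, not part 
of Mayan or libmagic):

```python
import io
import zipfile

CONTENT_TYPES = "[Content_Types].xml"

# Folder prefixes and MIME types as they appear in the magic rules quoted below.
OOXML_MIME_BY_FOLDER = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess_ooxml_mime(source):
    """Guess the OOXML MIME type of a path or file-like object.

    Returns None if the source is not a ZIP archive or does not look like
    an OOXML package. Entry order inside the archive is ignored.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return None
    if CONTENT_TYPES not in names:
        return None
    for folder, mime in OOXML_MIME_BY_FOLDER.items():
        if any(name.startswith(folder) for name in names):
            return mime
    return None


def build_sample_docx():
    """Build a tiny in-memory archive laid out like a .docx, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The word/ entry is written first on purpose: order should not matter.
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr(CONTENT_TYPES, "<Types/>")
    buffer.seek(0)
    return buffer


print(guess_ooxml_mime(build_sample_docx()))
```

This only inspects entry names, so it says nothing about whether the 
document itself is valid.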

On Sunday, 4 August 2013 16:03:55 UTC+2, Alek Geldenberg wrote:
>
> Correction:
>
> Changing the /etc/magic file as I described fixes uploading files made by 
> MS Office.  However, docx files created by LibreOffice are still 
> recognized as zip files.  I would love to know how /etc/magic has to be 
> modified so that docx, xlsx and pptx files created by LibreOffice are 
> also properly recognized.
>
> On Sunday, August 4, 2013 9:58:34 AM UTC-4, Alek Geldenberg wrote:
>>
>> Roberto,
>>
>> Could you kindly post exactly what you did to "upgrade libmagic1 
>> file"?  I have read some posts about changing the /etc/magic file using 
>> the content of msooxml.  I tried to do that in two ways:
>>
>> #   Correct the mimetype with the registered ones:
>> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
>> >>>>&26         string          word/           Microsoft Word 2007+
>> !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+
>> !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>> >>>>&26         string          xl/             Microsoft Excel 2007+
>> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>> >>>>&26         default         x               Microsoft OOXML
>> !:strength +10
>>
>>
>> and this way:
>>
>> #   Correct the mimetype with the registered ones:
>> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
>> >>>>&26         string          word/           Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>> >>>>&26         string          xl/             Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>> >>>>&26         default         x               Microsoft OOXML !:strength +10
>>
>>
>>
>> Here is the output of the file command:
>> $ file testfile.docx
>>
>> /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
>> /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
>> /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated
>>
>> However, none of that worked.  The docx files are still uploaded into 
>> Mayan as zip files.
>>
>> I am running Ubuntu 12.04 LTS.
>>
>>
>> I hope you can help me with this issue.
>>
>>
>> On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>>>
>>> During a recent installation of Mayan, word processing documents (.docx) 
>>> were being detected as zip/compressed files and OCR was failing on them. 
>>> .docx files are in fact compressed files containing several XML files. 
>>> Upgrading the libmagic1 file allowed the 'file' command to detect the 
>>> document as a "Microsoft Word 2007+" file and, upon reuploading, Mayan was 
>>> able to OCR the documents correctly.  This could be one of the causes of 
>>> the OCR failure being experienced in this thread.  Check whether the 
>>> 'file' command correctly detects the document type.
>>>
>>> This is the current list of file MIME types Mayan will pass to 
>>> LibreOffice for conversion to PDF if detected: 
>>> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>>>
>>>
>>> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>>>
>>>> Hi Charles, Roberto and Steve, 
>>>>
>>>> I'm loving this software. I'm actually planning to start a file 
>>>> digitization business for small businesses, and this is the software 
>>>> I like best so far. 
>>>>
>>>> I'm having the same problem as you two: a simple error given by the 
>>>> binaries in the OCR queue. 
>>>>
>>>> Following Roberto's advice, I'm stuck at running unpaper on a PDF, with 
>>>> the same error about "the magic %P". unpaper doesn't handle PDF! So 
>>>> maybe Roberto can give us another way to check what is going on inside 
>>>> Mayan so we can simulate it by hand. 
>>>>
>>>> As far as I can see, there is no PDF output file from the "document as 
>>>> an image" step in the temporary folder, just an empty file called 
>>>> IHAKtmp, so I guess the problem is at the first step, which should be 
>>>> the LibreOffice JPG-to-PDF conversion. That would make sense, since we 
>>>> are all using the same versions of unpaper and tesseract but may not be 
>>>> using the same LibreOffice. 
>>>>
>>>> I'm in a hurry trying to figure out which software is best for my 
>>>> company, and I would happily make a donation once I have it working 
>>>> locally. 
>>>>
>>>>
>>>> Also, while trying to solve this issue, I've come to these observations: 
>>>>
>>>>
>>>> 1 
>>>> -------------------------- 
>>>> Tesseract needs its language training files under 
>>>> /usr/local/share/tessdata in order to work 
>>>>
>>>> like this: 
>>>>
>>>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls 
>>>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn 
>>>> configs           eng.cube.lm    eng.cube.size       eng.traineddata 
>>>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs 
>>>>
>>>>
>>>> 2 
>>>> -------------------------- 
>>>> Running tesseract directly on a .jpg from the scan gives MUCH 
>>>> better results than giving it a ppm "cleaned" by unpaper. In the 
>>>> first case only 5 words on a page were mistaken; with a cleaned ppm, 
>>>> tesseract produced only 3 comprehensible words on the whole page. No 
>>>> PDF (a jpg converted via LibreOffice) is accepted by tesseract, which 
>>>> gives: 
>>>>
>>>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed 
>>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica 
>>>> Error in pixReadStream: Unknown format: no pix returned 
>>>> Error in pixRead: pix not read 
>>>> Unsupported image type. 
>>>>
>>>> 3 
>>>> ---------------------------- 
>>>> Having a metadata tag that indicates the document language in Mayan, 
>>>> and using it to set tesseract's language flag, can improve results a 
>>>> lot! (50 words per page) If my project ends up using Mayan, I would 
>>>> try to program this feature. 
>>>>
>>>
