[Mayan EDMS: 605] Re: OCR error on .doc files

Alek Geldenberg Sun, 04 Aug 2013 07:04:46 -0700

Correction:

Changing the /etc/magic file as I described fixes uploading of files made by 
MS Office.  However, docx files created by LibreOffice are still recognized 
as zip files.  I would love to know how /etc/magic has to be modified so 
that docx, xlsx, and pptx files created by LibreOffice are also properly 
recognized.
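
Until the magic rules are sorted out, one way to tell these files apart from 
ordinary zip archives is to look inside the container instead of relying on 
libmagic.  The following is only a minimal sketch under assumptions: the 
function name guess_ooxml_mime and the OOXML_MIME_BY_FOLDER table are 
hypothetical and not part of Mayan, and the check assumes that an OOXML 
container always holds a [Content_Types].xml entry plus a word/, xl/ or ppt/ 
folder, regardless of the order in which the entries were written.

```python
import zipfile

# Hypothetical lookup table: top-level folder inside the container mapped to
# the registered MIME types quoted in the msooxml rules below.
OOXML_MIME_BY_FOLDER = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def guess_ooxml_mime(source):
    """Return an OOXML MIME type for a zip container, or None.

    `source` is a file path or a binary file object.  The entry order inside
    the archive does not matter; only the entry names are inspected.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        # Not a zip container at all.
        return None

    # Assumption: every OOXML package carries this manifest entry.
    if "[Content_Types].xml" not in names:
        return None

    for folder, mime in OOXML_MIME_BY_FOLDER.items():
        if any(name.startswith(folder) for name in names):
            return mime
    return None
```

Calling guess_ooxml_mime("testfile.docx") would return the wordprocessingml 
MIME type for a Word document and None for a plain zip file.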


On Sunday, August 4, 2013 9:58:34 AM UTC-4, Alek Geldenberg wrote:
>
> Roberto,
>
> Could you kindly post exactly what you did to "upgrade libmagic1 file"?  
> I have read some posts about changing the /etc/magic file using the content 
> of msooxml.  I tried to update it in two ways:
>
> #   Correct the mimetype with the registered ones:
> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
> >>>>&26         string          word/           Microsoft Word 2007+
> !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+
> !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
> >>>>&26         string          xl/             Microsoft Excel 2007+
> !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> >>>>&26         default         x               Microsoft OOXML
> !:strength +10
>
>
> and this way:
>
> #   Correct the mimetype with the registered ones:
> #     http://technet.microsoft.com/en-us/library/cc179224.aspx
> >>>>&26         string          word/           Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
> >>>>&26         string          ppt/            Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
> >>>>&26         string          xl/             Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> >>>>&26         default         x               Microsoft OOXML !:strength +10
>
>
>
> Here is the output of the file command:
> $ file testfile.docx
>
> /etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
> /etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
> /etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated
>
> However, none of that worked.  The docx files are still uploaded into 
> Mayan as zip files.
>
> I am running Ubuntu 12.04 LTS.
>
>
> I hope you can help me with this issue.
>
>
> On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>>
>> During a recent installation of Mayan, word processing documents (.docx) 
>> were being detected as zip/compressed files and OCR was failing on them. 
>> .docx files are in fact compressed archives containing several XML files. 
>> Upgrading the libmagic1 file allowed the 'file' command to detect the 
>> document as a "Microsoft Word 2007+" file and, upon reuploading, Mayan was 
>> able to OCR the documents correctly.  This could be one of the causes of 
>> the OCR failure being experienced in this thread.  Check whether the 
>> 'file' command correctly detects the document type.  
>>
>> This is the current list of file MIME types Mayan will pass to 
>> LibreOffice for conversion to PDF if detected: 
>> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>>
>>
>> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>>
>>> Hi Charles, Roberto and Steve, 
>>>
>>> I'm loving this software. I'm actually planning to start a file 
>>> digitization business for small businesses, and this software is the 
>>> one I like the most. 
>>>
>>> I'm having the same problem as you two: a simple error given by the 
>>> binaries in the OCR queue. 
>>>
>>> Following Roberto's advice, I'm stuck at running unpaper on a pdf, with 
>>> the same error about "the magic %P"; unpaper doesn't handle pdf!!! So 
>>> maybe Roberto can give us another way to check what is going on inside 
>>> Mayan so we can simulate it by hand. 
>>>
>>> As far as I can see there's no pdf output file from the "document as an 
>>> image" in the temporary folder, just a file called IHAKtmp which is 
>>> empty, so I guess the problem is at the first step, which should be the 
>>> LibreOffice jpg to pdf conversion. That may make sense, since we are 
>>> all using the same version of unpaper and tesseract and we may not be 
>>> using the same LibreOffice. 
>>>
>>> I'm in a hurry trying to figure out which is the best software for my 
>>> company, and I would happily make a donation once I have it working 
>>> locally. 
>>>
>>>
>>> Also, while trying to solve this issue I've come to these observations: 
>>>
>>>
>>> 1 
>>> -------------------------- 
>>> Tesseract has to have its language training files under /usr/local/ 
>>> in order to work 
>>>
>>> like this: 
>>>
>>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls 
>>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn 
>>> configs           eng.cube.lm    eng.cube.size       eng.traineddata 
>>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs 
>>>
>>>
>>> 2 
>>> -------------------------- 
>>> Running tesseract on a .jpg straight from the scan gives EXTREMELY 
>>> better results than giving it a ppm "cleaned" by unpaper: in the 
>>> first case only 5 words in a page were mistaken, while with a cleaned 
>>> ppm tesseract gave only 3 comprehensible words in the whole page. No 
>>> PDF (jpg converted via LibreOffice) is accepted by tesseract, which 
>>> gives: 
>>>
>>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed 
>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica 
>>> Error in pixReadStream: Unknown format: no pix returned 
>>> Error in pixRead: pix not read 
>>> Unsupported image type. 
>>>
>>> 3 
>>> ---------------------------- 
>>> Having a metadata tag indicating a language in Mayan and using it to 
>>> set the language flag of tesseract can improve results a lot! (50 
>>> words per page) If my project ends up using Mayan I would try to 
>>> program this feature. 
>>>
>>
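
Lau's third observation above (choosing tesseract's language from a document 
metadata value) could look roughly like the sketch below.  This is an 
illustration under assumptions, not existing Mayan code: the names 
LANGUAGE_CODES and build_tesseract_command are hypothetical, and the only 
language codes used are the two whose .traineddata files appear in the 
listing above (eng and cat).

```python
# Hypothetical mapping from a metadata value to a tesseract language code,
# i.e. the name of an installed <code>.traineddata file.
LANGUAGE_CODES = {"english": "eng", "catalan": "cat"}


def build_tesseract_command(image_path, output_base, language=None, default="eng"):
    """Build the argument list for 'tesseract imagename outputbase -l lang'.

    `language` is the free-text metadata value; unknown or missing values
    fall back to `default`.
    """
    key = (language or "").strip().lower()
    code = LANGUAGE_CODES.get(key, default)
    return ["tesseract", image_path, output_base, "-l", code]
```

The resulting list can be handed to subprocess.check_call(); for example 
build_tesseract_command("scan.jpg", "tessed", "Catalan") selects the cat 
training data.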
