[Mayan EDMS: 604] Re: OCR error on .doc files

Alek Geldenberg Sun, 04 Aug 2013 06:58:50 -0700

Roberto,

Could you kindly post exactly what you did to "upgrade libmagic1 file"?
I have read some posts about changing the /etc/magic file using the contents
of msooxml.  I tried to update it in two ways:


#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML
!:strength +10


and this way:

#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+ !:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML !:strength +10
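
For context, both variants rely on the same test: the rule matches a zip
entry whose name starts with word/, ppt/ or xl/.  A rough Python sketch of
that idea, using only the standard zipfile module, can serve as an
independent check that a file really is OOXML, whatever libmagic reports.
The names OOXML_PREFIXES and guess_ooxml_mime are made up for illustration,
and this is not how the file command or Mayan implements detection.

```python
import io
import zipfile

# Directory prefixes tested by the magic rules, mapped to the MIME types
# those rules assign.
OOXML_PREFIXES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess_ooxml_mime(source):
    """Return an OOXML MIME type for a zip archive, or None if unrecognised.

    `source` may be a file path or a binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            for prefix, mime in OOXML_PREFIXES.items():
                if name.startswith(prefix):
                    return mime
    # A zip archive, but with none of the expected OOXML directories.
    return None
```

Calling guess_ooxml_mime("testfile.docx") on a genuine Word document should
return the wordprocessingml type; a None result would suggest the file
itself, rather than /etc/magic, is the problem.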



Here is the output of the file command:
$ file testfile.docx

/etc/magic, 31: Warning: description `Microsoft Word 2007+ !:mime application/vnd.openxmlformats-offi' truncated
/etc/magic, 32: Warning: description `Microsoft PowerPoint 2007+ !:mime application/vnd.openxmlformat' truncated
/etc/magic, 33: Warning: description `Microsoft Excel 2007+ !:mime application/vnd.openxmlformats-off' truncated

However, neither variant worked.  The .docx files are still detected as zip
files when uploaded into Mayan.
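
A quick way to see what libmagic currently reports, separately from Mayan,
is to ask the file command for the MIME type only.  The sketch below wraps
`file --brief --mime-type` and classifies the answer.  It assumes Python 3.7
or later and a file binary on the PATH; detect_mime and diagnose are
hypothetical helper names, not part of Mayan.

```python
import subprocess

DOCX_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
ZIP_TYPE = "application/zip"


def detect_mime(path):
    """Ask the system `file` command for the MIME type of `path`."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def diagnose(mime):
    """Classify a MIME type reported for a file that should be a .docx."""
    if mime == DOCX_TYPE:
        return "ok"           # the magic rules matched as intended
    if mime == ZIP_TYPE:
        return "generic-zip"  # the OOXML rules were not applied
    return "other"
```

For example, diagnose(detect_mime("testfile.docx")) returning "generic-zip"
would match the symptom described here, with the OOXML rules not taking
effect.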

I am running Ubuntu 12.04 LTS.


I hope you can help me with this issue.


On Monday, December 17, 2012 2:09:59 PM UTC-5, Roberto Rosario wrote:
>
> During a recent installation of Mayan, word-processing documents (.docx)
> were being detected as zip/compressed files and OCR was failing on them.
> .docx files are in fact compressed archives containing several XML files.
> Upgrading the libmagic1 file allowed the 'file' command to detect the
> document as a "Microsoft Word 2007+" file and, upon re-uploading, Mayan was
> able to OCR the documents correctly.  This could be one of the causes of
> the OCR failure being experienced in this thread.  Check whether the
> 'file' command correctly detects the document type.
>
> This is the current list of file MIME types Mayan will pass to LibreOffice 
> for conversion to PDF if detected: 
> https://github.com/rosarior/mayan/blob/master/apps/converter/office_converter.py#L17
>
>
> On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>>
>> Hi Charles, Roberto and Steve,
>>
>> I'm loving this software. I'm actually planning to start a file
>> digitization business for small businesses, and this is the software
>> I like most so far.
>>
>> I'm having the same problem as you two: a simple error reported by the
>> binaries in the OCR queue.
>>
>> Following Roberto's advice, I'm stuck at running unpaper on a PDF, with
>> the same error about "the magic %P". unpaper doesn't handle PDF! So
>> perhaps Roberto can give us another way to check what is going on inside
>> Mayan, so we can simulate it by hand.
>>
>> As far as I can see, there is no PDF output file from the "document as an
>> image" step in the temporary folder, just an empty file called IHAKtmp,
>> so I guess the problem is at the first step, which should be the
>> LibreOffice jpg-to-PDF conversion. That would make sense, since we are
>> all using the same version of unpaper and tesseract but may not be
>> using the same LibreOffice.
>>
>> I'm in a hurry trying to figure out which software is best for my
>> company, and I would happily make a donation once I have it working
>> locally.
>>
>>
>> Also, while trying to solve this issue, I've come to these observations:
>>
>>
>> 1 
>> -------------------------- 
>> Tesseract has to have its language training files under /usr/local/
>> in order to work,
>>
>> like this: 
>>
>> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls 
>> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
>> configs           eng.cube.lm    eng.cube.size       eng.traineddata
>> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
>>
>>
>> 2 
>> -------------------------- 
>> Making tesseract work with a .jpg from the scan gives EXTREMELY
>> better results than giving it a ppm "cleaned" by unpaper. In the
>> first case only 5 words on a page were mistaken, while with a cleaned ppm
>> tesseract gave only 3 comprehensible words on the whole page. No PDF
>> (jpg converted via LibreOffice) is accepted by tesseract, which gives:
>>
>> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed 
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica 
>> Error in pixReadStream: Unknown format: no pix returned 
>> Error in pixRead: pix not read 
>> Unsupported image type. 
>>
>> 3 
>> ---------------------------- 
>> Having a metadata tag indicating a language in Mayan and using it to
>> set the language flag of tesseract can improve results a lot! (50
>> words per page.) If my project ends up using Mayan, I would try to
>> program this feature.
>>
>
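
The first and third observations in the quoted message (training files in
the tessdata directory, and a per-document language passed to tesseract)
could be combined roughly as follows.  This is only a sketch: the function
names are invented for illustration, the language would have to come from
whatever metadata field is chosen in Mayan, and the only tesseract option
used is the -l language flag.

```python
import os


def available_languages(tessdata_dir):
    """List language codes that have a .traineddata file in `tessdata_dir`."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


def tesseract_command(image_path, output_base, language, installed):
    """Build a tesseract command line for `image_path`.

    Falls back to "eng", tesseract's default language, when the requested
    language has no training data installed.
    """
    lang = language if language in installed else "eng"
    return ["tesseract", image_path, output_base, "-l", lang]
```

With the directory listing shown in the quoted message,
available_languages("/usr/local/share/tessdata") would be expected to yield
the cat and eng codes, and the resulting list can be handed to
subprocess.run to launch tesseract.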

