[Mayan EDMS: 1916] Re: OCR quality JPG vs. PDF

Florian Beverborg Mon, 24 Jul 2017 23:43:47 -0700

Hi Roberto

Using an index would keep metadata per page, right? That would not be 
ideal, but I'll look into that and also into SmartLinks.


Regarding pdftoppm, the manpage says:
*-r* *number*
Specifies the X and Y resolution, in DPI. The default is 150 DPI.

Is it possible that the DPI value saved in the JPGs (explicitly set by me 
to 300x300 with unit type "DPI") is not carried over to the ppm file or the 
OCR process? I've seen similar OCR issues with tesseract when the DPI value 
is not correct. Is there a way to force the DPI to 300 for all documents 
(all of them are scanned at 300 DPI), maybe by editing the call to pdftoppm 
in the code as a quick fix for me? Or maybe this is already implemented as a 
file metadata flag?
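
To make the quick fix concrete, here is a rough sketch of the kind of call I 
have in mind. It is not Mayan's actual converter code: the function names and 
the file names are made up for illustration, and only the -r flag comes from 
the manpage quoted above.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, output_root, dpi=300):
    """Build the pdftoppm argument list with an explicit resolution.

    Hypothetical helper, not part of Mayan. "-r" sets both the X and Y
    resolution in DPI; without it pdftoppm renders at its 150 DPI default.
    """
    return ["pdftoppm", "-r", str(dpi), pdf_path, output_root]


def render_pdf(pdf_path, output_root, dpi=300):
    """Render every page of pdf_path to <output_root>-<page>.ppm files."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH")
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(build_pdftoppm_command(pdf_path, output_root, dpi), check=True)


if __name__ == "__main__":
    # Only prints the command; call render_pdf() to actually run it.
    print(" ".join(build_pdftoppm_command("scan.pdf", "page")))
```

If the pages really are rendered at the 150 DPI default, a 300 DPI scan would 
reach the OCR step at half its original pixel width and height, which could 
explain the worse recognition.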

Regards,
Flo

On Tuesday, 25 July 2017 08:00:51 UTC+2, Roberto Rosario wrote:
>
> Hello,
>
> I recently published a blog post explaining how the converter works: 
> http://www.mayan-edms.org/post/mayan-converter/
> In the case of PDF files, the utility pdftoppm is used to convert the 
> pages into images. You can use pdftoppm on the PDF files
> made by img2pdf to see the actual image Mayan is receiving and spot any 
> degradation. 
>
> As for your questions:
> 1) The OCR doesn't preprocess the images before doing the recognition. 
> This is being worked on (there is already a scanline filter to reduce 
> pre-OCR images to 2 colors), but it is not available to the user yet. When 
> available, it will be possible to apply a stack of transformations to the 
> document images before performing the OCR task.  
> 2) Strictly speaking about file types, there is no way to make a 
> multi-page JPEG; the format doesn't support it (JPEG 2000 has the JPM and 
> JPX extensions, which might do, but I don't know how good Pillow's JPEG 2000 
> support is). Another JPEG format which could be used is MJPG, but it is for 
> video and it would be a hackish attempt to convert the frames to pages. On 
> the platform side, you can already group images in Mayan using an Index 
> or a SmartLink. All the JPEG uploads need is a unique marker (like a 
> metadata value or a filename fragment). This can be accomplished via the UI 
> and the API. For example, the index template {{ document.label|slice:":4" 
> }} will group all documents with the same first 4 characters in the name. 
> To use a different part of the filename for the grouping, just change the 
> slice argument (
> http://www.diveintopython3.net/native-datatypes.html#slicinglists).
>  
> On Monday, July 24, 2017 at 1:57:31 PM UTC-4, Florian Beverborg wrote:
>>
>> Hi all!
>>
>> I'm currently evaluating Mayan as a replacement for my current DMS. The 
>> documents are all in the JPG format, multiple pages of the same document 
>> per folder, scanned at 300dpi. So far adding JPGs does not allow me to 
>> create multi-page documents. I used img2pdf to generate multi-page PDFs for 
>> import into Mayan, which mostly works fine. BUT: The OCR-quality for the 
>> same page is worse when using the PDF files.
>>
>> I've tried multiple ways to generate the combined PDF and I can see some 
>> differences but never managed to get the same recognition quality as using 
>> the pure JPG. Since img2pdf (to my knowledge) does not touch the actual JPG 
>> data and since I'm using PDF page size fit to image size I don't know 
>> what's going wrong here. The PDFs look fine in my PDF viewer and are 
>> reported to have correct page sizes. Generating the pages with imagemagick 
>> does not improve recognition.
>>
>> This leads me to the conclusion that the PDFs are rendered internally 
>> which degrades the quality.
>>
>> I have two questions:
>>
>> 1) What can I do to improve PDF recognition quality, either in generating 
>> the PDF or in Mayan settings?
>> 2) Is there another way to make multi-page documents from JPGs? Maybe 
>> using the REST-API?
>>
>> Using Mayan version 2.6.2
>>
>> Cheers,
>> Flo
>>
>>
>>
