[Mayan EDMS: 1939] Re: OCR quality JPG vs. PDF

Roberto Rosario Thu, 27 Jul 2017 23:13:04 -0700

Thanks!

On Friday, July 28, 2017 at 2:11:44 AM UTC-4, Florian Beverborg wrote:
>
> I've created issue #416.
>
> Regarding the quick fix you mentioned: maybe it makes more sense to expose 
> this as a per-document-type setting? That would require much more 
> development and testing, though, so I can see why the quick fix would be 
> nice to have for now. I've gone into more detail in the issue; let's take 
> the discussion there ;)
>
> Cheers,
> Flo
>
> On Friday, July 28, 2017 at 02:02:21 UTC+2, Roberto Rosario wrote:
>>
>> Great work Florian! I will find a way to expose this via the settings 
>> system. I think it can be included in the next minor version (2.7). Yes, 
>> please, you can open an issue here: 
>> https://gitlab.com/mayan-edms/mayan-edms/issues
>>
>> A test document would be an even greater help. Thank you!
>>
>>
>> On Wednesday, July 26, 2017 at 3:54:38 PM UTC-4, Florian Beverborg wrote:
>>>
>>> Hi Roberto
>>>
>>> I changed the source to force pdftoppm to use 300 dpi for all files. 
>>> This not only fixes the initial issue that PDF has a worse recognition 
>>> quality than JPG, but indeed even improves some details regarding 
>>> punctuation and the quality is now even better in the PDF.
>>>
>>> I regard this issue as resolved now (for myself), but maybe we can find 
>>> a less hacky way for everyone? What I did was change line 37 of 
>>> /usr/local/lib/python2.7/dist-packages/mayan/apps/converter/backends/python.py
>>> to
>>>
>>> pdftoppm = pdftoppm.bake('-jpeg', '-r', '300')
>>>
>>> Is there a way for me to open a bug report, or how do we proceed? I guess 
>>> I could supply you with a test document as well, if needed.
>>>
>>> Cheers,
>>> Flo
>>>
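
For anyone who wants to check the effect of the resolution outside Mayan, here is a minimal sketch that renders a PDF to JPEG pages at a chosen DPI. It assumes the pdftoppm binary from poppler-utils is on the PATH and calls it through the standard subprocess module rather than through Mayan's converter backend. The function names, file names and the 300 dpi default are illustrative assumptions, not Mayan code or settings.

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the argument list for rendering a PDF to JPEG pages at `dpi`.

    Hypothetical helper for illustration; not part of Mayan EDMS.
    """
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")
    # -jpeg selects JPEG output; -r sets the X and Y resolution in DPI.
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, output_prefix]


def render_pdf_pages(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

Calling render_pdf_pages("scan.pdf", "page") would write one JPEG per page using the prefix "page". Comparing those images with the original scans shows whether the rendering step loses detail before OCR.
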
>>> On Tuesday, July 25, 2017 at 08:00:51 UTC+2, Roberto Rosario wrote:
>>>>
>>>> Hello,
>>>>
>>>> I recently published a blog post explaining how the converter works: 
>>>> http://www.mayan-edms.org/post/mayan-converter/
>>>> In the case of PDF files, the utility pdftoppm is used to convert the 
>>>> pages into images. You can use pdftoppm on the PDF files
>>>> made by img2pdf to see the actual image Mayan is receiving and spot any 
>>>> degradation. 
>>>>
>>>> As for your questions:
>>>> 1) The OCR doesn't preprocess the images before doing the recognition. 
>>>> This is something being worked on (there is already a scanline filter to 
>>>> reduce pre-OCR images to 2 colors), but it is not available to the user 
>>>> yet. When available, it will be possible to apply a stack of 
>>>> transformations to the document images before performing the OCR task.
>>>> 2) Strictly speaking about file types, there is no way to make a 
>>>> multi-page JPEG; the format doesn't support it (JPEG 2000 has the JPM and 
>>>> JPX extensions, which might work, but I don't know how good Pillow's 
>>>> JPEG 2000 support is). Another JPEG format which could be used is MJPG, 
>>>> but it is for video and converting the frames to pages would be a hackish 
>>>> attempt. On the platform side, you can already group images with Mayan 
>>>> using an Index or a SmartLink. All the JPEG uploads need is a unique 
>>>> marker (like a metadata value or a filename fragment). This can be 
>>>> accomplished via the UI and the API. For example, the index template 
>>>> {{ document.label|slice:":4" }} will group all documents whose names 
>>>> share the same first 4 characters. To use a different part of the 
>>>> filename for the grouping, just change the slice argument 
>>>> (http://www.diveintopython3.net/native-datatypes.html#slicinglists).
>>>>  
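
The grouping that the slice template performs can be sketched in plain Python, since the slice filter follows Python's slicing rules. This only illustrates the idea and is not how Mayan builds its indexes; the function name and the sample file names are made up for the example.

```python
from collections import defaultdict


def group_by_label_slice(labels, start=None, stop=4):
    """Group labels by the substring label[start:stop].

    With the defaults this mirrors slice:":4", i.e. the first 4 characters.
    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for label in labels:
        groups[label[start:stop]].append(label)
    return dict(groups)


# Made-up file names: pages of the same document share a 4-character prefix.
labels = ["0001_page1.jpg", "0001_page2.jpg", "0002_page1.jpg"]
print(group_by_label_slice(labels))
```

Passing different start and stop values corresponds to changing the slice argument in the template; for example, start=5, stop=9 would group on characters 5 to 8 of the name instead.
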
>>>> On Monday, July 24, 2017 at 1:57:31 PM UTC-4, Florian Beverborg wrote:
>>>>>
>>>>> Hi all!
>>>>>
>>>>> I'm currently evaluating Mayan as a replacement for my current DMS. 
>>>>> The documents are all in the JPG format, multiple pages of the same 
>>>>> document per folder, scanned at 300dpi. So far adding JPGs does not allow 
>>>>> me to create multi-page documents. I used img2pdf to generate multi-page 
>>>>> PDFs for import into Mayan, which mostly works fine. BUT: The OCR-quality 
>>>>> for the same page is worse when using the PDF files.
>>>>>
>>>>> I've tried multiple ways to generate the combined PDF and I can see 
>>>>> some differences, but I never managed to get the same recognition 
>>>>> quality as with the pure JPG. Since img2pdf (to my knowledge) does not 
>>>>> touch the actual JPG data, and since I'm using a PDF page size fit to 
>>>>> the image size, I don't know what's going wrong here. The PDFs look fine 
>>>>> in my PDF viewer and are reported to have correct page sizes. Generating 
>>>>> the pages with ImageMagick does not improve recognition.
>>>>>
>>>>> This leads me to the conclusion that the PDFs are rendered internally 
>>>>> in a way that degrades the quality.
>>>>>
>>>>> I have two questions:
>>>>>
>>>>> 1) What can I do to improve PDF recognition quality, either in 
>>>>> generating the PDF or in Mayan settings?
>>>>> 2) Is there another way to make multi-page documents from JPGs? Maybe 
>>>>> using the REST-API?
>>>>>
>>>>> Using Mayan version 2.6.2
>>>>>
>>>>> Cheers,
>>>>> Flo
>>>>>
>>>>>
>>>>>

