Great work Florian! I will find a way to expose this via the settings
system. I think it can be included in the next minor version (2.7). Yes,
please, you can open an issue
here: https://gitlab.com/mayan-edms/mayan-edms/issues
A test document would be an even greater help. Thank you!
On Wednesday, July 26, 2017 at 3:54:38 PM UTC-4, Florian Beverborg wrote:
>
> Hi Roberto
>
> I changed the source to force pdftoppm to use 300 dpi for all files. This
> not only fixes the initial issue of PDFs having worse recognition quality
> than JPGs, but even improves some details such as punctuation, so the
> quality is now actually better with the PDF.
>
> I regard this issue as resolved now (for myself), but maybe we can find a
> less hacky way for all people? What I did was change line 37 of
> /usr/local/lib/python2.7/dist-packages/mayan/apps/converter/backends/python.py
>
> to
>
> pdftoppm = pdftoppm.bake('-jpeg', '-r', '300')
>
> Is there a way for me to open a bug report, or how do we proceed? I guess
> I could supply you with a test document as well, if needed.
>
> Cheers,
> Flo
>
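A less hacky variant of the change quoted above could take the resolution as a parameter instead of hard-coding it. The sketch below is only an illustration under stated assumptions: it builds the pdftoppm argument list with the standard library rather than with the bake call from the quoted line, and the names build_pdftoppm_command, render_pdf and DEFAULT_DPI are hypothetical, not part of Mayan. The -jpeg, -r, -f and -l switches are regular pdftoppm options. According to its man page, pdftoppm renders at 150 DPI when -r is not given, which would explain the loss of detail on 300 dpi scans.

```python
import subprocess

# Hypothetical default; 300 matches the scan resolution mentioned above.
DEFAULT_DPI = 300


def build_pdftoppm_command(pdf_path, output_prefix, dpi=DEFAULT_DPI, page=None):
    """Return the argument list for rendering a PDF to JPEG with pdftoppm."""
    if dpi <= 0:
        raise ValueError('dpi must be a positive integer')
    command = ['pdftoppm', '-jpeg', '-r', str(dpi)]
    if page is not None:
        # -f and -l select the first and last page to render.
        command += ['-f', str(page), '-l', str(page)]
    command += [pdf_path, output_prefix]
    return command


def render_pdf(pdf_path, output_prefix, dpi=DEFAULT_DPI, page=None):
    """Run pdftoppm; requires the poppler utilities to be installed."""
    subprocess.check_call(
        build_pdftoppm_command(pdf_path, output_prefix, dpi, page)
    )
```

Calling render_pdf('scan.pdf', 'page', page=1) would write the first page as a JPEG file whose name starts with "page", which can then be compared against the original scan to spot any degradation.
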
> On Tuesday, July 25, 2017 at 8:00:51 AM UTC+2, Roberto Rosario wrote:
>>
>> Hello,
>>
>> I recently published a blog post explaining how the converter works:
>> http://www.mayan-edms.org/post/mayan-converter/
>> In the case of PDF files, the utility pdftoppm is used to convert the
>> pages into images. You can use pdftoppm on the PDF files
>> made by img2pdf to see the actual image Mayan is receiving and spot any
>> degradation.
>>
>> As for your questions:
>> 1) The OCR doesn't preprocess the images before doing the recognition.
>> This is something being worked on (there is already a scanline filter to
>> reduce pre-OCR images to 2 colors), but it is not available to the user
>> yet. When available, it will be possible to apply a stack of
>> transformations to the document images before performing the OCR task.
>> 2) Strictly speaking about file types, there is no way to make a
>> multi-page JPEG; the format doesn't support it (JPEG 2000 has the JPM and
>> JPX extensions, which might do, but I don't know how good Pillow's JPEG
>> 2000 support is). Another JPEG format that could be used is MJPG, but it
>> is meant for video, and converting the frames to pages would be a hackish
>> approach. On the platform side, you can already group images with Mayan
>> using an Index or a SmartLink. All the JPEG uploads need is a unique
>> marker (like a metadata value or a filename fragment). This can be
>> accomplished via the UI and the API. For example, the index template
>> {{ document.label|slice:":4" }} will group all documents with the same
>> first 4 characters in the name. To use a different part of the filename
>> for the grouping, just change the slice argument (
>> http://www.diveintopython3.net/native-datatypes.html#slicinglists).
>>
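To make the grouping in point 2 concrete: the slice template filter follows Python's own slicing syntax, so the effect of the index template can be imitated in plain Python. This is only a rough sketch of the idea, not how Mayan builds its indexes; the function name group_by_label_slice and the sample filenames are made up for illustration.

```python
from collections import defaultdict


def group_by_label_slice(labels, start=None, stop=4):
    """Group labels by the fragment label[start:stop].

    With the defaults this mirrors slice:":4", i.e. the first
    four characters of each label.
    """
    groups = defaultdict(list)
    for label in labels:
        groups[label[start:stop]].append(label)
    return dict(groups)


# Hypothetical scan filenames sharing a four-character document prefix.
pages = ['0001_page1.jpg', '0001_page2.jpg', '0002_page1.jpg']
grouped = group_by_label_slice(pages)
```

Here grouped holds one entry per prefix, with '0001' collecting the first two files. Passing start=5, stop=10 would group on characters 5 through 9 instead, the same as changing the slice argument in the template to "5:10".
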
>> On Monday, July 24, 2017 at 1:57:31 PM UTC-4, Florian Beverborg wrote:
>>>
>>> Hi all!
>>>
>>> I'm currently evaluating Mayan as a replacement for my current DMS. The
>>> documents are all in the JPG format, multiple pages of the same document
>>> per folder, scanned at 300dpi. So far adding JPGs does not allow me to
>>> create multi-page documents. I used img2pdf to generate multi-page PDFs for
>>> import into Mayan, which mostly works fine. BUT: The OCR-quality for the
>>> same page is worse when using the PDF files.
>>>
>>> I've tried multiple ways to generate the combined PDF and I can see some
>>> differences but never managed to get the same recognition quality as using
>>> the pure JPG. Since img2pdf (to my knowledge) does not touch the actual JPG
>>> data and since I'm using PDF page size fit to image size I don't know
>>> what's going wrong here. The PDFs look fine in my PDF viewer and are
>>> reported to have correct page sizes. Generating the pages with imagemagick
>>> does not improve recognition.
>>>
>>> This leads me to the conclusion that the PDFs are rendered internally
>>> which degrades the quality.
>>>
>>> I have two questions:
>>>
>>> 1) What can I do to improve PDF recognition quality, either in
>>> generating the PDF or in Mayan settings?
>>> 2) Is there another way to make multi-page documents from JPGs? Maybe
>>> using the REST-API?
>>>
>>> Using Mayan version 2.6.2
>>>
>>> Cheers,
>>> Flo
>>>
>>>
>>>