Thanks!

On Friday, July 28, 2017 at 2:11:44 AM UTC-4, Florian Beverborg wrote:
>
> I've created issue #416.
>
> Regarding the quick fix you mentioned: Maybe it makes more sense to expose
> this as a per-document-type setting? But that would require much more
> development and testing, so yeah, I can see why that would be nice to have
> for now. I've gone into more detail in the issue, let's take the
> discussion there ;)
>
> Cheers,
> Flo
>
> On Friday, July 28, 2017 at 02:02:21 UTC+2, Roberto Rosario wrote:
>>
>> Great work Florian! I will find a way to expose this via the settings
>> system. I think it can be included in the next minor version (2.7). Yes,
>> please, you can open an issue here:
>> https://gitlab.com/mayan-edms/mayan-edms/issues
>>
>> A test document would be an even greater help. Thank you!
>>
>>
>> On Wednesday, July 26, 2017 at 3:54:38 PM UTC-4, Florian Beverborg wrote:
>>>
>>> Hi Roberto
>>>
>>> I changed the source to force pdftoppm to use 300 dpi for all files.
>>> This not only fixes the initial issue that PDF has a worse recognition
>>> quality than JPG, but even improves some details regarding
>>> punctuation, and the quality is now even better in the PDF.
>>>
>>> I regard this issue as resolved now (for myself), but maybe we can find
>>> a less hacky way for all people? What I did was change line 37 of
>>> /usr/local/lib/python2.7/dist-packages/mayan/apps/converter/backends/python.py
>>>
>>> to
>>>
>>> pdftoppm = pdftoppm.bake('-jpeg', '-r', '300')
>>>
>>> Is there a way for me to open a bug report, or how do we proceed? I guess
>>> I could supply you with a test document as well, if needed.
>>>
>>> Cheers,
>>> Flo
>>>
>>> On Tuesday, July 25, 2017 at 08:00:51 UTC+2, Roberto Rosario wrote:
>>>>
>>>> Hello,
>>>>
>>>> I recently published a blog post explaining how the converter works:
>>>> http://www.mayan-edms.org/post/mayan-converter/
>>>> In the case of PDF files, the utility pdftoppm is used to convert the
>>>> pages into images.
>>>> You can use pdftoppm on the PDF files
>>>> made by img2pdf to see the actual image Mayan is receiving and spot any
>>>> degradation.
>>>>
>>>> As for your questions:
>>>> 1) The OCR doesn't preprocess the images before doing the recognition.
>>>> This is being worked on (there is already a scanline filter to reduce
>>>> pre-OCR images to 2 colors), but it is not available to the user yet. When
>>>> available, it will be possible to apply a stack of transformations to the
>>>> document images before performing the OCR task.
>>>> 2) Strictly speaking about file types, there is no way to make a
>>>> multi-page JPEG; the format doesn't support it (JPEG 2000 has the JPM and
>>>> JPX extensions, which might do, but I don't know how good Pillow's JPEG 2000
>>>> support is). Another JPEG format which could be used is MJPG, but it is for
>>>> video and converting the frames to pages would be a hackish attempt. On
>>>> the platform side, you can already group images with Mayan using an Index
>>>> or a SmartLink. All the JPEG uploads need is a unique marker (like a
>>>> metadata value or a filename fragment). This can be accomplished via the
>>>> UI and the API. For example, the index template
>>>> {{ document.label|slice:":4" }}
>>>> will group all documents with the same first 4 characters in the name.
>>>> To use a different part of the filename for the grouping, just change the
>>>> slice argument
>>>> (http://www.diveintopython3.net/native-datatypes.html#slicinglists).
>>>>
>>>> On Monday, July 24, 2017 at 1:57:31 PM UTC-4, Florian Beverborg wrote:
>>>>>
>>>>> Hi all!
>>>>>
>>>>> I'm currently evaluating Mayan as a replacement for my current DMS.
>>>>> The documents are all in the JPG format, multiple pages of the same
>>>>> document per folder, scanned at 300dpi. So far adding JPGs does not allow
>>>>> me to create multi-page documents. I used img2pdf to generate multi-page
>>>>> PDFs for import into Mayan, which mostly works fine.
>>>>> BUT: The OCR quality for the same page is worse when using the PDF
>>>>> files.
>>>>>
>>>>> I've tried multiple ways to generate the combined PDF and I can see
>>>>> some differences but never managed to get the same recognition quality as
>>>>> using the pure JPG. Since img2pdf (to my knowledge) does not touch the
>>>>> actual JPG data and since I'm using PDF page size fit to image size, I
>>>>> don't know what's going wrong here. The PDFs look fine in my PDF viewer
>>>>> and are reported to have correct page sizes. Generating the pages with
>>>>> imagemagick does not improve recognition.
>>>>>
>>>>> This leads me to the conclusion that the PDFs are rendered internally,
>>>>> which degrades the quality.
>>>>>
>>>>> I have two questions:
>>>>>
>>>>> 1) What can I do to improve PDF recognition quality, either in
>>>>> generating the PDF or in Mayan settings?
>>>>> 2) Is there another way to make multi-page documents from JPGs? Maybe
>>>>> using the REST-API?
>>>>>
>>>>> Using Mayan version 2.6.2
>>>>>
>>>>> Cheers,
>>>>> Flo
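
The resolution fix discussed in this thread can be tried outside Mayan. pdftoppm renders at 150 dpi unless told otherwise, which would explain why 300 dpi scans wrapped in a PDF reach the OCR step at lower quality than the original JPGs. The sketch below is not Mayan's converter code: it uses the standard `subprocess` module rather than the `sh` wrapper quoted above, and the function names, `scan.pdf`, and the `page` output prefix are hypothetical. It assumes pdftoppm (from poppler-utils) is on the PATH.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the pdftoppm argument list for rendering a PDF to JPEG pages.

    -jpeg selects JPEG output and -r sets the resolution in DPI for both
    axes; these are the same flags used in the one-line change above.
    """
    return ['pdftoppm', '-jpeg', '-r', str(dpi), pdf_path, output_prefix]


def render_pdf(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm if it is installed; return True when it was run."""
    if shutil.which('pdftoppm') is None:
        # poppler-utils is not installed, so there is nothing to run.
        return False
    subprocess.run(
        build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True
    )
    return True


if __name__ == '__main__':
    # Hypothetical file names; replace with a PDF produced by img2pdf.
    print(' '.join(build_pdftoppm_command('scan.pdf', 'page')))
```

Rendering the same img2pdf output once with `dpi=150` and once with `dpi=300`, then running OCR on both sets of pages, should show whether the render resolution accounts for the difference.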
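
The index template `{{ document.label|slice:":4" }}` mentioned above applies the same slicing as Python's `label[:4]`. The following sketch only illustrates that grouping rule in plain Python. It does not call Mayan or its API, and the `group_by_prefix` helper and the file names are made up for the example.

```python
from collections import defaultdict


def group_by_prefix(labels, length=4):
    """Group labels by their first `length` characters, like slice:":4"."""
    groups = defaultdict(list)
    for label in labels:
        groups[label[:length]].append(label)
    return dict(groups)


# Hypothetical scan names: one JPEG per page, prefixed with a document id.
labels = ['0001_page1.jpg', '0001_page2.jpg', '0002_page1.jpg']
for prefix, members in sorted(group_by_prefix(labels).items()):
    print(prefix, members)
```

With a naming scheme like this, every page of one paper document shares a prefix, so a single index node would collect all of its JPEG uploads.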
