Dear Albert, > Can you please try adding some documentation in the cpp file like the other > functions have?
Oh, sorry for overlooking that task. Here it is... Regards, mpsuzuki Albert Astals Cid wrote: > El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va > escriure: >> Dear Albert, >> >>> As far as i can see we don't use exceptions anywhere in the poppler code, >>> so i would not introduce them "just for this". >> Oh, I'm sorry for my overlooking that. Here is updated patch which checks >> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index >> is passed. Also I changed the type of the index from int to size_t. > > Can you please try adding some documentation in the cpp file like the other > functions have? > > Cheers, > Albert > >> Regards, >> mpsuzuki >> >> Albert Astals Cid wrote: >>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va >>> >>> escriure: >>>> Dear Albert, >>>> >>>> Oops, I apologize that I overlooked your response to >>>> improve the patch. >>>> >>>>> You have one text_box where you have 17 char bboxes but the >>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>>>> because you get an out of bounds. >>>> Indeed, it might be common scenarios for the >>>> programmers assuming the input as ASCII data but >>>> something different is given. >>>> >>>>> So it would be nice to make >>>>> rectf char_bbox(int i) const; >>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>> and then probably add some documentation about it? >>>> Checking the features of C++ std::vector, I think >>>> we have 2 options: >>>> >>>> a) use (as current patch) m_data->char_bbox[i] and >>>> sanitize the i-index before using it (as you proposed), >>>> and the bad values cause 0x0 rectf. >>>> >>>> b) use m_data->char_bbox.at(i) and let stdlib raise >>>> std::out_of_range error. >>>> >>>> which is better for C++ programmers? >>> As far as i can see we don't use exceptions anywhere in the poppler code, >>> so i would not introduce them "just for this". >>> >>> Cheers, >>> >>> Albert >>> >>>>> FWIW this was a real problem in Okular and i fixed it with >>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf. >>>>> cp >>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>> >>>>> Am i making any sense here? >>>> Thank you very much for the important suggestion and >>>> sorry for delayed reponse! >>>> >>>> Regards, >>>> mpsuzuki >>>> >>>> Albert Astals Cid wrote: >>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va >>> escriure: >>>>>> Hello mpsuzuki, >>>>>> >>>>>> I do not think you need the two indirections: Since you want to keep >>>>>> that m_data indirection for ABI compatibility, just return >>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its >>>>>> data with the additional getter methods. (You just need to make sure to >>>>>> default the move constructor and assignment operator since no copying >>>>>> implies no moving by default.) >>>>>> >>>>>> Attached is an update patch along those lines. (I also took the liberty >>>>>> to replace some bare deletes by use of unique_ptr in the implementation >>>>>> of page::text_list as well.) >>>>> One thing that i wanted to improve in the Qt5 frontend but never go to, >>>>> maybe we can start here is the fact that the number of boxes is "a bit >>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189 >>>>> >>>>> You have one text_box where you have 17 char bboxes but the >>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>>>> because you get an out of bounds. >>>>> >>>>> So it would be nice to make >>>>> rectf char_bbox(int i) const; >>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>> and then probably add some documentation about it? >>>>> >>>>> FWIW this was a real problem in Okular and i fixed it with >>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf. >>>>> cp >>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>> >>>>> Am i making any sense here? >>>>> >>>>> Cheers, >>>>> >>>>> Albert >>>>>> Best regards, Adam. >>>>>> >>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: >>>>>>> Dear Adam, >>>>>>> >>>>>>> Thank you for the technical information on the difference of >>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace >>>>>>> the vector of the pointers to that of the unique_ptrs, and >>>>>>> remove the next_xxx methods. >>>>>>> >>>>>>> I attached the revised patch (gzipped) to the master, and >>>>>>> here I quote the revised part from my last revision, for >>>>>>> convenience to take a look on it. >>>>>>> >>>>>>> Regards, >>>>>>> mpsuzuki >>>>>>> >>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h >>>>>>> index 726068a..2cedd1e 100644 >>>>>>> --- a/cpp/poppler-private.h >>>>>>> +++ b/cpp/poppler-private.h >>>>>>> @@ -72,12 +72,11 @@ class text_box_data >>>>>>> >>>>>>> { >>>>>>> >>>>>>> public: >>>>>>> text_box_data() >>>>>>> >>>>>>> - : next_text_box(0), has_space_after(false) >>>>>>> + : has_space_after(false) >>>>>>> >>>>>>> { >>>>>>> } >>>>>>> ustring text; >>>>>>> rectf bbox; >>>>>>> >>>>>>> - text_box *next_text_box; >>>>>>> >>>>>>> std::vector<rectf> char_bboxes; >>>>>>> bool has_space_after; >>>>>>> >>>>>>> }; >>>>>>> >>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h >>>>>>> index 18af213..3013997 100644 >>>>>>> --- a/cpp/poppler-page.h >>>>>>> +++ b/cpp/poppler-page.h >>>>>>> @@ -22,6 +22,8 @@ >>>>>>> >>>>>>> #include "poppler-global.h" >>>>>>> #include "poppler-rectangle.h" >>>>>>> >>>>>>> +#include <memory> >>>>>>> + >>>>>>> >>>>>>> namespace poppler >>>>>>> { >>>>>>> >>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { >>>>>>> >>>>>>> ~text_box(); >>>>>>> ustring text() const; >>>>>>> rectf bbox() const; >>>>>>> >>>>>>> - text_box *next_text_box() const; >>>>>>> - text_box *next_word() { return this->next_text_box(); }; >>>>>>> >>>>>>> rectf char_bbox(int i) const; >>>>>>> bool has_space_after() const; >>>>>>> >>>>>>> private: >>>>>>> - text_box_data* m_data; >>>>>>> + std::unique_ptr<text_box_data> m_data; >>>>>>> >>>>>>> }; >>>>>>> >>>>>>> class document; >>>>>>> >>>>>>> @@ -79,7 +79,7 @@ public: >>>>>>> ustring text(const rectf &rect = rectf()) const; >>>>>>> ustring text(const rectf &rect, text_layout_enum layout_mode) >>>>>>> const; >>>>>>> >>>>>>> - std::vector<text_box*> text_list(rotation_enum rotation) const; >>>>>>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum >>>>>>> rotation) const;> >>>>>>> >>>>>>> private: >>>>>>> page(document_private *doc, int index); >>>>>>> >>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp >>>>>>> index 9461ab9..5da2e1d 100644 >>>>>>> --- a/cpp/poppler-page.cpp >>>>>>> +++ b/cpp/poppler-page.cpp >>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, >>>>>>> text_layout_enum >>>>>>> layout_mode) const >>>>>>> >>>>>>> */ >>>>>>> >>>>>>> text_box::text_box(const ustring& text, const rectf &bbox) >>>>>>> { >>>>>>> >>>>>>> - m_data = new text_box_data(); >>>>>>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); >>>>>>> >>>>>>> m_data->text = text; >>>>>>> m_data->bbox = bbox; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> text_box::~text_box() >>>>>>> { >>>>>>> >>>>>>> - delete m_data; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> ustring text_box::text() const >>>>>>> >>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const >>>>>>> >>>>>>> return m_data->bbox; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> -text_box* text_box::next_text_box() const >>>>>>> -{ >>>>>>> - return m_data->next_text_box; >>>>>>> -} >>>>>>> - >>>>>>> >>>>>>> rectf text_box::char_bbox(int i) const >>>>>>> { >>>>>>> >>>>>>> return m_data->char_bboxes[i]; >>>>>>> >>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const >>>>>>> >>>>>>> return m_data->has_space_after; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const >>>>>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum >>>>>>> rotate) const> >>>>>>> >>>>>>> { >>>>>>> >>>>>>> TextOutputDev *output_dev; >>>>>>> >>>>>>> - std::vector<text_box*> output_list; >>>>>>> + std::vector<std::unique_ptr<text_box>> output_list; >>>>>>> >>>>>>> const int rotation_value = (int)rotate * 90; >>>>>>> >>>>>>> /* config values are same with Qt5 Page::TextList() */ >>>>>>> >>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*> >>>>>>> page::text_list(rotation_enum >>>>>>> rotate) const >>>>>>> >>>>>>> double xMin, yMin, xMax, yMax; >>>>>>> word->getBBox(&xMin, &yMin, &xMax, &yMax); >>>>>>> >>>>>>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>>>>>> yMax-yMin)); + std::unique_ptr<text_box> tb = >>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, >>>>>>> xMax-xMin, >>>>>>> yMax-yMin))); >>>>>>> >>>>>>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); >>>>>>> >>>>>>> tb->m_data->char_bboxes.reserve(word->getLength()); >>>>>>> >>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*> >>>>>>> page::text_list(rotation_enum >>>>>>> rotate) const >>>>>>> >>>>>>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax- > xMin, >>>>>>> yMax-yMin)); >>>>>>> >>>>>>> } >>>>>>> >>>>>>> - if (output_list.size() > 0) >>>>>>> - output_list.back()->m_data->next_text_box = tb; >>>>>>> - >>>>>>> - output_list.push_back(tb); >>>>>>> + output_list.push_back(std::move(tb)); >>>>>>> >>>>>>> } >>>>>>> delete word_list; >>>>>>> delete output_dev; >>>>>>> >>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp >>>>>>> index 09b180d..d26a0ef 100644 >>>>>>> --- a/cpp/tests/poppler-dump.cpp >>>>>>> +++ b/cpp/tests/poppler-dump.cpp >>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p) >>>>>>> >>>>>>> std::cout << std::endl; >>>>>>> return; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> - std::vector<poppler::text_box*> text_list = >>>>>>> p->text_list(poppler::rotate_0); + >>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list = >>>>>>> p->text_list(poppler::rotate_0); >>>>>>> >>>>>>> std::cout << "---" << std::endl; >>>>>>> for (size_t i = 0; i < text_list.size(); i ++) { >>>>>>> >>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p) >>>>>>> >>>>>>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" >>>>>>> << >>>>>>> >>>>>>> bbox.width() << " h=" << bbox.height() << " )"; >>>>>>> >>>>>>> std::cout << std::endl; >>>>>>> >>>>>>> - delete text_list[i]; >>>>>>> >>>>>>> } >>>>>>> std::cout << "---" << std::endl; >>>>>>> >>>>>>> } >>>>>>> >>>>>>> Adam Reichold wrote: >>>>>>>> Hello mpsuzuki, >>>>>>>> >>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 >>>>>>>> only >>>>>>>> since a relatively short time, so I guess there just was no time to >>>>>>>> use >>>>>>>> it anywhere. >>>>>>>> >>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic >>>>>>>> memory accesses on the reference counter which will slow down even >>>>>>>> single-threaded programs which do not need it. Hence I would suggest >>>>>>>> starting with std::unique_ptr to get a move-only type. >>>>>>>> >>>>>>>> Best regards, Adam. >>>>>>>> >>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: >>>>>>>>> Dear Albert, Adam, >>>>>>>>> >>>>>>>>> Thank you for the helpful review. I should have more >>>>>>>>> training on C++... Please let me spend to improve the >>>>>>>>> patch. >>>>>>>>> >>>>>>>>>> (Another easy option albeit with some >>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>> data >>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>> Yeah, from some posts in stackoverflow, I found about >>>>>>>>> shared_ptr... But it seems that current poppler does >>>>>>>>> not use shared_ptr at all. Is this a coding policy? >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> mpsuzuki >>>>>>>>> >>>>>>>>> Adam Reichold wrote: >>>>>>>>>> Hello >>>>>>>>>> >>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: >>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki >>>>>>>>>>> toshiya >>>>>>>>>>> va >>>>>>>>>>> >>>>>>>>>>> escriure: >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I've tried to implement the suggestion, I attached my current >>>>>>>>>>>> patch. >>>>>>>>>>>> >>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and >>>>>>>>>>>> renamed, >>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing, >>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote >>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which >>>>>>>>>>>> calls next_text_box() internally. >>>>>>>>>>>> >>>>>>>>>>>> Another point I want to discuss is the design of the list give by >>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() >>>>>>>>>>>> returns >>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns >>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend. >>>>>>>>>>>> >>>>>>>>>>>> But, if we return the vector of pointers, the client should >>>>>>>>>>>> destruct >>>>>>>>>>>> the objects pointed by the vector, before destructing vector >>>>>>>>>>>> itself. >>>>>>>>>>>> Using a vector of text_box (not the pointer but the object >>>>>>>>>>>> itself), >>>>>>>>>>>> like std::vector<text_box>, could be better, because the >>>>>>>>>>>> destructor >>>>>>>>>>>> of the vector would internally call the destructor for text_box >>>>>>>>>>>> object. >>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have >>>>>>>>>>>> such). >>>>>>>>>>>> If I'm misunderstanding about C++, please correct. >>>>>>>>>>> First you need to fix the text_box class, you either need to make >>>>>>>>>>> it >>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you >>>>>>>>>>> need >>>>>>>>>>> to make the copy and assignment operators not break your class >>>>>>>>>>> since >>>>>>>>>>> at the moment the default operators will just copy the pointer and >>>>>>>>>>> everything will break because you will end up with a dangling >>>>>>>>>>> pointer. >>>>>>>>>>> >>>>>>>>>>> If you go for the second option you'll want the move constructor >>>>>>>>>>> and >>>>>>>>>>> move >>>>>>>>>>> operator too. >>>>>>>>>> Should the first option also be viable if text_box becomes a >>>>>>>>>> move-only >>>>>>>>>> type, i.e. no copying but move construction and assignment? I think >>>>>>>>>> std::vector can handle it. (Another easy option albeit with some >>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>> data >>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>>> >>>>>>>>>> Best regards, Adam. >>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> >>>>>>>>>>> Albert >>>>>>>>>>>> Regards, >>>>>>>>>>>> mpsuzuki >>>>>>>>>>>> >>>>>>>>>>>> Albert Astals Cid wrote: >>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen >>>>>>>>>>>>> Ooms >>>>>>>>>>>>> va >>>>>>>>>>>>> >>>>>>>>>>>>> escriure: >>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf >>>>>>>>>>>>>> document, >>>>>>>>>>>>>> including the position of each text box? Currently we use >>>>>>>>>>>>>> page->text() >>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but I >>>>>>>>>>>>>> need >>>>>>>>>>>>>> more detailed information about each text box per page. >>>>>>>>>>>>> You want to code the variant of qt5 frontend >>>>>>>>>>>>> Poppler::Page::textList() for >>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration >>>>>>>>>>>>> (i.e. >>>>>>>>>>>>> almost-copying) the code, do you have time for it? >>>>>>>>>>>>> >>>>>>>>>>>>> Cheers, >>>>>>>>>>>>> >>>>>>>>>>>>> Albert >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> poppler mailing list >>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > > > > > _______________________________________________ > poppler mailing list > poppler@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/poppler
add-cpp-text-list_20180208-1215.diff.xz
Description: Binary data
_______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler