Dear Albert, Thanks, thanks, thanks! I was not aware that you're too busy to process for this, please don't say sorry. I appreciate your continuous effort and I apologize that I made you troubled about this.
Regards, mpsuzuki Albert Astals Cid wrote: > El dijous, 8 de febrer de 2018, a les 4:17:33 CET, suzuki toshiya va escriure: >> Dear Albert, >> >>> Can you please try adding some documentation in the cpp file like the >>> other >>> functions have? >> Oh, sorry for overlooking that task. Here it is... > > Commited. > >> Regards, >> mpsuzuki >> >> Albert Astals Cid wrote: >>> El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va >>> >>> escriure: >>>> Dear Albert, >>>> >>>>> As far as i can see we don't use exceptions anywhere in the poppler >>>>> code, >>>>> so i would not introduce them "just for this". >>>> Oh, I'm sorry for my overlooking that. Here is updated patch which checks >>>> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index >>>> is passed. Also I changed the type of the index from int to size_t. >>> Can you please try adding some documentation in the cpp file like the >>> other >>> functions have? >>> >>> Cheers, >>> >>> Albert >>>> Regards, >>>> mpsuzuki >>>> >>>> Albert Astals Cid wrote: >>>>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va >>>>> >>>>> escriure: >>>>>> Dear Albert, >>>>>> >>>>>> Oops, I apologize that I overlooked your response to >>>>>> improve the patch. >>>>>> >>>>>>> You have one text_box where you have 17 char bboxes but the >>>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a >>>>>>> crash >>>>>>> because you get an out of bounds. >>>>>> Indeed, it might be common scenarios for the >>>>>> programmers assuming the input as ASCII data but >>>>>> something different is given. >>>>>> >>>>>>> So it would be nice to make >>>>>>> rectf char_bbox(int i) const; >>>>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>>>> and then probably add some documentation about it? >>>>>> Checking the features of C++ std::vector, I think >>>>>> we have 2 options: >>>>>> >>>>>> a) use (as current patch) m_data->char_bbox[i] and >>>>>> sanitize the i-index before using it (as you proposed), >>>>>> and the bad values cause 0x0 rectf. >>>>>> >>>>>> b) use m_data->char_bbox.at(i) and let stdlib raise >>>>>> std::out_of_range error. >>>>>> >>>>>> which is better for C++ programmers? >>>>> As far as i can see we don't use exceptions anywhere in the poppler >>>>> code, >>>>> so i would not introduce them "just for this". >>>>> >>>>> Cheers, >>>>> >>>>> Albert >>>>> >>>>>>> FWIW this was a real problem in Okular and i fixed it with >>>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pd >>>>>>> f. >>>>>>> cp >>>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>>>> >>>>>>> Am i making any sense here? >>>>>> Thank you very much for the important suggestion and >>>>>> sorry for delayed reponse! >>>>>> >>>>>> Regards, >>>>>> mpsuzuki >>>>>> >>>>>> Albert Astals Cid wrote: >>>>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va >>>>> escriure: >>>>>>>> Hello mpsuzuki, >>>>>>>> >>>>>>>> I do not think you need the two indirections: Since you want to keep >>>>>>>> that m_data indirection for ABI compatibility, just return >>>>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its >>>>>>>> data with the additional getter methods. (You just need to make sure >>>>>>>> to >>>>>>>> default the move constructor and assignment operator since no copying >>>>>>>> implies no moving by default.) >>>>>>>> >>>>>>>> Attached is an update patch along those lines. (I also took the >>>>>>>> liberty >>>>>>>> to replace some bare deletes by use of unique_ptr in the >>>>>>>> implementation >>>>>>>> of page::text_list as well.) >>>>>>> One thing that i wanted to improve in the Qt5 frontend but never go >>>>>>> to, >>>>>>> maybe we can start here is the fact that the number of boxes is "a bit >>>>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189 >>>>>>> >>>>>>> You have one text_box where you have 17 char bboxes but the >>>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a >>>>>>> crash >>>>>>> because you get an out of bounds. >>>>>>> >>>>>>> So it would be nice to make >>>>>>> rectf char_bbox(int i) const; >>>>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>>>> and then probably add some documentation about it? >>>>>>> >>>>>>> FWIW this was a real problem in Okular and i fixed it with >>>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pd >>>>>>> f. >>>>>>> cp >>>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>>>> >>>>>>> Am i making any sense here? >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Albert >>>>>>>> Best regards, Adam. >>>>>>>> >>>>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: >>>>>>>>> Dear Adam, >>>>>>>>> >>>>>>>>> Thank you for the technical information on the difference of >>>>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace >>>>>>>>> the vector of the pointers to that of the unique_ptrs, and >>>>>>>>> remove the next_xxx methods. >>>>>>>>> >>>>>>>>> I attached the revised patch (gzipped) to the master, and >>>>>>>>> here I quote the revised part from my last revision, for >>>>>>>>> convenience to take a look on it. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> mpsuzuki >>>>>>>>> >>>>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h >>>>>>>>> index 726068a..2cedd1e 100644 >>>>>>>>> --- a/cpp/poppler-private.h >>>>>>>>> +++ b/cpp/poppler-private.h >>>>>>>>> @@ -72,12 +72,11 @@ class text_box_data >>>>>>>>> >>>>>>>>> { >>>>>>>>> >>>>>>>>> public: >>>>>>>>> text_box_data() >>>>>>>>> >>>>>>>>> - : next_text_box(0), has_space_after(false) >>>>>>>>> + : has_space_after(false) >>>>>>>>> >>>>>>>>> { >>>>>>>>> } >>>>>>>>> ustring text; >>>>>>>>> rectf bbox; >>>>>>>>> >>>>>>>>> - text_box *next_text_box; >>>>>>>>> >>>>>>>>> std::vector<rectf> char_bboxes; >>>>>>>>> bool has_space_after; >>>>>>>>> >>>>>>>>> }; >>>>>>>>> >>>>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h >>>>>>>>> index 18af213..3013997 100644 >>>>>>>>> --- a/cpp/poppler-page.h >>>>>>>>> +++ b/cpp/poppler-page.h >>>>>>>>> @@ -22,6 +22,8 @@ >>>>>>>>> >>>>>>>>> #include "poppler-global.h" >>>>>>>>> #include "poppler-rectangle.h" >>>>>>>>> >>>>>>>>> +#include <memory> >>>>>>>>> + >>>>>>>>> >>>>>>>>> namespace poppler >>>>>>>>> { >>>>>>>>> >>>>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { >>>>>>>>> >>>>>>>>> ~text_box(); >>>>>>>>> ustring text() const; >>>>>>>>> rectf bbox() const; >>>>>>>>> >>>>>>>>> - text_box *next_text_box() const; >>>>>>>>> - text_box *next_word() { return this->next_text_box(); }; >>>>>>>>> >>>>>>>>> rectf char_bbox(int i) const; >>>>>>>>> bool has_space_after() const; >>>>>>>>> >>>>>>>>> private: >>>>>>>>> - text_box_data* m_data; >>>>>>>>> + std::unique_ptr<text_box_data> m_data; >>>>>>>>> >>>>>>>>> }; >>>>>>>>> >>>>>>>>> class document; >>>>>>>>> >>>>>>>>> @@ -79,7 +79,7 @@ public: >>>>>>>>> ustring text(const rectf &rect = rectf()) const; >>>>>>>>> ustring text(const rectf &rect, text_layout_enum layout_mode) >>>>>>>>> const; >>>>>>>>> >>>>>>>>> - std::vector<text_box*> text_list(rotation_enum rotation) const; >>>>>>>>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum >>>>>>>>> rotation) const;> >>>>>>>>> >>>>>>>>> private: >>>>>>>>> page(document_private *doc, int index); >>>>>>>>> >>>>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp >>>>>>>>> index 9461ab9..5da2e1d 100644 >>>>>>>>> --- a/cpp/poppler-page.cpp >>>>>>>>> +++ b/cpp/poppler-page.cpp >>>>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, >>>>>>>>> text_layout_enum >>>>>>>>> layout_mode) const >>>>>>>>> >>>>>>>>> */ >>>>>>>>> >>>>>>>>> text_box::text_box(const ustring& text, const rectf &bbox) >>>>>>>>> { >>>>>>>>> >>>>>>>>> - m_data = new text_box_data(); >>>>>>>>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); >>>>>>>>> >>>>>>>>> m_data->text = text; >>>>>>>>> m_data->bbox = bbox; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> text_box::~text_box() >>>>>>>>> { >>>>>>>>> >>>>>>>>> - delete m_data; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> ustring text_box::text() const >>>>>>>>> >>>>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const >>>>>>>>> >>>>>>>>> return m_data->bbox; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> -text_box* text_box::next_text_box() const >>>>>>>>> -{ >>>>>>>>> - return m_data->next_text_box; >>>>>>>>> -} >>>>>>>>> - >>>>>>>>> >>>>>>>>> rectf text_box::char_bbox(int i) const >>>>>>>>> { >>>>>>>>> >>>>>>>>> return m_data->char_bboxes[i]; >>>>>>>>> >>>>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const >>>>>>>>> >>>>>>>>> return m_data->has_space_after; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const >>>>>>>>> +std::vector<std::unique_ptr<text_box>> >>>>>>>>> page::text_list(rotation_enum >>>>>>>>> rotate) const> >>>>>>>>> >>>>>>>>> { >>>>>>>>> >>>>>>>>> TextOutputDev *output_dev; >>>>>>>>> >>>>>>>>> - std::vector<text_box*> output_list; >>>>>>>>> + std::vector<std::unique_ptr<text_box>> output_list; >>>>>>>>> >>>>>>>>> const int rotation_value = (int)rotate * 90; >>>>>>>>> >>>>>>>>> /* config values are same with Qt5 Page::TextList() */ >>>>>>>>> >>>>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*> >>>>>>>>> page::text_list(rotation_enum >>>>>>>>> rotate) const >>>>>>>>> >>>>>>>>> double xMin, yMin, xMax, yMax; >>>>>>>>> word->getBBox(&xMin, &yMin, &xMax, &yMax); >>>>>>>>> >>>>>>>>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>>>>>>>> yMax-yMin)); + std::unique_ptr<text_box> tb = >>>>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, >>>>>>>>> xMax-xMin, >>>>>>>>> yMax-yMin))); >>>>>>>>> >>>>>>>>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); >>>>>>>>> >>>>>>>>> tb->m_data->char_bboxes.reserve(word->getLength()); >>>>>>>>> >>>>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*> >>>>>>>>> page::text_list(rotation_enum >>>>>>>>> rotate) const >>>>>>>>> >>>>>>>>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax- >>> xMin, >>> >>>>>>>>> yMax-yMin)); >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> - if (output_list.size() > 0) >>>>>>>>> - output_list.back()->m_data->next_text_box = tb; >>>>>>>>> - >>>>>>>>> - output_list.push_back(tb); >>>>>>>>> + output_list.push_back(std::move(tb)); >>>>>>>>> >>>>>>>>> } >>>>>>>>> delete word_list; >>>>>>>>> delete output_dev; >>>>>>>>> >>>>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp >>>>>>>>> index 09b180d..d26a0ef 100644 >>>>>>>>> --- a/cpp/tests/poppler-dump.cpp >>>>>>>>> +++ b/cpp/tests/poppler-dump.cpp >>>>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page >>>>>>>>> *p) >>>>>>>>> >>>>>>>>> std::cout << std::endl; >>>>>>>>> return; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> - std::vector<poppler::text_box*> text_list = >>>>>>>>> p->text_list(poppler::rotate_0); + >>>>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list = >>>>>>>>> p->text_list(poppler::rotate_0); >>>>>>>>> >>>>>>>>> std::cout << "---" << std::endl; >>>>>>>>> for (size_t i = 0; i < text_list.size(); i ++) { >>>>>>>>> >>>>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page >>>>>>>>> *p) >>>>>>>>> >>>>>>>>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " >>>>>>>>> w=" >>>>>>>>> << >>>>>>>>> >>>>>>>>> bbox.width() << " h=" << bbox.height() << " )"; >>>>>>>>> >>>>>>>>> std::cout << std::endl; >>>>>>>>> >>>>>>>>> - delete text_list[i]; >>>>>>>>> >>>>>>>>> } >>>>>>>>> std::cout << "---" << std::endl; >>>>>>>>> >>>>>>>>> } >>>>>>>>> >>>>>>>>> Adam Reichold wrote: >>>>>>>>>> Hello mpsuzuki, >>>>>>>>>> >>>>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 >>>>>>>>>> only >>>>>>>>>> since a relatively short time, so I guess there just was no time to >>>>>>>>>> use >>>>>>>>>> it anywhere. >>>>>>>>>> >>>>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic >>>>>>>>>> memory accesses on the reference counter which will slow down even >>>>>>>>>> single-threaded programs which do not need it. Hence I would >>>>>>>>>> suggest >>>>>>>>>> starting with std::unique_ptr to get a move-only type. >>>>>>>>>> >>>>>>>>>> Best regards, Adam. >>>>>>>>>> >>>>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: >>>>>>>>>>> Dear Albert, Adam, >>>>>>>>>>> >>>>>>>>>>> Thank you for the helpful review. I should have more >>>>>>>>>>> training on C++... Please let me spend to improve the >>>>>>>>>>> patch. >>>>>>>>>>> >>>>>>>>>>>> (Another easy option albeit with some >>>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>>>> data >>>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>>>> Yeah, from some posts in stackoverflow, I found about >>>>>>>>>>> shared_ptr... But it seems that current poppler does >>>>>>>>>>> not use shared_ptr at all. Is this a coding policy? >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> mpsuzuki >>>>>>>>>>> >>>>>>>>>>> Adam Reichold wrote: >>>>>>>>>>>> Hello >>>>>>>>>>>> >>>>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: >>>>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki >>>>>>>>>>>>> toshiya >>>>>>>>>>>>> va >>>>>>>>>>>>> >>>>>>>>>>>>> escriure: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I've tried to implement the suggestion, I attached my current >>>>>>>>>>>>>> patch. >>>>>>>>>>>>>> >>>>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and >>>>>>>>>>>>>> renamed, >>>>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly >>>>>>>>>>>>>> confusing, >>>>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote >>>>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which >>>>>>>>>>>>>> calls next_text_box() internally. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Another point I want to discuss is the design of the list give >>>>>>>>>>>>>> by >>>>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() >>>>>>>>>>>>>> returns >>>>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns >>>>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend. >>>>>>>>>>>>>> >>>>>>>>>>>>>> But, if we return the vector of pointers, the client should >>>>>>>>>>>>>> destruct >>>>>>>>>>>>>> the objects pointed by the vector, before destructing vector >>>>>>>>>>>>>> itself. >>>>>>>>>>>>>> Using a vector of text_box (not the pointer but the object >>>>>>>>>>>>>> itself), >>>>>>>>>>>>>> like std::vector<text_box>, could be better, because the >>>>>>>>>>>>>> destructor >>>>>>>>>>>>>> of the vector would internally call the destructor for text_box >>>>>>>>>>>>>> object. >>>>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have >>>>>>>>>>>>>> such). >>>>>>>>>>>>>> If I'm misunderstanding about C++, please correct. >>>>>>>>>>>>> First you need to fix the text_box class, you either need to >>>>>>>>>>>>> make >>>>>>>>>>>>> it >>>>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you >>>>>>>>>>>>> need >>>>>>>>>>>>> to make the copy and assignment operators not break your class >>>>>>>>>>>>> since >>>>>>>>>>>>> at the moment the default operators will just copy the pointer >>>>>>>>>>>>> and >>>>>>>>>>>>> everything will break because you will end up with a dangling >>>>>>>>>>>>> pointer. >>>>>>>>>>>>> >>>>>>>>>>>>> If you go for the second option you'll want the move constructor >>>>>>>>>>>>> and >>>>>>>>>>>>> move >>>>>>>>>>>>> operator too. >>>>>>>>>>>> Should the first option also be viable if text_box becomes a >>>>>>>>>>>> move-only >>>>>>>>>>>> type, i.e. no copying but move construction and assignment? I >>>>>>>>>>>> think >>>>>>>>>>>> std::vector can handle it. (Another easy option albeit with some >>>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>>>> data >>>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>>>>> >>>>>>>>>>>> Best regards, Adam. >>>>>>>>>>>> >>>>>>>>>>>>> Cheers, >>>>>>>>>>>>> >>>>>>>>>>>>> Albert >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> mpsuzuki >>>>>>>>>>>>>> >>>>>>>>>>>>>> Albert Astals Cid wrote: >>>>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, >>>>>>>>>>>>>>> Jeroen >>>>>>>>>>>>>>> Ooms >>>>>>>>>>>>>>> va >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> escriure: >>>>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf >>>>>>>>>>>>>>>> document, >>>>>>>>>>>>>>>> including the position of each text box? Currently we use >>>>>>>>>>>>>>>> page->text() >>>>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but >>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>> need >>>>>>>>>>>>>>>> more detailed information about each text box per page. >>>>>>>>>>>>>>> You want to code the variant of qt5 frontend >>>>>>>>>>>>>>> Poppler::Page::textList() for >>>>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration >>>>>>>>>>>>>>> (i.e. >>>>>>>>>>>>>>> almost-copying) the code, do you have time for it? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Albert >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>> _______________________________________________ >>> poppler mailing list >>> poppler@lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/poppler > > > > > _______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler