Dear Albert, Oops, I apologize that I overlooked your response to improve the patch.
> You have one text_box where you have 17 char bboxes but the > ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash > because you get an out of bounds. Indeed, it might be common scenarios for the programmers assuming the input as ASCII data but something different is given. > So it would be nice to make > rectf char_bbox(int i) const; > not crash on an invalid i, i guess return a 0x0 rect > and then probably add some documentation about it? Checking the features of C++ std::vector, I think we have 2 options: a) use (as current patch) m_data->char_bbox[i] and sanitize the i-index before using it (as you proposed), and the bad values cause 0x0 rectf. b) use m_data->char_bbox.at(i) and let stdlib raise std::out_of_range error. which is better for C++ programmers? > FWIW this was a real problem in Okular and i fixed it with > https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cpp?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > > Am i making any sense here? Thank you very much for the important suggestion and sorry for delayed reponse! Regards, mpsuzuki Albert Astals Cid wrote: > El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va > escriure: >> Hello mpsuzuki, >> >> I do not think you need the two indirections: Since you want to keep >> that m_data indirection for ABI compatibility, just return >> std::vector<text_box> as text_box basically "is" a unique_ptr to its >> data with the additional getter methods. (You just need to make sure to >> default the move constructor and assignment operator since no copying >> implies no moving by default.) >> >> Attached is an update patch along those lines. (I also took the liberty >> to replace some bare deletes by use of unique_ptr in the implementation >> of page::text_list as well.) > > One thing that i wanted to improve in the Qt5 frontend but never go to, maybe > we can start here is the fact that the number of boxes is "a bit random", for > example take https://bugs.kde.org/attachment.cgi?id=66189 > > You have one text_box where you have 17 char bboxes but the > ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash > because you get an out of bounds. > > So it would be nice to make > rectf char_bbox(int i) const; > not crash on an invalid i, i guess return a 0x0 rect > and then probably add some documentation about it? > > FWIW this was a real problem in Okular and i fixed it with > https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cpp?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > > Am i making any sense here? > > Cheers, > Albert > >> Best regards, Adam. >> >> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: >>> Dear Adam, >>> >>> Thank you for the technical information on the difference of >>> std::shared_ptr and std::unique_ptr. Just I've tried to replace >>> the vector of the pointers to that of the unique_ptrs, and >>> remove the next_xxx methods. >>> >>> I attached the revised patch (gzipped) to the master, and >>> here I quote the revised part from my last revision, for >>> convenience to take a look on it. >>> >>> Regards, >>> mpsuzuki >>> >>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h >>> index 726068a..2cedd1e 100644 >>> --- a/cpp/poppler-private.h >>> +++ b/cpp/poppler-private.h >>> @@ -72,12 +72,11 @@ class text_box_data >>> >>> { >>> >>> public: >>> text_box_data() >>> >>> - : next_text_box(0), has_space_after(false) >>> + : has_space_after(false) >>> >>> { >>> } >>> ustring text; >>> rectf bbox; >>> >>> - text_box *next_text_box; >>> >>> std::vector<rectf> char_bboxes; >>> bool has_space_after; >>> >>> }; >>> >>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h >>> index 18af213..3013997 100644 >>> --- a/cpp/poppler-page.h >>> +++ b/cpp/poppler-page.h >>> @@ -22,6 +22,8 @@ >>> >>> #include "poppler-global.h" >>> #include "poppler-rectangle.h" >>> >>> +#include <memory> >>> + >>> >>> namespace poppler >>> { >>> >>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { >>> >>> ~text_box(); >>> ustring text() const; >>> rectf bbox() const; >>> >>> - text_box *next_text_box() const; >>> - text_box *next_word() { return this->next_text_box(); }; >>> >>> rectf char_bbox(int i) const; >>> bool has_space_after() const; >>> >>> private: >>> - text_box_data* m_data; >>> + std::unique_ptr<text_box_data> m_data; >>> >>> }; >>> >>> class document; >>> >>> @@ -79,7 +79,7 @@ public: >>> ustring text(const rectf &rect = rectf()) const; >>> ustring text(const rectf &rect, text_layout_enum layout_mode) const; >>> >>> - std::vector<text_box*> text_list(rotation_enum rotation) const; >>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum >>> rotation) const;> >>> private: >>> page(document_private *doc, int index); >>> >>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp >>> index 9461ab9..5da2e1d 100644 >>> --- a/cpp/poppler-page.cpp >>> +++ b/cpp/poppler-page.cpp >>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, text_layout_enum >>> layout_mode) const >>> >>> */ >>> >>> text_box::text_box(const ustring& text, const rectf &bbox) >>> { >>> >>> - m_data = new text_box_data(); >>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); >>> >>> m_data->text = text; >>> m_data->bbox = bbox; >>> >>> } >>> >>> text_box::~text_box() >>> { >>> >>> - delete m_data; >>> >>> } >>> >>> ustring text_box::text() const >>> >>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const >>> >>> return m_data->bbox; >>> >>> } >>> >>> -text_box* text_box::next_text_box() const >>> -{ >>> - return m_data->next_text_box; >>> -} >>> - >>> >>> rectf text_box::char_bbox(int i) const >>> { >>> >>> return m_data->char_bboxes[i]; >>> >>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const >>> >>> return m_data->has_space_after; >>> >>> } >>> >>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const >>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum >>> rotate) const> >>> { >>> >>> TextOutputDev *output_dev; >>> >>> - std::vector<text_box*> output_list; >>> + std::vector<std::unique_ptr<text_box>> output_list; >>> >>> const int rotation_value = (int)rotate * 90; >>> >>> /* config values are same with Qt5 Page::TextList() */ >>> >>> @@ -366,7 +360,7 @@ std::vector<text_box*> page::text_list(rotation_enum >>> rotate) const >>> >>> double xMin, yMin, xMax, yMax; >>> word->getBBox(&xMin, &yMin, &xMax, &yMax); >>> >>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>> yMax-yMin)); + std::unique_ptr<text_box> tb = >>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>> yMax-yMin))); >>> >>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); >>> >>> tb->m_data->char_bboxes.reserve(word->getLength()); >>> >>> @@ -375,10 +369,7 @@ std::vector<text_box*> page::text_list(rotation_enum >>> rotate) const >>> >>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-xMin, >>> yMax-yMin)); >>> >>> } >>> >>> - if (output_list.size() > 0) >>> - output_list.back()->m_data->next_text_box = tb; >>> - >>> - output_list.push_back(tb); >>> + output_list.push_back(std::move(tb)); >>> >>> } >>> delete word_list; >>> delete output_dev; >>> >>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp >>> index 09b180d..d26a0ef 100644 >>> --- a/cpp/tests/poppler-dump.cpp >>> +++ b/cpp/tests/poppler-dump.cpp >>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p) >>> >>> std::cout << std::endl; >>> return; >>> >>> } >>> >>> - std::vector<poppler::text_box*> text_list = >>> p->text_list(poppler::rotate_0); + >>> std::vector<std::unique_ptr<poppler::text_box>> text_list = >>> p->text_list(poppler::rotate_0); >>> >>> std::cout << "---" << std::endl; >>> for (size_t i = 0; i < text_list.size(); i ++) { >>> >>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p) >>> >>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" << >>> >>> bbox.width() << " h=" << bbox.height() << " )"; >>> >>> std::cout << std::endl; >>> >>> - delete text_list[i]; >>> >>> } >>> std::cout << "---" << std::endl; >>> >>> } >>> >>> Adam Reichold wrote: >>>> Hello mpsuzuki, >>>> >>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 only >>>> since a relatively short time, so I guess there just was no time to use >>>> it anywhere. >>>> >>>> In any case it is a pretty heavy-handed solution due to the atomic >>>> memory accesses on the reference counter which will slow down even >>>> single-threaded programs which do not need it. Hence I would suggest >>>> starting with std::unique_ptr to get a move-only type. >>>> >>>> Best regards, Adam. >>>> >>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: >>>>> Dear Albert, Adam, >>>>> >>>>> Thank you for the helpful review. I should have more >>>>> training on C++... Please let me spend to improve the >>>>> patch. >>>>> >>>>>> (Another easy option albeit with some >>>>>> overhead to get automatic destruction would be to wrap the actual data >>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>> Yeah, from some posts in stackoverflow, I found about >>>>> shared_ptr... But it seems that current poppler does >>>>> not use shared_ptr at all. Is this a coding policy? >>>>> >>>>> Regards, >>>>> mpsuzuki >>>>> >>>>> Adam Reichold wrote: >>>>>> Hello >>>>>> >>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: >>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki toshiya >>>>>>> va >>>>>>> >>>>>>> escriure: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I've tried to implement the suggestion, I attached my current patch. >>>>>>>> >>>>>>>> As suggested, the most part is just copied from Qt frontend and >>>>>>>> renamed, >>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing, >>>>>>>> because the returned object is a pointer to TextBox. I wrote >>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which >>>>>>>> calls next_text_box() internally. >>>>>>>> >>>>>>>> Another point I want to discuss is the design of the list give by >>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() returns >>>>>>>> QList<TextBox*>. For similarity, current patch returns >>>>>>>> std::vector<text_box*> for similarity to Qt frontend. >>>>>>>> >>>>>>>> But, if we return the vector of pointers, the client should destruct >>>>>>>> the objects pointed by the vector, before destructing vector itself. >>>>>>>> Using a vector of text_box (not the pointer but the object itself), >>>>>>>> like std::vector<text_box>, could be better, because the destructor >>>>>>>> of the vector would internally call the destructor for text_box >>>>>>>> object. >>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have such). >>>>>>>> If I'm misunderstanding about C++, please correct. >>>>>>> First you need to fix the text_box class, you either need to make it >>>>>>> uncopiable (and thus making your suggestion not possible) or you need >>>>>>> to make the copy and assignment operators not break your class since >>>>>>> at the moment the default operators will just copy the pointer and >>>>>>> everything will break because you will end up with a dangling >>>>>>> pointer. >>>>>>> >>>>>>> If you go for the second option you'll want the move constructor and >>>>>>> move >>>>>>> operator too. >>>>>> Should the first option also be viable if text_box becomes a move-only >>>>>> type, i.e. no copying but move construction and assignment? I think >>>>>> std::vector can handle it. (Another easy option albeit with some >>>>>> overhead to get automatic destruction would be to wrap the actual data >>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>> >>>>>> Best regards, Adam. >>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Albert >>>>>>>> Regards, >>>>>>>> mpsuzuki >>>>>>>> >>>>>>>> Albert Astals Cid wrote: >>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen Ooms >>>>>>>>> va >>>>>>>>> >>>>>>>>> escriure: >>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf >>>>>>>>>> document, >>>>>>>>>> including the position of each text box? Currently we use >>>>>>>>>> page->text() >>>>>>>>>> with page::physical_layout which gives all text per page, but I >>>>>>>>>> need >>>>>>>>>> more detailed information about each text box per page. >>>>>>>>> You want to code the variant of qt5 frontend >>>>>>>>> Poppler::Page::textList() for >>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration (i.e. >>>>>>>>> almost-copying) the code, do you have time for it? >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Albert >>>>>>>>>> _______________________________________________ >>>>>>>>>> poppler mailing list >>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>> _______________________________________________ >>>>>>>>> poppler mailing list >>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>> _______________________________________________ >>>>>>> poppler mailing list >>>>>>> poppler@lists.freedesktop.org >>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > > > _______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler