El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va escriure: > Dear Albert, > > Oops, I apologize that I overlooked your response to > improve the patch. > > > You have one text_box where you have 17 char bboxes but the > > ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash > > because you get an out of bounds. > Indeed, it might be common scenarios for the > programmers assuming the input as ASCII data but > something different is given. > > > So it would be nice to make > > rectf char_bbox(int i) const; > > not crash on an invalid i, i guess return a 0x0 rect > > and then probably add some documentation about it? > > Checking the features of C++ std::vector, I think > we have 2 options: > > a) use (as current patch) m_data->char_bbox[i] and > sanitize the i-index before using it (as you proposed), > and the bad values cause 0x0 rectf. > > b) use m_data->char_bbox.at(i) and let stdlib raise > std::out_of_range error. > > which is better for C++ programmers?
As far as i can see we don't use exceptions anywhere in the poppler code, so i would not introduce them "just for this". Cheers, Albert > > > FWIW this was a real problem in Okular and i fixed it with > > https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cp > > p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > > > > Am i making any sense here? > > Thank you very much for the important suggestion and > sorry for delayed reponse! > > Regards, > mpsuzuki > > Albert Astals Cid wrote: > > El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va escriure: > >> Hello mpsuzuki, > >> > >> I do not think you need the two indirections: Since you want to keep > >> that m_data indirection for ABI compatibility, just return > >> std::vector<text_box> as text_box basically "is" a unique_ptr to its > >> data with the additional getter methods. (You just need to make sure to > >> default the move constructor and assignment operator since no copying > >> implies no moving by default.) > >> > >> Attached is an update patch along those lines. (I also took the liberty > >> to replace some bare deletes by use of unique_ptr in the implementation > >> of page::text_list as well.) > > > > One thing that i wanted to improve in the Qt5 frontend but never go to, > > maybe we can start here is the fact that the number of boxes is "a bit > > random", for example take https://bugs.kde.org/attachment.cgi?id=66189 > > > > You have one text_box where you have 17 char bboxes but the > > ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash > > because you get an out of bounds. > > > > So it would be nice to make > > rectf char_bbox(int i) const; > > not crash on an invalid i, i guess return a 0x0 rect > > and then probably add some documentation about it? > > > > FWIW this was a real problem in Okular and i fixed it with > > https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cp > > p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > > > > Am i making any sense here? > > > > Cheers, > > > > Albert > >> > >> Best regards, Adam. > >> > >> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: > >>> Dear Adam, > >>> > >>> Thank you for the technical information on the difference of > >>> std::shared_ptr and std::unique_ptr. Just I've tried to replace > >>> the vector of the pointers to that of the unique_ptrs, and > >>> remove the next_xxx methods. > >>> > >>> I attached the revised patch (gzipped) to the master, and > >>> here I quote the revised part from my last revision, for > >>> convenience to take a look on it. > >>> > >>> Regards, > >>> mpsuzuki > >>> > >>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h > >>> index 726068a..2cedd1e 100644 > >>> --- a/cpp/poppler-private.h > >>> +++ b/cpp/poppler-private.h > >>> @@ -72,12 +72,11 @@ class text_box_data > >>> > >>> { > >>> > >>> public: > >>> text_box_data() > >>> > >>> - : next_text_box(0), has_space_after(false) > >>> + : has_space_after(false) > >>> > >>> { > >>> } > >>> ustring text; > >>> rectf bbox; > >>> > >>> - text_box *next_text_box; > >>> > >>> std::vector<rectf> char_bboxes; > >>> bool has_space_after; > >>> > >>> }; > >>> > >>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h > >>> index 18af213..3013997 100644 > >>> --- a/cpp/poppler-page.h > >>> +++ b/cpp/poppler-page.h > >>> @@ -22,6 +22,8 @@ > >>> > >>> #include "poppler-global.h" > >>> #include "poppler-rectangle.h" > >>> > >>> +#include <memory> > >>> + > >>> > >>> namespace poppler > >>> { > >>> > >>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { > >>> > >>> ~text_box(); > >>> ustring text() const; > >>> rectf bbox() const; > >>> > >>> - text_box *next_text_box() const; > >>> - text_box *next_word() { return this->next_text_box(); }; > >>> > >>> rectf char_bbox(int i) const; > >>> bool has_space_after() const; > >>> > >>> private: > >>> - text_box_data* m_data; > >>> + std::unique_ptr<text_box_data> m_data; > >>> > >>> }; > >>> > >>> class document; > >>> > >>> @@ -79,7 +79,7 @@ public: > >>> ustring text(const rectf &rect = rectf()) const; > >>> ustring text(const rectf &rect, text_layout_enum layout_mode) > >>> const; > >>> > >>> - std::vector<text_box*> text_list(rotation_enum rotation) const; > >>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum > >>> rotation) const;> > >>> > >>> private: > >>> page(document_private *doc, int index); > >>> > >>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp > >>> index 9461ab9..5da2e1d 100644 > >>> --- a/cpp/poppler-page.cpp > >>> +++ b/cpp/poppler-page.cpp > >>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, > >>> text_layout_enum > >>> layout_mode) const > >>> > >>> */ > >>> > >>> text_box::text_box(const ustring& text, const rectf &bbox) > >>> { > >>> > >>> - m_data = new text_box_data(); > >>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); > >>> > >>> m_data->text = text; > >>> m_data->bbox = bbox; > >>> > >>> } > >>> > >>> text_box::~text_box() > >>> { > >>> > >>> - delete m_data; > >>> > >>> } > >>> > >>> ustring text_box::text() const > >>> > >>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const > >>> > >>> return m_data->bbox; > >>> > >>> } > >>> > >>> -text_box* text_box::next_text_box() const > >>> -{ > >>> - return m_data->next_text_box; > >>> -} > >>> - > >>> > >>> rectf text_box::char_bbox(int i) const > >>> { > >>> > >>> return m_data->char_bboxes[i]; > >>> > >>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const > >>> > >>> return m_data->has_space_after; > >>> > >>> } > >>> > >>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const > >>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum > >>> rotate) const> > >>> > >>> { > >>> > >>> TextOutputDev *output_dev; > >>> > >>> - std::vector<text_box*> output_list; > >>> + std::vector<std::unique_ptr<text_box>> output_list; > >>> > >>> const int rotation_value = (int)rotate * 90; > >>> > >>> /* config values are same with Qt5 Page::TextList() */ > >>> > >>> @@ -366,7 +360,7 @@ std::vector<text_box*> page::text_list(rotation_enum > >>> rotate) const > >>> > >>> double xMin, yMin, xMax, yMax; > >>> word->getBBox(&xMin, &yMin, &xMax, &yMax); > >>> > >>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, > >>> yMax-yMin)); + std::unique_ptr<text_box> tb = > >>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, > >>> xMax-xMin, > >>> yMax-yMin))); > >>> > >>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); > >>> > >>> tb->m_data->char_bboxes.reserve(word->getLength()); > >>> > >>> @@ -375,10 +369,7 @@ std::vector<text_box*> > >>> page::text_list(rotation_enum > >>> rotate) const > >>> > >>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-xMin, > >>> yMax-yMin)); > >>> > >>> } > >>> > >>> - if (output_list.size() > 0) > >>> - output_list.back()->m_data->next_text_box = tb; > >>> - > >>> - output_list.push_back(tb); > >>> + output_list.push_back(std::move(tb)); > >>> > >>> } > >>> delete word_list; > >>> delete output_dev; > >>> > >>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp > >>> index 09b180d..d26a0ef 100644 > >>> --- a/cpp/tests/poppler-dump.cpp > >>> +++ b/cpp/tests/poppler-dump.cpp > >>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p) > >>> > >>> std::cout << std::endl; > >>> return; > >>> > >>> } > >>> > >>> - std::vector<poppler::text_box*> text_list = > >>> p->text_list(poppler::rotate_0); + > >>> std::vector<std::unique_ptr<poppler::text_box>> text_list = > >>> p->text_list(poppler::rotate_0); > >>> > >>> std::cout << "---" << std::endl; > >>> for (size_t i = 0; i < text_list.size(); i ++) { > >>> > >>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p) > >>> > >>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" > >>> << > >>> > >>> bbox.width() << " h=" << bbox.height() << " )"; > >>> > >>> std::cout << std::endl; > >>> > >>> - delete text_list[i]; > >>> > >>> } > >>> std::cout << "---" << std::endl; > >>> > >>> } > >>> > >>> Adam Reichold wrote: > >>>> Hello mpsuzuki, > >>>> > >>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 only > >>>> since a relatively short time, so I guess there just was no time to use > >>>> it anywhere. > >>>> > >>>> In any case it is a pretty heavy-handed solution due to the atomic > >>>> memory accesses on the reference counter which will slow down even > >>>> single-threaded programs which do not need it. Hence I would suggest > >>>> starting with std::unique_ptr to get a move-only type. > >>>> > >>>> Best regards, Adam. > >>>> > >>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: > >>>>> Dear Albert, Adam, > >>>>> > >>>>> Thank you for the helpful review. I should have more > >>>>> training on C++... Please let me spend to improve the > >>>>> patch. > >>>>> > >>>>>> (Another easy option albeit with some > >>>>>> overhead to get automatic destruction would be to wrap the actual > >>>>>> data > >>>>>> inside a std::shared_ptr and hence use reference counting.) > >>>>> > >>>>> Yeah, from some posts in stackoverflow, I found about > >>>>> shared_ptr... But it seems that current poppler does > >>>>> not use shared_ptr at all. Is this a coding policy? > >>>>> > >>>>> Regards, > >>>>> mpsuzuki > >>>>> > >>>>> Adam Reichold wrote: > >>>>>> Hello > >>>>>> > >>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: > >>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki > >>>>>>> toshiya > >>>>>>> va > >>>>>>> > >>>>>>> escriure: > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> I've tried to implement the suggestion, I attached my current > >>>>>>>> patch. > >>>>>>>> > >>>>>>>> As suggested, the most part is just copied from Qt frontend and > >>>>>>>> renamed, > >>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing, > >>>>>>>> because the returned object is a pointer to TextBox. I wrote > >>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which > >>>>>>>> calls next_text_box() internally. > >>>>>>>> > >>>>>>>> Another point I want to discuss is the design of the list give by > >>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() > >>>>>>>> returns > >>>>>>>> QList<TextBox*>. For similarity, current patch returns > >>>>>>>> std::vector<text_box*> for similarity to Qt frontend. > >>>>>>>> > >>>>>>>> But, if we return the vector of pointers, the client should > >>>>>>>> destruct > >>>>>>>> the objects pointed by the vector, before destructing vector > >>>>>>>> itself. > >>>>>>>> Using a vector of text_box (not the pointer but the object itself), > >>>>>>>> like std::vector<text_box>, could be better, because the destructor > >>>>>>>> of the vector would internally call the destructor for text_box > >>>>>>>> object. > >>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have such). > >>>>>>>> If I'm misunderstanding about C++, please correct. > >>>>>>> > >>>>>>> First you need to fix the text_box class, you either need to make it > >>>>>>> uncopiable (and thus making your suggestion not possible) or you > >>>>>>> need > >>>>>>> to make the copy and assignment operators not break your class since > >>>>>>> at the moment the default operators will just copy the pointer and > >>>>>>> everything will break because you will end up with a dangling > >>>>>>> pointer. > >>>>>>> > >>>>>>> If you go for the second option you'll want the move constructor and > >>>>>>> move > >>>>>>> operator too. > >>>>>> > >>>>>> Should the first option also be viable if text_box becomes a > >>>>>> move-only > >>>>>> type, i.e. no copying but move construction and assignment? I think > >>>>>> std::vector can handle it. (Another easy option albeit with some > >>>>>> overhead to get automatic destruction would be to wrap the actual > >>>>>> data > >>>>>> inside a std::shared_ptr and hence use reference counting.) > >>>>>> > >>>>>> Best regards, Adam. > >>>>>> > >>>>>>> Cheers, > >>>>>>> > >>>>>>> Albert > >>>>>>>> > >>>>>>>> Regards, > >>>>>>>> mpsuzuki > >>>>>>>> > >>>>>>>> Albert Astals Cid wrote: > >>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen > >>>>>>>>> Ooms > >>>>>>>>> va > >>>>>>>>> > >>>>>>>>> escriure: > >>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf > >>>>>>>>>> document, > >>>>>>>>>> including the position of each text box? Currently we use > >>>>>>>>>> page->text() > >>>>>>>>>> with page::physical_layout which gives all text per page, but I > >>>>>>>>>> need > >>>>>>>>>> more detailed information about each text box per page. > >>>>>>>>> > >>>>>>>>> You want to code the variant of qt5 frontend > >>>>>>>>> Poppler::Page::textList() for > >>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration (i.e. > >>>>>>>>> almost-copying) the code, do you have time for it? > >>>>>>>>> > >>>>>>>>> Cheers, > >>>>>>>>> > >>>>>>>>> Albert > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> poppler mailing list > >>>>>>>>>> poppler@lists.freedesktop.org > >>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> poppler mailing list > >>>>>>>>> poppler@lists.freedesktop.org > >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> poppler mailing list > >>>>>>> poppler@lists.freedesktop.org > >>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler