El dijous, 8 de febrer de 2018, a les 4:17:33 CET, suzuki toshiya va escriure: > Dear Albert, > > > Can you please try adding some documentation in the cpp file like the > > other > > functions have? > > Oh, sorry for overlooking that task. Here it is...
Commited. > > Regards, > mpsuzuki > > Albert Astals Cid wrote: > > El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va > > > > escriure: > >> Dear Albert, > >> > >>> As far as i can see we don't use exceptions anywhere in the poppler > >>> code, > >>> so i would not introduce them "just for this". > >> > >> Oh, I'm sorry for my overlooking that. Here is updated patch which checks > >> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index > >> is passed. Also I changed the type of the index from int to size_t. > > > > Can you please try adding some documentation in the cpp file like the > > other > > functions have? > > > > Cheers, > > > > Albert > >> > >> Regards, > >> mpsuzuki > >> > >> Albert Astals Cid wrote: > >>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va > >>> > >>> escriure: > >>>> Dear Albert, > >>>> > >>>> Oops, I apologize that I overlooked your response to > >>>> improve the patch. > >>>> > >>>>> You have one text_box where you have 17 char bboxes but the > >>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a > >>>>> crash > >>>>> because you get an out of bounds. > >>>> > >>>> Indeed, it might be common scenarios for the > >>>> programmers assuming the input as ASCII data but > >>>> something different is given. > >>>> > >>>>> So it would be nice to make > >>>>> rectf char_bbox(int i) const; > >>>>> not crash on an invalid i, i guess return a 0x0 rect > >>>>> and then probably add some documentation about it? > >>>> > >>>> Checking the features of C++ std::vector, I think > >>>> we have 2 options: > >>>> > >>>> a) use (as current patch) m_data->char_bbox[i] and > >>>> sanitize the i-index before using it (as you proposed), > >>>> and the bad values cause 0x0 rectf. > >>>> > >>>> b) use m_data->char_bbox.at(i) and let stdlib raise > >>>> std::out_of_range error. > >>>> > >>>> which is better for C++ programmers? > >>> > >>> As far as i can see we don't use exceptions anywhere in the poppler > >>> code, > >>> so i would not introduce them "just for this". > >>> > >>> Cheers, > >>> > >>> Albert > >>> > >>>>> FWIW this was a real problem in Okular and i fixed it with > >>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pd > >>>>> f. > >>>>> cp > >>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > >>>>> > >>>>> Am i making any sense here? > >>>> > >>>> Thank you very much for the important suggestion and > >>>> sorry for delayed reponse! > >>>> > >>>> Regards, > >>>> mpsuzuki > >>>> > >>>> Albert Astals Cid wrote: > >>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va > >>> > >>> escriure: > >>>>>> Hello mpsuzuki, > >>>>>> > >>>>>> I do not think you need the two indirections: Since you want to keep > >>>>>> that m_data indirection for ABI compatibility, just return > >>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its > >>>>>> data with the additional getter methods. (You just need to make sure > >>>>>> to > >>>>>> default the move constructor and assignment operator since no copying > >>>>>> implies no moving by default.) > >>>>>> > >>>>>> Attached is an update patch along those lines. (I also took the > >>>>>> liberty > >>>>>> to replace some bare deletes by use of unique_ptr in the > >>>>>> implementation > >>>>>> of page::text_list as well.) > >>>>> > >>>>> One thing that i wanted to improve in the Qt5 frontend but never go > >>>>> to, > >>>>> maybe we can start here is the fact that the number of boxes is "a bit > >>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189 > >>>>> > >>>>> You have one text_box where you have 17 char bboxes but the > >>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a > >>>>> crash > >>>>> because you get an out of bounds. > >>>>> > >>>>> So it would be nice to make > >>>>> rectf char_bbox(int i) const; > >>>>> not crash on an invalid i, i guess return a 0x0 rect > >>>>> and then probably add some documentation about it? > >>>>> > >>>>> FWIW this was a real problem in Okular and i fixed it with > >>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pd > >>>>> f. > >>>>> cp > >>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 > >>>>> > >>>>> Am i making any sense here? > >>>>> > >>>>> Cheers, > >>>>> > >>>>> Albert > >>>>>> > >>>>>> Best regards, Adam. > >>>>>> > >>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: > >>>>>>> Dear Adam, > >>>>>>> > >>>>>>> Thank you for the technical information on the difference of > >>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace > >>>>>>> the vector of the pointers to that of the unique_ptrs, and > >>>>>>> remove the next_xxx methods. > >>>>>>> > >>>>>>> I attached the revised patch (gzipped) to the master, and > >>>>>>> here I quote the revised part from my last revision, for > >>>>>>> convenience to take a look on it. > >>>>>>> > >>>>>>> Regards, > >>>>>>> mpsuzuki > >>>>>>> > >>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h > >>>>>>> index 726068a..2cedd1e 100644 > >>>>>>> --- a/cpp/poppler-private.h > >>>>>>> +++ b/cpp/poppler-private.h > >>>>>>> @@ -72,12 +72,11 @@ class text_box_data > >>>>>>> > >>>>>>> { > >>>>>>> > >>>>>>> public: > >>>>>>> text_box_data() > >>>>>>> > >>>>>>> - : next_text_box(0), has_space_after(false) > >>>>>>> + : has_space_after(false) > >>>>>>> > >>>>>>> { > >>>>>>> } > >>>>>>> ustring text; > >>>>>>> rectf bbox; > >>>>>>> > >>>>>>> - text_box *next_text_box; > >>>>>>> > >>>>>>> std::vector<rectf> char_bboxes; > >>>>>>> bool has_space_after; > >>>>>>> > >>>>>>> }; > >>>>>>> > >>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h > >>>>>>> index 18af213..3013997 100644 > >>>>>>> --- a/cpp/poppler-page.h > >>>>>>> +++ b/cpp/poppler-page.h > >>>>>>> @@ -22,6 +22,8 @@ > >>>>>>> > >>>>>>> #include "poppler-global.h" > >>>>>>> #include "poppler-rectangle.h" > >>>>>>> > >>>>>>> +#include <memory> > >>>>>>> + > >>>>>>> > >>>>>>> namespace poppler > >>>>>>> { > >>>>>>> > >>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { > >>>>>>> > >>>>>>> ~text_box(); > >>>>>>> ustring text() const; > >>>>>>> rectf bbox() const; > >>>>>>> > >>>>>>> - text_box *next_text_box() const; > >>>>>>> - text_box *next_word() { return this->next_text_box(); }; > >>>>>>> > >>>>>>> rectf char_bbox(int i) const; > >>>>>>> bool has_space_after() const; > >>>>>>> > >>>>>>> private: > >>>>>>> - text_box_data* m_data; > >>>>>>> + std::unique_ptr<text_box_data> m_data; > >>>>>>> > >>>>>>> }; > >>>>>>> > >>>>>>> class document; > >>>>>>> > >>>>>>> @@ -79,7 +79,7 @@ public: > >>>>>>> ustring text(const rectf &rect = rectf()) const; > >>>>>>> ustring text(const rectf &rect, text_layout_enum layout_mode) > >>>>>>> const; > >>>>>>> > >>>>>>> - std::vector<text_box*> text_list(rotation_enum rotation) const; > >>>>>>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum > >>>>>>> rotation) const;> > >>>>>>> > >>>>>>> private: > >>>>>>> page(document_private *doc, int index); > >>>>>>> > >>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp > >>>>>>> index 9461ab9..5da2e1d 100644 > >>>>>>> --- a/cpp/poppler-page.cpp > >>>>>>> +++ b/cpp/poppler-page.cpp > >>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, > >>>>>>> text_layout_enum > >>>>>>> layout_mode) const > >>>>>>> > >>>>>>> */ > >>>>>>> > >>>>>>> text_box::text_box(const ustring& text, const rectf &bbox) > >>>>>>> { > >>>>>>> > >>>>>>> - m_data = new text_box_data(); > >>>>>>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); > >>>>>>> > >>>>>>> m_data->text = text; > >>>>>>> m_data->bbox = bbox; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> text_box::~text_box() > >>>>>>> { > >>>>>>> > >>>>>>> - delete m_data; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> ustring text_box::text() const > >>>>>>> > >>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const > >>>>>>> > >>>>>>> return m_data->bbox; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> -text_box* text_box::next_text_box() const > >>>>>>> -{ > >>>>>>> - return m_data->next_text_box; > >>>>>>> -} > >>>>>>> - > >>>>>>> > >>>>>>> rectf text_box::char_bbox(int i) const > >>>>>>> { > >>>>>>> > >>>>>>> return m_data->char_bboxes[i]; > >>>>>>> > >>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const > >>>>>>> > >>>>>>> return m_data->has_space_after; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const > >>>>>>> +std::vector<std::unique_ptr<text_box>> > >>>>>>> page::text_list(rotation_enum > >>>>>>> rotate) const> > >>>>>>> > >>>>>>> { > >>>>>>> > >>>>>>> TextOutputDev *output_dev; > >>>>>>> > >>>>>>> - std::vector<text_box*> output_list; > >>>>>>> + std::vector<std::unique_ptr<text_box>> output_list; > >>>>>>> > >>>>>>> const int rotation_value = (int)rotate * 90; > >>>>>>> > >>>>>>> /* config values are same with Qt5 Page::TextList() */ > >>>>>>> > >>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*> > >>>>>>> page::text_list(rotation_enum > >>>>>>> rotate) const > >>>>>>> > >>>>>>> double xMin, yMin, xMax, yMax; > >>>>>>> word->getBBox(&xMin, &yMin, &xMax, &yMax); > >>>>>>> > >>>>>>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, > >>>>>>> yMax-yMin)); + std::unique_ptr<text_box> tb = > >>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, > >>>>>>> xMax-xMin, > >>>>>>> yMax-yMin))); > >>>>>>> > >>>>>>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); > >>>>>>> > >>>>>>> tb->m_data->char_bboxes.reserve(word->getLength()); > >>>>>>> > >>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*> > >>>>>>> page::text_list(rotation_enum > >>>>>>> rotate) const > >>>>>>> > >>>>>>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax- > > > > xMin, > > > >>>>>>> yMax-yMin)); > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> - if (output_list.size() > 0) > >>>>>>> - output_list.back()->m_data->next_text_box = tb; > >>>>>>> - > >>>>>>> - output_list.push_back(tb); > >>>>>>> + output_list.push_back(std::move(tb)); > >>>>>>> > >>>>>>> } > >>>>>>> delete word_list; > >>>>>>> delete output_dev; > >>>>>>> > >>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp > >>>>>>> index 09b180d..d26a0ef 100644 > >>>>>>> --- a/cpp/tests/poppler-dump.cpp > >>>>>>> +++ b/cpp/tests/poppler-dump.cpp > >>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page > >>>>>>> *p) > >>>>>>> > >>>>>>> std::cout << std::endl; > >>>>>>> return; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> - std::vector<poppler::text_box*> text_list = > >>>>>>> p->text_list(poppler::rotate_0); + > >>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list = > >>>>>>> p->text_list(poppler::rotate_0); > >>>>>>> > >>>>>>> std::cout << "---" << std::endl; > >>>>>>> for (size_t i = 0; i < text_list.size(); i ++) { > >>>>>>> > >>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page > >>>>>>> *p) > >>>>>>> > >>>>>>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " > >>>>>>> w=" > >>>>>>> << > >>>>>>> > >>>>>>> bbox.width() << " h=" << bbox.height() << " )"; > >>>>>>> > >>>>>>> std::cout << std::endl; > >>>>>>> > >>>>>>> - delete text_list[i]; > >>>>>>> > >>>>>>> } > >>>>>>> std::cout << "---" << std::endl; > >>>>>>> > >>>>>>> } > >>>>>>> > >>>>>>> Adam Reichold wrote: > >>>>>>>> Hello mpsuzuki, > >>>>>>>> > >>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 > >>>>>>>> only > >>>>>>>> since a relatively short time, so I guess there just was no time to > >>>>>>>> use > >>>>>>>> it anywhere. > >>>>>>>> > >>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic > >>>>>>>> memory accesses on the reference counter which will slow down even > >>>>>>>> single-threaded programs which do not need it. Hence I would > >>>>>>>> suggest > >>>>>>>> starting with std::unique_ptr to get a move-only type. > >>>>>>>> > >>>>>>>> Best regards, Adam. > >>>>>>>> > >>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: > >>>>>>>>> Dear Albert, Adam, > >>>>>>>>> > >>>>>>>>> Thank you for the helpful review. I should have more > >>>>>>>>> training on C++... Please let me spend to improve the > >>>>>>>>> patch. > >>>>>>>>> > >>>>>>>>>> (Another easy option albeit with some > >>>>>>>>>> overhead to get automatic destruction would be to wrap the actual > >>>>>>>>>> data > >>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) > >>>>>>>>> > >>>>>>>>> Yeah, from some posts in stackoverflow, I found about > >>>>>>>>> shared_ptr... But it seems that current poppler does > >>>>>>>>> not use shared_ptr at all. Is this a coding policy? > >>>>>>>>> > >>>>>>>>> Regards, > >>>>>>>>> mpsuzuki > >>>>>>>>> > >>>>>>>>> Adam Reichold wrote: > >>>>>>>>>> Hello > >>>>>>>>>> > >>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: > >>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki > >>>>>>>>>>> toshiya > >>>>>>>>>>> va > >>>>>>>>>>> > >>>>>>>>>>> escriure: > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> I've tried to implement the suggestion, I attached my current > >>>>>>>>>>>> patch. > >>>>>>>>>>>> > >>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and > >>>>>>>>>>>> renamed, > >>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly > >>>>>>>>>>>> confusing, > >>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote > >>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which > >>>>>>>>>>>> calls next_text_box() internally. > >>>>>>>>>>>> > >>>>>>>>>>>> Another point I want to discuss is the design of the list give > >>>>>>>>>>>> by > >>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() > >>>>>>>>>>>> returns > >>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns > >>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend. > >>>>>>>>>>>> > >>>>>>>>>>>> But, if we return the vector of pointers, the client should > >>>>>>>>>>>> destruct > >>>>>>>>>>>> the objects pointed by the vector, before destructing vector > >>>>>>>>>>>> itself. > >>>>>>>>>>>> Using a vector of text_box (not the pointer but the object > >>>>>>>>>>>> itself), > >>>>>>>>>>>> like std::vector<text_box>, could be better, because the > >>>>>>>>>>>> destructor > >>>>>>>>>>>> of the vector would internally call the destructor for text_box > >>>>>>>>>>>> object. > >>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have > >>>>>>>>>>>> such). > >>>>>>>>>>>> If I'm misunderstanding about C++, please correct. > >>>>>>>>>>> > >>>>>>>>>>> First you need to fix the text_box class, you either need to > >>>>>>>>>>> make > >>>>>>>>>>> it > >>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you > >>>>>>>>>>> need > >>>>>>>>>>> to make the copy and assignment operators not break your class > >>>>>>>>>>> since > >>>>>>>>>>> at the moment the default operators will just copy the pointer > >>>>>>>>>>> and > >>>>>>>>>>> everything will break because you will end up with a dangling > >>>>>>>>>>> pointer. > >>>>>>>>>>> > >>>>>>>>>>> If you go for the second option you'll want the move constructor > >>>>>>>>>>> and > >>>>>>>>>>> move > >>>>>>>>>>> operator too. > >>>>>>>>>> > >>>>>>>>>> Should the first option also be viable if text_box becomes a > >>>>>>>>>> move-only > >>>>>>>>>> type, i.e. no copying but move construction and assignment? I > >>>>>>>>>> think > >>>>>>>>>> std::vector can handle it. (Another easy option albeit with some > >>>>>>>>>> overhead to get automatic destruction would be to wrap the actual > >>>>>>>>>> data > >>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) > >>>>>>>>>> > >>>>>>>>>> Best regards, Adam. > >>>>>>>>>> > >>>>>>>>>>> Cheers, > >>>>>>>>>>> > >>>>>>>>>>> Albert > >>>>>>>>>>>> > >>>>>>>>>>>> Regards, > >>>>>>>>>>>> mpsuzuki > >>>>>>>>>>>> > >>>>>>>>>>>> Albert Astals Cid wrote: > >>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, > >>>>>>>>>>>>> Jeroen > >>>>>>>>>>>>> Ooms > >>>>>>>>>>>>> va > >>>>>>>>>>>>> > >>>>>>>>>>>>> escriure: > >>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf > >>>>>>>>>>>>>> document, > >>>>>>>>>>>>>> including the position of each text box? Currently we use > >>>>>>>>>>>>>> page->text() > >>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but > >>>>>>>>>>>>>> I > >>>>>>>>>>>>>> need > >>>>>>>>>>>>>> more detailed information about each text box per page. > >>>>>>>>>>>>> > >>>>>>>>>>>>> You want to code the variant of qt5 frontend > >>>>>>>>>>>>> Poppler::Page::textList() for > >>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration > >>>>>>>>>>>>> (i.e. > >>>>>>>>>>>>> almost-copying) the code, do you have time for it? > >>>>>>>>>>>>> > >>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Albert > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>>> poppler mailing list > >>>>>>>>>>>>>> poppler@lists.freedesktop.org > >>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > >>>>>>>>>>>>> > >>>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>>> poppler mailing list > >>>>>>>>>>>>> poppler@lists.freedesktop.org > >>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > >>>>>>>>>>> > >>>>>>>>>>> _______________________________________________ > >>>>>>>>>>> poppler mailing list > >>>>>>>>>>> poppler@lists.freedesktop.org > >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > > > > _______________________________________________ > > poppler mailing list > > poppler@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler