Hello Alexey Serbin, Kudu Jenkins, Andrew Wong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/14572
to look at the new patch set (#3).
Change subject: thirdparty: add gumbo and gumbo-query
......................................................................
thirdparty: add gumbo and gumbo-query
In a follow-on patch I built a simple web crawler for testing the web UI.
Although it's possible to parse HTML for links via string::find, doing so
made for some really ugly and brittle code, so I went searching for a proper
HTML parser.
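To illustrate the brittleness, here is a minimal sketch of what a
string::find-based link extractor might look like. It is not the code from the
follow-on patch: ExtractHrefsNaive is a hypothetical helper written only for
this example, and it assumes links always appear as a double-quoted href
attribute.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Naive link extraction: scan for the literal text `href="` and copy
// everything up to the next double quote. ExtractHrefsNaive is a
// hypothetical helper for illustration only; it is not Kudu code.
std::vector<std::string> ExtractHrefsNaive(const std::string& page) {
  static const std::string kMarker = "href=\"";
  std::vector<std::string> links;
  size_t pos = 0;
  while ((pos = page.find(kMarker, pos)) != std::string::npos) {
    const size_t start = pos + kMarker.size();
    const size_t end = page.find('"', start);
    if (end == std::string::npos) break;  // Unterminated attribute value.
    assert(end >= start);
    links.push_back(page.substr(start, end - start));
    pos = end + 1;
  }
  return links;
}
```

This sketch misses single-quoted or unquoted attribute values, and it picks up
href attributes inside HTML comments or on non-anchor elements such as <link>.
A real parser handles all of these cases.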
I settled on Gumbo [1] (known as gumbo-parser on GitHub), a C library from
Google for parsing HTML5. Although it is quite old and hasn't been updated in
some time, it has been used on Google's web cache and has passed Google's
internal security review. Plus, it has an intuitive API.
To simplify further, I incorporated the gumbo-query C++ library, which adds
a CSS selector [2] API for Gumbo. This drastically simplifies an operation
like finding all the links in a page. Sample code:
  string page;  // The HTML to parse.
  gq::CDocument doc;
  doc.parse(page);
  gq::CSelection sel = doc.find("a");
  for (size_t i = 0; i < sel.nodeNum(); i++) {
    string link = sel.nodeAt(i).attribute("href");
    // <do stuff with link>
  }
Like Gumbo, gumbo-query is old and unmaintained. I had to patch it to
move all of its functionality into a namespace.
While adding the new licenses to thirdparty/LICENSE.txt, I did a bit of
cleanup. The only substantive change was moving curl to the "build-time"
dependencies section; it's not part of the source or binary distribution.
1. https://opensource.googleblog.com/2013/08/gumbo-c-library-for-parsing-html.html
2. https://www.w3schools.com/cssref/css_selectors.asp
Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
---
M CMakeLists.txt
A cmake_modules/FindGumboParser.cmake
A cmake_modules/FindGumboQuery.cmake
M thirdparty/LICENSE.txt
M thirdparty/build-definitions.sh
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
A thirdparty/patches/gumbo-parser-autoconf-263.patch
A thirdparty/patches/gumbo-query-namespace.patch
M thirdparty/vars.sh
10 files changed, 568 insertions(+), 57 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/14572/3
--
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I851d5a78e30c1c79a2de253f59c7377a61aa3ed7
Gerrit-Change-Number: 14572
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)