[julia-users] ANN: Gumbo.jl (HTML parsing library)

James Porter Thu, 19 Jun 2014 08:01:48 -0700

Hi All—

A while back I was working on a webcrawler and I realized we didn't have a 
Julia HTML parser. I also wanted to learn how to wrap C libraries, so I 
started working on a wrapper around Google's gumbo 
<https://github.com/google/gumbo-parser> library for parsing HTML. The 
result, Gumbo.jl, can be found here 
<https://github.com/porterjamesj/Gumbo.jl>. It's by no means 
production-ready, but I am reasonably happy with the API and would love for 
others to do some tire kicking and send feedback, bug reports, etc.


Major thanks to Tony Kelman for helping me whip the build script into shape 
on IRC last night. It *should* build correctly on a Unix system with 
autotools; please file a bug if the build doesn't work for you.

Some things that still need doing if anyone wants to help:

   - Support Windows. If someone wants to build and test binaries of the 
   gumbo DLL, I'm happy to host them and add them to the build script.
   - Support CDATA; I just haven't gotten around to it yet.
   - Performance improvements. I am certainly being very wasteful with 
   memory when translating gumbo's output to Julia types.

I would also love to get general code review from others who write these 
sorts of packages, as well as feedback on the API, so please try it out and 
let me know what you think!

Cheers,
James
