David K. Storrs wrote on 12/09/2015 08:50 PM:
1) Is there a web-spidering package that people recommend?  I could use wget 
and then parse things from disk, but I'd like to have something that's easily 
composable into CLI scripts.


I've done a lot of Web crawling and scraping successfully with Racket and Scheme over the last 14-15 years. I released an HTML parser (http://www.neilvandyke.org/racket-html-parsing/), which I still use today. From that parse, you can then extract the info you need with `sxml-match` (http://planet.racket-lang.org/display.ss?package=sxml-match.plt&owner=jim) and/or SXPath. For HTTP, the client modules that ship with Racket are often satisfactory; at other times I've used my own packages that implement HTTP in pure Racket, or that wrap `curl` or `wget` for special requirements. For storing pages and links/metadata, there's the filesystem, Racket's core RDBMS database support, and cloud stores like AWS S3. The un-AJAX-ing and site-specific scraping behavior you might have to do yourself, if you need it. (I have a backlog of related tools to release someday.)
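
To make the parse-then-extract step concrete, here is a minimal sketch, not a finished spider. It assumes the `html-parsing` and `sxml` packages are installed from the package catalog (for example with `raco pkg install html-parsing sxml`), and it uses SXPath rather than `sxml-match`. The names `extract-links` and `fetch-links` are made up for illustration.

```racket
#lang racket/base
;; Sketch only: assumes the third-party `html-parsing` and `sxml`
;; packages are installed; `net/url` ships with Racket.
(require net/url
         (only-in html-parsing html->xexp)
         (only-in sxml sxpath))

;; extract-links : (or/c string? input-port?) -> list
;; Parse HTML into an SXML-style tree, then use SXPath to select the
;; text of the href attribute of every <a> element in the document.
(define (extract-links html)
  ((sxpath '(// a @ href *text*)) (html->xexp html)))

;; fetch-links : string? -> list
;; Hypothetical helper: fetch one page with Racket's own HTTP client
;; and hand the response body port straight to extract-links.
;; Note: get-pure-port does not follow redirects unless asked to.
(define (fetch-links url-string)
  (call/input-url (string->url url-string)
                  get-pure-port
                  extract-links))

;; Offline usage on a literal string, no network needed:
(extract-links
 "<p>See <a href=\"/docs\">docs</a> and <a href=\"/blog\">blog</a>.</p>")
```

Since `extract-links` takes either a string or an input port, the same function should work on pages saved to disk by `wget` and on live responses, which keeps it easy to call from a CLI script.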

P.S. Fortunately, the `sxml-match` Racket package has been preserved on the official Racket PLaneT package server :) since the author's Web site, which hosted the package home page, is down or has disappeared.

Neil V.

