On 9/23/19 12:36 PM, Bobby Casey wrote:
> I love the idea of this and would contribute if I had some time and energy.
> I've written a few Python scrapers before, but those were all years ago and 
> one-offs.

I expect that these are going to generally be one-offs as well....

> If you could share some details about how you've developed the ones you're 
> presently using

I start off by writing a script that just fetches whatever web pages look
relevant into a revision-controlled directory; pretty-prints the HTML with
"hxnormalize -x" (some of these things are inscrutable without going through
that step); runs a vc diff to stdout; and then commits any new/changed files.
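
A rough sketch of that script, assuming git as the VC (the directory layout
is made up; hxnormalize comes from the html-xml-utils package):

        #!/bin/bash
        # Sketch only: fetch the page, pretty-print it, diff, commit.
        set -e
        cd "$HOME/scrape/wilton"
        curl -fsS 'http://www.wiltontownhalltheatre.com/' \
            | hxnormalize -x > index.html
        git add -A
        git diff --cached                  # the "vc diff to stdout" step
        git commit -qm "snapshot $(date -u +%FT%TZ)" || true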

Then I set up a cron job to run that script every few hours (and send me
e-mail whenever the "vc diff to stdout" cited above actually shows a
difference).
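
Something along these lines in the crontab does it (the address and script
path are placeholders); cron only mails me when the job produces output, and
the only output is the diff:

        MAILTO=me@example.org
        # every 3 hours, at 17 minutes past the hour
        17 */3 * * *    $HOME/scrape/wilton/fetch-and-diff.sh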

At that point, while I'm waiting for the `raw' diff-notices, I can look
through the HTML structures, try to identify the parts that I need, and
figure out what the relevant bits of text/data are and how to summarize them
(web.archive.org can be helpful for finding past revisions of the pages,
 if it seems likely that there's something that's just not represented
 in the samples I've got so far).

So then I write code to *summarize* whatever "status items" are described in
the source material into single-line versions (like "NOW PLAYING: ...",
"STARTING ON SUNDAY (DD/MM): ...", or "ENDS TODAY: ..."), and I just write
all of the lines of "current" summary into a text-file.
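
The extraction is obviously site-specific, but the general shape is
something like this (hxselect comes from the same html-xml-utils package as
hxnormalize; the selector and file names are invented, not the real Wilton
markup):

        # pull the headings out of the pretty-printed HTML and
        # turn each one into a single-line "status item"
        hxselect -c -s '\n' 'h2.showtitle' < index.html \
            | sed 's/^/NOW PLAYING: /' > summary.txt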

And I also revision-control the text-file, because that means that looking
for "NEWS items" is just doing a VC diff and looking for lines that start
with "+" :)
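
With git, for instance (and the summary file from the sketch above), that
boils down to:

        # lines added to the summary since the last commit == "news"
        git diff -U0 -- summary.txt | grep '^+[^+]'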

In the case of the Wilton Town Hall Theatre, the text is only *very slightly*
edited from the raw multi-line display format used on
<http://www.wiltontownhalltheatre.com/> (_most_ of the phrasing is copied
verbatim from the source material--I'm just adjusting punctuation and
capitalization, and removing some redundancies and awkwardnesses that become
more apparent when things are condensed into single-line format, e.g.:
"STARTS SUNDAY (10/27) PLAYING FOR ONE DAY ONLY").

The news-item lines are written into my "summary listing" file without
hashtags/bangtags or other markup--the thing that actually *posts* them runs
some sed substitutions to convert phrases like "silent film" and "live
music" to tags.
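
Nothing fancy there; roughly this (the exact tag spellings are just for
illustration):

        # convert known phrases into tags just before posting
        sed -e 's/silent film/#SilentFilm/g' \
            -e 's/live music/#LiveMusic/g'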

Once I'm reasonably sure that a given converter isn't going to just spew
gibberish, or be overly aggressive about re-notifying based on overly-subtle
source-changes that a reasonable person wouldn't actually count as "news",
I add a social account for the bot and set it up to post, per this article:

        https://www.linux.com/tutorials/weekend-project-monitor-your-server-statusnet-updates/


Most of the methodology here comes from the experience of having
collaborated on a "Taco Salad Early Warning System" that a friend started
when we worked together about a decade ago :)

> I assume most lists could be easily developed and maintained with one of the 
> many Python web scraping libraries, although I have little to no experience 
> with them.

I did the Wilton movie-listings scraper in bash ;p

That website is kind of "simultaneously ideal and pessimal" source material:

        * there's not really any "structured data" to pull out,
          mostly just lists of headings, with some details put into HTML tables
        * but (going by the diffs) the HTML seems to be generated automatically
          based on titles/dates/flags set in some sort of database;
          so it seems safe to assume _some level_ of consistency
        * nothing about that site (other than the specific movies/times being
          listed) has changed in *years*--possibly even *decades* at this
          point, and seems unlikely to change any time soon (the web-designer
          referenced at the bottom of the page is actually an HVAC contractor
          who apparently decided to dabble in it at some point when the WWW
          was relatively new; and the whole point of the theatre is "not
          changing with the times").

Milford Drive-In's site is apparently based on WordPress..., but with all of
the blogging/chronology stuff removed.... Looks like there are RSS feeds,
but I've yet to see them actually contain anything; at least the HTML
(provided by movienewsletters.com) includes some structured markup to make
it easier to pick stuff out.

Looks like the Drive-In's autumn update-schedule is "take the listings
offline the day after the showing, decide mid-week what the next set of
movies will be, and put them up then".

Haven't really looked at any other sites yet.

-- 
Connect with me on the GNU social network!
<https://status.hackerposse.com/rozzin>
Not on the network? Ask me for more info!