I'd like to simplify the site-scan.rb/site.cgi processing by
centralizing the core patterns that both scripts search for in
site-scan.rb.  While I appreciate the original design motivation, we
currently have duplicate regexes, more people are interested in using
the results of the site scan (esp. with events), and officers may
request changes to the requirements.

Roughly, I'd like to move most of CHECKS into site-scan.rb for
simplicity and use those entries to implement most of the link scans.
Some scans need more logic and would stay custom, but the rest can be
mechanical.

CHECKS = {
  'events'      =>
    [
      '',
      # a_text regex to scan for - for events, we don't care, so blank
      '/apache.org/events',
      # a_href minimal regex to capture - for events, this tells us what
      # link to capture from the page
      %r{^https?://.*apache.org/events/current-event}
      # a_href full regex to expect for compliance (used in site.cgi)
    ],

  'license'     =>
    [
      '/licenses?/',
      # a_text regex to scan for - for license, this is required
      'apache.org',
      # a_href minimal regex to capture - for license, we only capture
      # the link if it points to apache.org
      %r{^https?://.*apache.org/licenses/$}
      # a_href full regex to expect for compliance; it must point to one
      # of our actual licenses to pass
    ],
  # ...etc.
}
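
To illustrate what "mechanical" could look like, here is a rough sketch
of a table-driven scan.  It is not the current site-scan.rb code: the
names CHECKS_SKETCH, scan_links and compliant? are made up for this
example, the patterns are written as Regexp literals rather than the
strings above, and links are assumed to already be extracted from the
page as [text, href] pairs.

```ruby
# Hypothetical table: name => [a_text regex (nil = don't care),
#                              a_href capture regex,
#                              a_href compliance regex]
CHECKS_SKETCH = {
  'events'  => [nil,
                %r{apache\.org/events},
                %r{^https?://.*apache\.org/events/current-event}],
  'license' => [/licenses?/i,
                %r{apache\.org},
                %r{^https?://.*apache\.org/licenses/$}],
}.freeze

# Capture the first link on the page that satisfies each check.
# links is an array of [text, href] pairs (an assumption for this sketch).
def scan_links(links, checks = CHECKS_SKETCH)
  results = {}
  links.each do |text, href|
    checks.each do |name, (text_re, capture_re, _compliance_re)|
      next if results.key?(name)                       # keep the first hit
      next if text_re && !text_re.match?(text.to_s)    # link text must match
      next unless capture_re.match?(href.to_s)         # href must match
      results[name] = href
    end
  end
  results
end

# What site.cgi might ask: does the captured link meet the requirement?
def compliant?(name, href, checks = CHECKS_SKETCH)
  !href.nil? && checks.fetch(name)[2].match?(href)
end
```

With something like this, site-scan.rb would only store what
scan_links returns, and site.cgi would call compliant? against the same
table, so each regex lives in exactly one place.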

Any overall objections?  It's making me twitchy seeing most of the
regexes we use for scanning in separate places.

--

- Shane
  Director & Member
  The Apache Software Foundation
