Re: site scan algorithm and output data

Shane Curcuru Tue, 09 May 2017 05:08:08 -0700

sebb wrote on 5/9/17 6:08 AM:
> The site scanner currently looks for specific links *or* specific text.
> 
> This does not always work well, e.g. httpd uses 'Sponsors' for the
> 'Thanks' link, so it appears to have no link rather than one with an
> 'incorrect' name.
> 
> I think it would be better to search for both the expected text and
> the expected link, and record any matches for either.


Agreed.

Note that the analysis step is unlikely ever to be 100% accurate,
since the current policy is written with the intent in mind, not a
specific formula.  But you're right: looking for both links and text,
and storing the scan data for both, is a great way to improve results.
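
As a rough sketch of the "search for both, record both" idea (this is
not the actual site-scan.rb code; the method name, the check table and
the patterns are all hypothetical, and it assumes the page's anchors
have already been extracted as text/href pairs):

```ruby
# Hypothetical sketch, not the real site-scan.rb.
# Assumes anchors were already extracted as { text:, href: } hashes.

# Each check has an expected link text and an expected URL pattern.
# These patterns are illustrative assumptions only.
CHECKS = {
  sponsorship: {
    text: /\AThanks\z/i,
    link: %r{\Ahttps?://(www\.)?apache\.org/foundation/thanks}i
  }
}.freeze

# For every check, run the text search and the link search independently
# and record both what was searched for and what matched.
def scan_anchors(anchors, checks = CHECKS)
  checks.each_with_object({}) do |(name, expected), results|
    by_text = anchors.select { |a| a[:text].to_s.strip =~ expected[:text] }
    by_link = anchors.select { |a| a[:href].to_s =~ expected[:link] }
    results[name] = {
      # Text matched: record the URL(s) that text points to.
      text: { expected: expected[:text].source, found: by_text.map { |a| a[:href] } },
      # Link matched: record the text(s) used for that link.
      link: { expected: expected[:link].source, found: by_link.map { |a| a[:text] } }
    }
  end
end

anchors = [
  { text: 'Sponsors', href: 'http://www.apache.org/foundation/thanks.html' },
  { text: 'License',  href: 'http://www.apache.org/licenses/' }
]
p scan_anchors(anchors)
```

With the httpd-style input above, the text search finds nothing but the
link search records "Sponsors", so the analysis can tell a missing link
apart from one with an unexpected name.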

Separately, I do think an "approved exceptions" list is, in some cases,
an easier way to improve results than funkier regexes or the like.  See
the concept in "Re: Rename site-check.rb => site-scan.rb?", improved
here to match your additions:

site-exceptions.json
{
  "axis": {
    "trademarks": { "allowed_string": "Trademark Registered of The ASF" },
    "events": { "allowed_url": "http://www.apache.org/special-event" }
  },
  ...
}
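
The analysis could consult such a file before flagging a check as
failed.  A minimal sketch, assuming allowed_string / allowed_url keys as
in the example above; the helper name and its arguments are
hypothetical, and inline JSON stands in for reading the file from disk:

```ruby
require 'json'

# Hypothetical sketch: the file layout and helper name are assumptions.
# Inline JSON stands in for reading site-exceptions.json from disk.
EXCEPTIONS = JSON.parse(<<~EOS)
  {
    "axis": {
      "trademarks": { "allowed_string": "Trademark Registered of The ASF" },
      "events": { "allowed_url": "http://www.apache.org/special-event" }
    }
  }
EOS

# True if the project has an approved exception for this check that
# matches one of the link texts or URLs actually found on the site.
def approved_exception?(exceptions, project, check, found_texts, found_urls)
  entry = exceptions.dig(project, check)
  return false unless entry
  found_texts.include?(entry['allowed_string']) ||
    found_urls.include?(entry['allowed_url'])
end

# axis has an approved URL for its events check; httpd has no entry.
puts approved_exception?(EXCEPTIONS, 'axis', 'events', [], ['http://www.apache.org/special-event'])  # true
puts approved_exception?(EXCEPTIONS, 'httpd', 'events', [], ['http://www.apache.org/special-event']) # false
```

Keeping exceptions as exact strings per project and per check means
each one stays reviewable, rather than being hidden inside a regex.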

> 
> Probably the search targets should also be recorded in the analysis output.
> This should make it easier for the analysis to report what was expected.
> 
> for example:
> 
> httpd: {
>    ...
>    sponsorship: {
>       text: {
>         expected: "Thanks",
>         found: ["http://.../"]
>       },
>       link: {
>         expected: "http://...",
>         found: ["Sponsors"]
>       },
>    }
> }
> 
> 
> Obviously this would mean changes to the analysis as well.
> 
> Thoughts?
> 


-- 

- Shane
  https://www.apache.org/foundation/marks/resources
