On 24 April 2017 at 17:52, Sam Ruby <[email protected]> wrote:
> On Mon, Apr 24, 2017 at 10:34 AM, sebb <[email protected]> wrote:
>> The site-check code currently looks at the link text when searching
>> for required links.
>>
>> Maybe it would make more sense to look for the target URL?
>> That should not vary much, if at all, so it should be easier to find.
>
> If this turns out to be a real problem, both could be extracted.
>
>> Either way, whatever analyses the output probably needs to check that
>> the values are sensible.
>> A License link that points to www.apache.org is not much use, nor is a
>> link to http://www.apache.org/foundation/thanks.html that says
>> "Security"
>
> My thoughts were to split the data gathering and analysis steps.
> That's why when I matched on the text, I provided the link. And when
> I match on the link, I try to gather the text (or img[src]).
Yes, that's what I understood ("whatever analyses the output").
I now think the analyser needs both the link text and the target URL
to allow properly for errors.
Having both makes it more likely to catch misspelt link text and URLs.
Also, the gleaner code probably needs to allow for multiple links
matching the same requirement.
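
To make the split concrete, here is a rough sketch in Python. It is not
the actual site-check code: the names REQUIRED, LinkGleaner, glean and
analyse, and the expected text/URL fragments, are made up for
illustration. The gleaner records every text/href pair without judging
it, and the analyser then compares both against what is expected, so a
link that matches on only one of the two is flagged.

```python
from html.parser import HTMLParser

# Hypothetical expectations: (fragment of link text, fragment of target URL).
REQUIRED = {
    "license": ("license", "apache.org/licenses"),
    "thanks": ("thanks", "apache.org/foundation/thanks"),
    "security": ("security", "apache.org/security"),
}


class LinkGleaner(HTMLParser):
    """Gathering step: record every (text, href) pair, with no judgement."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so wrapped link text compares cleanly.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def glean(html):
    """Return all (text, href) pairs found in the page."""
    gleaner = LinkGleaner()
    gleaner.feed(html)
    gleaner.close()
    return gleaner.links


def analyse(links):
    """Analysis step: keep every candidate per requirement and mark
    whether text and URL agree ("ok") or only one matched ("mismatch")."""
    report = {}
    for name, (text_key, url_key) in REQUIRED.items():
        hits = []
        for text, href in links:
            text_ok = text_key in text.lower()
            url_ok = url_key in href.lower()
            if text_ok or url_ok:
                status = "ok" if text_ok and url_ok else "mismatch"
                hits.append((text, href, status))
        report[name] = hits
    return report
```

With this shape, a "License" link that points at http://www.apache.org/
shows up under "license" as a mismatch, and a link to
http://www.apache.org/foundation/thanks.html labelled "Security" shows
up as a mismatch under both "thanks" and "security". Each requirement
maps to a list, so multiple links are kept rather than only the first.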
> - Sam Ruby