> Note that the URL is not quoted.  I'll agree that it should be, but the
> standard doesn't require it, and it often isn't.

        The standard actually _does_ require it:

        http://www.w3.org/TR/html401/intro/sgmltut.html

        "By default, SGML requires that all attribute values be delimited
         using either double quotation marks (ASCII decimal 34) or single
         quotation marks (ASCII decimal 39)."

        "We recommend using quotation marks even when it is possible to
         eliminate them."

> But that means this gets parsed as
>
> attr1:                href="lost
> attr2:        weekend.html" (no value)
> attr3:        title="pics from last weekend"

        By what tool? I think the tool/lib you are using is flawed if it
does this. Consider the following:

###
  use strict;
  use HTML::SimpleLinkExtor;

  my $html        = '
          <a href=../images/smily.gif alt="Happy Face">See the Joy!</a>
          <a href="lost weekend.html" title="pics from last
          weekend">pics!</a>';

  my $ext         = HTML::SimpleLinkExtor->new();
  $ext->parse($html);
  my @a_hrefs     = $ext->a;
  print join("\n", @a_hrefs), "\n";

__RESULT__
../images/smily.gif
lost weekend.html

###

> Alternatively, we could do a prescan, before the parser has a chance to
> tokenize things, and clean it up then.  (As you probably do in Perl.) But
> is it really worth rereading every document on every site to guard against
> this one problem?  Not to me, since it doesn't occur on the sites I read.

        Hence one of the issues I ran into this morning, in a brief
discussion with MJ Ray about this: if you encounter a URI which needs to be
encoded, you must encode it before processing it. If it is _already_
encoded and you re-encode it, you break it, because you can lose the
original semantics of the URI.

        But how do you tell whether the URI has a literal %20 in it, or a
space that was previously encoded? You can't. You can try to apply some
heuristics to determine it, but anything you try along those lines _will_
fail eventually. The only sane approach is to fetch the URI as given, and
if it doesn't return a 200 response code, apply some heuristics and fetch
again. Consider the following URL:

        http://www.gnu-designs.com/code/test in\g.html

        That is a working URL, though not properly encoded here for this
example. If that URL were stuffed into a current browser, it should try to
translate it into:

        http://www.gnu-designs.com/code/test%20in%5Cg.html

        Now, using some of the previous "bad HTML" sites as examples we've
seen in the past, fetching the unencoded URL will fail, so you can apply
some basic heuristics to change the space to a %20, and the \ to %5C, and
try again, which will work.
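        As a sketch of that re-encode step (Python here purely for
illustration; the function name is mine), note that a blind re-encode is
exactly how an already-encoded %20 gets broken:

```python
from urllib.parse import quote

def reencode_url_path(path):
    """Percent-encode the characters in a raw path that a browser would
    translate, leaving '/' alone.  Because it re-encodes blindly, a
    literal '%' already in the path gets double-encoded -- exactly the
    ambiguity described above."""
    return quote(path, safe="/")

# The unencoded example path from above:
print(reencode_url_path("/code/test in\\g.html"))
# -> /code/test%20in%5Cg.html

# An already-encoded path gets broken by the same pass:
print(reencode_url_path("/already%20encoded.html"))
# -> /already%2520encoded.html
```

        Which is why you can only safely apply this after the as-given
fetch has already failed.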

        But how do you know whether the backslash in a URI is part of a
"filename", or a "directory" mistyped with backslashes containing a "file"
called 'g.html'? (We all know there is no such thing as "files" or
"directories" in an HTTP request, but for the purposes of explanation here
I'll use those monikers.)

> We don't need to support it fully, but we should degrade as gracefully as
> possible, so users don't lose the rest of the site.

        I agree, but we may still fail on things we've never encountered
before, and fixing/updating the distillers/parsers to cope with them
shouldn't require patches to the parser (in my opinion). The parser itself
should be extensible enough to handle a pseudo-template that can dictate
how to treat those characters on a per-site basis.

        For example, one site I've found to be using backslashes as path
separators is http://www.cnsnews.com/cell.asp. Look at the "Full Story"
links on that page: all backslashes (and relative links too). In that
case, translating the backslash to a forward slash before making the
request would make sense.

        If you were to take the backslash in the "test in\g.html" example
above and translate it into a forward slash, it would no longer be a valid
request. A chicken-and-egg problem, of sorts.
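        The per-site pseudo-template idea could be as small as a rule
table keyed on host (a Python sketch; the rule and the example path below
are illustrative assumptions, not shipped configuration):

```python
from urllib.parse import urlparse

# Hypothetical per-host fixups, applied before fetching.  Only hosts
# known to need a transform get one; everything else passes through.
SITE_RULES = {
    "www.cnsnews.com": lambda path: path.replace("\\", "/"),
}

def apply_site_rules(url):
    """Apply any per-host path fixup before making the request."""
    parts = urlparse(url)
    fix = SITE_RULES.get(parts.netloc)
    if fix is None:
        return url
    return parts._replace(path=fix(parts.path)).geturl()

print(apply_site_rules("http://www.cnsnews.com/news\\story.asp"))
# -> http://www.cnsnews.com/news/story.asp
```

        Keeping the rules in data rather than code is the point: a new
broken site means a new table entry, not a parser patch.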

> I haven't seen avantgo://.

        I don't recall the site that was using it, but it does exist. Also,
take a close look at http://slashdot.org/palm/. They are using a
double-slash relative URL for the top banner image on the page, which is
also invalid, since it actually is NOT a relative URL there, but a URL
missing the 'http:' in front of it. This too must be compensated for. You
can see how many of these cases can come up in the course of scraping the
entire internet for content.
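        For what it's worth, a stdlib URL resolver compensates for the
double-slash case by borrowing the scheme from the base URL (a Python
sketch; the exact banner image path below is hypothetical):

```python
from urllib.parse import urljoin

base = "http://slashdot.org/palm/"
# A scheme-relative reference like the banner image described above
# (the image host and filename here are made up for illustration):
ref = "//images.slashdot.org/banner.gif"

print(urljoin(base, ref))
# -> http://images.slashdot.org/banner.gif
```

        So a scraper that resolves every reference against the page's own
URL before fetching absorbs this case without special handling.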

> pods was once just a tiny subset of javascript for things like home and
> back.  We could have handled it; I ignore it since it doesn't add value
> for me, and ignoring it causes no problem.

        Again, I simply translate it with URI, as I've shown before in a
snippet that turns http:// into plucker:// and vice versa. Easy to do.
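        The same scheme swap is easy in any language with a URL class; a
Python equivalent of that idea might look like this (the helper name is
mine, and the original snippet uses Perl's URI module):

```python
from urllib.parse import urlparse, urlunparse

def swap_scheme(url, old, new):
    """Rewrite the URL scheme if it matches, leaving the rest intact."""
    parts = urlparse(url)
    if parts.scheme != old:
        return url
    return urlunparse(parts._replace(scheme=new))

print(swap_scheme("http://example.com/page.html", "http", "plucker"))
# -> plucker://example.com/page.html
```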

> pods is now a much fuller language, effectively a library call. This would
> be harder to support, but I've never seen the full functionality in use on
> the sites I visit.  (I think there was one site that used the
> add-to-schedule functionality.)

        My opinion is that we should completely ignore this, since browsers
don't support it, nor should we (to quote the logic of some of the previous
posters on the previous thread).

> MS extensions can generally be treated as unknown tags.

        In most cases, MS-HTML is close enough to be rendered mostly
right, but it still contains junk. Do you ignore everything inside the
"invalid" tags, render the tags and their content as text, or render just
the text, ignoring the invalid "tags" themselves? What rulesets do you
apply? Do you let the user decide?

> Improper nesting is a pain.  Ideally, we should handle it better than we
> do, but I agree that this could be a time sink.

        There are ways to fix this, but it requires a bit more logic and
post-processing (and XML/SGML foo'ing at the document tree level).

> I do think it would be useful to say "x pages, y kilobytes, z problems"
> and to pop up a warning if the size if there are problems, or the size is
> very different from expected.

        Size of what? What I'm doing is taking the general request and
running it through a series of "carwash" functions that remove the stuff
we don't use right now: entire javascript blocks, style tags and values,
font face values, comments, and unnecessary spacing. I am left with a page
substantially smaller than the original (in most cases), and then I run it
through HTML::Clean and libxml, which turns it into (mostly) validated
content, fixing improperly nested tags and adding missing alt attributes
and quotes where they belong. It's worked remarkably well for me so far.
For example, palmgear.com's main page is riddled with over 20k of this
unnecessary HTML, spacing and javascript. Running it through my
clean_html() sub results in the following:

        Length before.....: 79,398 bytes
        Length after......: 58,419 bytes
        Total difference..: 20,979 bytes

        That's substantial, especially considering you have to store that
content in memory (or on disk) and post-process it later as you convert it
to Plucker format. It's not perfected yet, but it definitely makes it
easier to parse the content, because now entities are properly encoded,
tags are (mostly) properly nested, everything is properly quoted, and the
content is much smaller to manage in memory.
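        For the curious, the "carwash" idea reduces to a handful of strip
passes. A rough Python sketch (this clean_html is my toy version, not the
actual sub, which also leans on HTML::Clean and libxml for the real
validation work):

```python
import re

def clean_html(html):
    """Strip script blocks, style blocks, comments, and collapse runs
    of whitespace.  A real cleaner does far more (entity encoding,
    nesting fixes, attribute quoting)."""
    html = re.sub(r"(?is)<script\b.*?</script>", "", html)
    html = re.sub(r"(?is)<style\b.*?</style>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"[ \t]+", " ", html)      # collapse spaces/tabs
    html = re.sub(r"\n{2,}", "\n", html)     # collapse blank lines
    return html

page = ("<html><!-- ad -->\n<script>var x=1;</script>\n\n"
        "<p>Hello   world</p></html>")
print(clean_html(page))
print(len(page), "->", len(clean_html(page)), "bytes")
```

        Even a naive pass like this accounts for a healthy fraction of
the byte savings shown above on script-heavy pages.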

> The python distiller can put out that information, but the desktop doesn't
> display it or act on it. There is nothing anywhere to pop up warnings that
> the pluck should be checked before going home.

        You could just have an error.log created for that channel, and if
the desktop component sees an error log, pop up a dialog with the log in it.
"Errors encountered during this fetch were as follows..." or some such. I
don't use the desktop tools, so I'm not sure of the capabilities there.
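        A minimal sketch of that desktop-side check (Python; the
'error.log' filename and per-channel directory layout are assumptions
taken from the suggestion above, not existing Plucker behavior):

```python
import os

def check_channel_errors(channel_dir):
    """After a fetch, return the channel's error log contents if the
    log exists and is non-empty, so the desktop can show a dialog;
    return None when there is nothing to report."""
    log = os.path.join(channel_dir, "error.log")
    if os.path.exists(log) and os.path.getsize(log) > 0:
        with open(log) as f:
            return "Errors encountered during this fetch:\n" + f.read()
    return None
```

        The fetcher writes the log; the desktop only has to test for its
presence, which keeps the two components decoupled.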

> I also haven't seen a good way to see what it plucked before syncing,
> though I think there may be viewers out there - just not included with the
> main package.

        I'm not sure what you mean here. You mean write the files to disk,
before they are concatenated into the final .pdb file? Doesn't the Python
parser still support the caching of files to disk?


d.

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
