Jack Campin comments:
|
| Where you *are* going to get a problem is if the input file uses a
| mixture of linebreak characters and HTML tags to indicate ABC line
| ends.  Could anybody really be that stupid?... er, well...

In fact, I've seen ABC embedded in HTML by using  the  <pre>...</pre>
tags,  and  then putting <br> at the end of each line.  The <pre> tag
means to preserve whitespace (including CR  and  LF)  exactly,  while
processing  tags  in the data.  The <br> tag means to generate a line
separator.  So such text is explicitly double spaced.  The writer may
not have understood this, but software on the receiving end can't yet
read the sender's mind.
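
A search program can at least guess at the intent. The following is a
minimal sketch, not what any existing ABC software does: it assumes the
writer meant single spacing, so a <br> followed directly by a line break
counts as one line end. The function name pre_text_to_lines is made up
for this illustration, and the input is assumed to be the text already
extracted from between <pre> and </pre>.

```python
import re

# A <br> tag (any case, optional slash), the spaces or tabs around it,
# and at most one line break immediately following it.
BR_WITH_NEWLINE = re.compile(r"[ \t]*<br\s*/?>[ \t]*(\r\n|\r|\n)?", re.IGNORECASE)


def pre_text_to_lines(pre_body):
    """Split the body of a <pre> element into ABC lines.

    Assumption: "<br>" plus an adjacent line break is one line end,
    not two. A bare <br> or a bare line break each end a line too.
    """
    text = BR_WITH_NEWLINE.sub("\n", pre_body)
    # splitlines() accepts CR, LF and CRLF, and drops the final line end.
    return text.splitlines()


if __name__ == "__main__":
    sample = "X:1<br>\nT:Tune<br>\nK:D<br>\n"
    print(pre_text_to_lines(sample))
```

Blank lines that the writer typed on purpose, such as the one between
two tunes, survive this treatment, because only the line break directly
after a <br> is absorbed.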

| > An emerging requirement for HTML is that ALL tags be paired with
| > an <TAG ON> </TAG  OFF>.
|
| Really?  I thought </P> was deprecated?

Heh.  This is a case of the  old  quip  that  the  nice  thing  about
standards  is  that we have so many to choose from.  You'll find some
very confusing comments on this in the W3C HTML standards,  depending
on  just which you happen to read.  And some commercial vendors (most
notably Microsoft, but they're not the only culprits)  show  a  great
deal  of contempt for the official standards.  In any case, if you're
writing a search program, you have to deal with what's out there, not
what the standards may say.

A contrary pressure is coming from the growing tendency to  use  XML,
and  that  standard  makes an unambiguous statement that closing tags
are always required. There is the shorthand that combines them, as in
<br/>, but that doesn't really affect much.
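
To see the XML rule in action, here is a small sketch using the XML
parser in Python's standard library. The helper name is_well_formed is
invented for this example. An unclosed <br> is rejected, while the
<br/> shorthand and an explicit <br></br> pair are both accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for fragment in ("<p>one<br>two</p>",
                     "<p>one<br/>two</p>",
                     "<p>one<br></br>two</p>"):
        print(fragment, is_well_formed(fragment))
```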

The new Fiddler's Companion is an interesting case.  If you  look  at
http://www.ibiblio.org/fiddlers/AA_ABEL.htm,  for example, you'll not
only see some HTML loaded down with style information (some of  which
is off in other files), but you'll also see several <xml> ...  </xml>
sections. So there are at least three different markup schemes in use
here. The header says it was generated by Microsoft Word 10, so lotsa
luck finding any  software  except  Microsoft's  that  can  correctly
decode  it.   The  fact  that different browsers display the ABC with
different spacing is not at all surprising.

Actually, this reminds me of a discussion  that  I've  seen  in  some
other  fora:  Microsoft has received a US patent on some of their XML
encodings generated by Word. This may not matter much yet outside the
US,  though  Europe is probably going to enable similar laws shortly.
In the  US,  decoding  such  files  with  software  not  licensed  by
Microsoft  is  not  only  a  patent  infringement;  it is also a DMCA
violation.  As such, it is a federal felony, and can get you a 5-year
jail  sentence and a $500,000 fine.  Since the above file was encoded
with authorized software, Andrew is probably safe  from  prosecution.
But  anyone  who  reads  it inside the US with non-Microsoft software
could well be committing a felony.  It's only been a few months since
MS  got  the patent, and they haven't yet prosecuted anyone.  But you
might wonder why they applied for the patent if they don't intend  to
enforce it.

There has been a suggestion that non-MS  software  add  a  check  for
MS-Word docs.  What they'd probably do is pop up a little window with
a warning, and ask you if you  want  to  accept  the  legal  risk  of
reading the contents. If you click Yes, that would probably exonerate
the authors of the software, since it would then be your decision  to
violate the MS patent.

I'm considering adding a check for such headers to my search bot, and
simply aborting the processing of any such file, to keep myself out of
a federal prison.
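
Such a check might look something like the sketch below. It is only a
guess at the approach, not the bot's actual code. It assumes that
Word-generated HTML announces itself in a <meta name=Generator ...> tag
near the top of the file, and the names is_word_generated and
head_chars are hypothetical.

```python
import re

# Captures the content value of a generator <meta> tag, with or without
# quotes around the attribute values.
GENERATOR_META = re.compile(
    r"<meta\s+name\s*=\s*[\"']?generator[\"']?\s+content\s*=\s*[\"']?([^\"'>]*)",
    re.IGNORECASE,
)


def is_word_generated(html, head_chars=4096):
    """Guess whether an HTML page was produced by Microsoft Word.

    Assumption: the generator tag, if present, appears within the first
    head_chars characters of the page.
    """
    match = GENERATOR_META.search(html[:head_chars])
    return bool(match) and "microsoft word" in match.group(1).lower()


if __name__ == "__main__":
    page = '<html><head><meta name=Generator content="Microsoft Word 10"></head>'
    if is_word_generated(page):
        print("skipping Word-generated file")
```

A file without a generator tag, or with one naming some other program,
is processed as usual.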

Isn't "Intellectual Property" a fun topic?
