Jack Campin comments:
|
| Where you *are* going to get a problem is if the input file uses a
| mixture of linebreak characters and HTML tags to indicate ABC line
| ends.  Could anybody really be that stupid?...  er, well...

In fact, I've seen ABC embedded in HTML by using the <pre>...</pre> tags, and then putting <br> at the end of each line. The <pre> tag means to preserve whitespace (including CR and LF) exactly, while still processing tags in the data. The <br> tag means to generate a line separator. So such text is explicitly double spaced. The writer may not have understood this, but software on the receiving end can't yet read the sender's mind.

| > An emerging requirement for HTML is that ALL tags be paired with
| > an <TAG ON> </TAG OFF>.
|
| Really?  I thought </P> was deprecated?

Heh. This is a case of the old quip that the nice thing about standards is that we have so many to choose from. You'll find some very confusing comments on this in the W3C HTML standards, depending on just which one you happen to read. And some commercial vendors (most notably Microsoft, but they're not the only culprits) show a great deal of contempt for the official standards. In any case, if you're writing a search program, you have to deal with what's out there, not what the standards may say.

A contrary pressure is coming from the growing tendency to use XML, and that standard makes an unambiguous statement that closing tags are always required. There is the shorthand that combines the opening and closing tags, as in <br/>, but that doesn't really affect much.

The new Fiddler's Companion is an interesting case. If you look at http://www.ibiblio.org/fiddlers/AA_ABEL.htm, for example, you'll not only see some HTML loaded down with style information (some of which is off in other files), but you'll also see several <xml> ... </xml> sections. So there are at least three different markup schemes in use here. The header says it was generated by Microsoft Word 10, so lotsa luck finding any software except Microsoft's that can correctly decode it. The fact that different browsers display the ABC with different spacing is not at all surprising.

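For the <pre>-plus-<br> case described above, here is a rough Python sketch of how a search program might fold each <br> together with the physical line break that follows it, so the ABC comes out single spaced. The function name `normalize_line_ends` and the regular expression are made up for this illustration; this is not code from any existing ABC tool, and it assumes the text has already been extracted from inside the <pre> block.

```python
import re

# A <br> tag (also <br/> or <br />), optional trailing spaces or tabs,
# and at most one physical line break (CRLF, CR, or LF) right after it.
_BR_WITH_NEWLINE = re.compile(r"<br\s*/?>[ \t]*(?:\r\n|\r|\n)?", re.IGNORECASE)


def normalize_line_ends(pre_text: str) -> str:
    """Turn each '<br>' + line-break pair into a single LF (hypothetical helper)."""
    # Fold every <br> together with the line break that follows it, if any.
    text = _BR_WITH_NEWLINE.sub("\n", pre_text)
    # Convert whatever CRLF or bare CR line ends remain to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    sample = "X:1<br>\r\nT:Tune<BR/>\nK:D\r\n"
    print(repr(normalize_line_ends(sample)))
```

Only one line break is swallowed after each <br>, so a blank line that the writer put in on purpose (say, between tunes) still survives. Other tags inside the <pre> block are left alone.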
Actually, this reminds me of a discussion that I've seen in some other fora: Microsoft has received a US patent on some of the XML encodings generated by Word. This may not matter much yet outside the US, though Europe is probably going to enact similar laws shortly. In the US, decoding such files with software not licensed by Microsoft is not only a patent infringement; it is also a DMCA violation. As such, it is a federal felony, and can get you a 5-year jail sentence and a $500,000 fine.

Since the above file was encoded with authorized software, Andrew is probably safe from prosecution. But anyone who reads it inside the US with non-Microsoft software could well be committing a felony. It's only been a few months since MS got the patent, and they haven't yet prosecuted anyone. But you might wonder why they applied for the patent if they don't intend to enforce it.

There has been a suggestion that non-MS software add a check for MS-Word docs. What it would probably do is pop up a little window with a warning, and ask you if you want to accept the legal risk of reading the contents. If you click Yes, that would probably exonerate the authors of the software, since it would then be your decision to violate the MS patent.

I'm considering adding a check for such headers to my search bot, and simply aborting the processing of any such file, to keep myself out of a federal prison.

Isn't "Intellectual Property" a fun topic?
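
The header check mentioned above might look something like the following Python sketch. It assumes that Word-exported HTML announces itself near the top of the file with a <meta name=Generator ...> tag whose content mentions "Microsoft Word"; the exact form of that tag is an assumption here, and the names `looks_like_word_export` and `scan_chars` are invented for the example.

```python
import re

# Assumed header form: <meta name=Generator content="Microsoft Word 10">
# The attribute values may or may not be quoted. A tag that puts content=
# before name= would not match this pattern.
_WORD_GENERATOR = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']?generator[\"']?[^>]*"
    r"\bcontent\s*=\s*[\"']?[^\"'>]*microsoft\s+word",
    re.IGNORECASE,
)


def looks_like_word_export(html: str, scan_chars: int = 4096) -> bool:
    """Guess whether a page was generated by MS Word (hypothetical helper)."""
    # Only the start of the file is scanned, since the tag sits in the header.
    return _WORD_GENERATOR.search(html[:scan_chars]) is not None


if __name__ == "__main__":
    page = '<html><head><meta name=Generator content="Microsoft Word 10"></head>'
    if looks_like_word_export(page):
        print("Word-generated file, skipping")
```

A bot could call this on each fetched page and skip the file when it returns True. It is only a heuristic: a file with the generator tag stripped out would pass through unnoticed.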