I wonder what people on this list think of this work on "robust hyperlinking" by Phelps and Wilensky, which tracks moved or abandoned URLs via document signatures. How practical is this as a solution to the problem at hand? See also:
http://www.cs.berkeley.edu/~phelps/Robust/robust.html

See below:

Brian Ulicny, PhD
iSail Solutions
Lernout & Hauspie Speech Products
52 Third Ave
Burlington MA 01803 USA

Researchers work to eradicate broken hyperlinks
By Evan Hansen
Staff Writer, CNET News.com
March 7, 2000, 4:00 a.m. PT

Researchers at the University of California at Berkeley say they have come a step closer to solving a frustrating problem familiar to most Web surfers: the broken hyperlink.

In a recent academic paper, computer scientists Thomas A. Phelps and Robert Wilensky outlined a way to create links among Web pages that will work even if documents are moved elsewhere. Although researchers have tried to tackle the issue before, Internet search experts said the paper describes a potentially elegant solution to a widespread and long-recognized puzzle.

"It's a pretty clever way of dealing with a very difficult problem," said Ron Daniel, who once worked on an alternative solution that has been submitted to the Internet Engineering Task Force, an online standards body.

A key feature of the Web is its ability to take readers instantly to related documents through hyperlinks. Some consider it the soul of the medium. But as many as one in five Web links that are more than a year old may be out of date, according to Andrei Broder, vice president of research at search engine AltaVista. When surfers click on such links, they get a "404 error" message.

"The rate of change on the Web is very fast," he said. "And the more active a Web site is, the quicker it changes."

In their paper, Phelps and Wilensky say the preliminary results of their research indicate that the vast majority of documents on the Web can be uniquely identified based on a small set of words that no other document shares. This set of words can be used to augment the standard URL (Uniform Resource Locator), or Web address, and turn up the page if it goes missing.
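The "small set of words that no other document shares" is what the paper calls a lexical signature. A minimal sketch of one way such terms might be chosen, ranking a page's words by corpus rarity, follows; this is a simplified stand-in for the authors' actual term weighting, and the toy document-frequency table is invented for illustration:

```python
from collections import Counter

def lexical_signature(text, doc_freq, n_terms=5):
    """Pick the n_terms words rarest across the corpus (lowest
    document frequency) that appear in this page.  Simplified
    stand-in for Phelps and Wilensky's weighting, which also
    balances rarity against relevance to the page."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    tf = Counter(words)
    # Rank candidates: fewest corpus documents first, then by
    # in-page frequency (more frequent first) to break ties.
    ranked = sorted(set(words), key=lambda w: (doc_freq.get(w, 0), -tf[w]))
    return ranked[:n_terms]

# Toy corpus statistics: word -> number of documents containing it.
df = {"the": 1000, "a": 990, "of": 970, "web": 800, "link": 500,
      "uses": 300, "hyperlink": 40, "signature": 25, "robust": 12,
      "lexical": 8}

page = "the robust hyperlink uses a lexical signature of the web"
print(lexical_signature(page, df))
# -> ['lexical', 'robust', 'signature', 'hyperlink', 'uses']
```

The rarest words dominate the signature, which matches Broder's observation below that rare words make the strongest identifiers, and also his warning: a rare word (like a typo) is exactly the kind of term most likely to be edited out later.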
One of the things that makes the research interesting, Wilensky said, is the low number of terms required. "It takes about five words to uniquely identify a page if you pick the words cleverly and the page is still out there somewhere," he said.

If a document's URL changes, a search engine could be employed to automatically locate the missing page based on the five terms. "What makes this possible is that you already have a search engine infrastructure," said Wilensky, a professor of computer science at UC Berkeley, who gave most of the credit for the work to Phelps, a postdoctoral student. "You're 'bootstrapping' onto something that's already been built."

Wilensky also noted that the system would rely primarily on Web publishers, rather than on a third-party administrator, an issue that had become a hurdle for some other plans.

AltaVista's Broder concurred that the results of the research were promising, reflecting similar research he has conducted on "strong queries"--or complex searches--in which he found that any document can be uniquely identified using eight carefully selected terms. "The trick is to find the right formula of rare words that are also important to the meaning of the document," he said.

But Broder warned that the procedure carries the risk that selected words may later be edited out of the document, rendering the identifier moot. For example, he said, in the Phelps and Wilensky paper, the authors used a misspelling, "peroperties," as an identifying term for their paper.

He said the most promising element of the work was the fact that it is compatible with existing systems. "There is a chicken-and-egg problem involved," he said. "None of the big players will adopt (this kind of system) until a lot of people start using it."
Daniel, who said he has given up active research on the problem in part because of a lack of commercial interest in his work, said Phelps and Wilensky may have hit on a way to solve two parts of a three-part problem: determining an identifier and establishing how the identifier will be linked to a document over the long haul. But Daniel said they haven't figured out what to do with pages that are deleted from the Web altogether.

"Storage is an interesting issue," he said, adding that intellectual property concerns and rights management could become an issue down the road. "At some point perhaps libraries will evolve into taking an active role in indexing pages. But that will depend on publishers giving out the necessary licensing."

Get the Story in "Big Picture"

Tim Bray <[EMAIL PROTECTED]> on 03/08/2000 10:30:31 PM
Please respond to Internet robots discussion <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
cc: (bcc: Brian Ulicny/USER/US/LHS)
Fax to:
Subject: Re: What happens once robots are barred?

At 11:59 AM 3/8/00 -0800, Mark Bennett wrote:
>* It should also keep track of "orphan" pages - pages that are still
>accessible via the direct URL, but are no longer linked-to by other pages on
>the site.
>
>I believe all 3 classes of pages should be removed from the index.
>
>The third item is an interesting one. I know some spiders do NOT realize
>that pages are no longer "linked to" and keep indexing them.

When you're indexing a web *site* (i.e. you don't care about anything outside the web site), this is sensible. When you're trying to build a large-scale index of the whole web, it gets more complex. If a site has ever been announced to the outside world, the assumption is that it may have been linked to from elsewhere; the publishing of a page *should* represent a commitment on the part of the publisher to maintain it.
If the page needs to be removed, merely removing links to it is violently unsatisfactory since there is no way an incoming link from outside can know that it's now an orphan. So such pages are a live part of the web until removed. -T.