I wonder what people on this list think of this work on "robust hyperlinking"
by Phelps and Wilensky, which keeps track of abandoned URLs by means of
document signatures. How practical is this as a solution to the problem at
hand? See also:

http://www.cs.berkeley.edu/~phelps/Robust/robust.html

See below:

Brian Ulicny, PhD
iSail Solutions
Lernout & Hauspie Speech Products
52 Third Ave
Burlington MA 01803
USA




   Researchers work to eradicate broken hyperlinks
   By Evan Hansen
   Staff Writer, CNET News.com
   March 7, 2000, 4:00 a.m. PT

   Researchers at the University of California at Berkeley say they have
   come a step closer to solving a frustrating problem familiar to most
   Web surfers: the broken hyperlink.

   In a recent academic paper, computer scientists Thomas A. Phelps and
   Robert Wilensky outlined a way to create links among Web pages that
   will work even if documents are moved elsewhere. Although researchers
   have tried to tackle the issue before, Internet search experts said
   the paper describes a potentially elegant solution to a widespread and
   long-recognized puzzle.

                    "It's a pretty clever way of dealing with a very
difficult problem," said Ron Daniel,
                    who once worked on an alternative solution that has
been submitted to the Internet
                    Engineering Task Force, an online standards body.

   A key feature of the Web is its ability to take readers instantly to
   related documents through hyperlinks. Some consider it the soul of the
   medium. But as many as one in five Web links that are more than a year
   old may be out of date, according to Andrei Broder, vice president of
   research at search engine AltaVista. When surfers click on such links,
   they get a "404 error" message.

   "The rate of change on the Web is very fast," he said. "And the more
active a Web site is, the quicker it
   changes."

   In their paper, Phelps and Wilensky say the preliminary results of
   their research indicate that the vast majority of documents on the Web
   can be uniquely identified based on a small set of words that no other
   document shares. This set of words can be used to augment the standard
   URL (Uniform Resource Locator), or Web address, and turn up the page
   if it goes missing.
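
(A rough sketch, in Python, of what such an augmented address might look
like. The "lexical-signature" parameter name and the sample terms are
illustrative guesses on my part, not necessarily the exact syntax the paper
proposes.)

    # Sketch: tack the identifying words onto an ordinary URL so that a
    # browser or proxy can fall back on them if the address stops working.
    from urllib.parse import quote

    def robust_url(url, signature_terms):
        # Percent-encode each term and join them into one compact signature.
        sig = "+".join(quote(term) for term in signature_terms)
        separator = "&" if "?" in url else "?"
        return f"{url}{separator}lexical-signature={sig}"

    # Made-up example terms, just to show the shape of the result.
    print(robust_url("http://www.cs.berkeley.edu/~phelps/Robust/robust.html",
                     ["multivalent", "lexical", "signature", "robust", "hyperlink"]))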

   One of the things that makes the research interesting, Wilensky said,
   is the low number of terms required.

   "It takes about five words to uniquely identify a page if you pick
the words cleverly and the page is still out there
   somewhere," he said.

   If a document's URL changes, a search engine could be employed to
   automatically locate the missing page based on the five terms.
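
(A minimal sketch of that fallback: try the address first, and only on
failure hand the terms to a search engine. The search() helper here is a
hypothetical stand-in for whatever search service is available.)

    # Sketch: resolve a robust link.  If the URL still answers, use it;
    # otherwise search on the signature terms and take the top hit.
    import urllib.request
    from urllib.error import HTTPError, URLError

    def resolve(url, signature_terms, search):
        # search() is a hypothetical callable: query string -> list of URLs.
        try:
            with urllib.request.urlopen(url) as response:
                if response.status == 200:
                    return url
        except (HTTPError, URLError):
            pass  # broken link: fall through to the signature search
        hits = search(" ".join(signature_terms))
        return hits[0] if hits else None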

   "What makes this possible is that you already have a search engine
infrastructure," said Wilensky, a professor
   of computer science at UC Berkeley, who gave most of the credit for
the work to Phelps, a postdoctoral
   student. "You're 'bootstrapping' onto something that's already been
built."

   Wilensky also noted that the system would rely primarily on Web
   publishers, rather than on a third-party administrator, an issue that
   had become a hurdle for some other plans.

   AltaVista's Broder concurred that the results of the research were
   promising, reflecting similar research he has conducted on "strong
   queries"--or complex searches--in which he found that any document can
   be uniquely identified using eight carefully selected terms.

   "The trick is to find the right formula of rare words that are also
important to the meaning of the document," he
   said.
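
(One way to read that formula is as a TF-IDF-style score: weight each word
by how often it appears in the page and by how rare it is across the rest of
the index. The sketch below assumes a document_frequency() source of
collection statistics; the weighting is an illustrative guess, not Broder's
or the paper's published method.)

    # Sketch: pick signature terms by balancing importance within the page
    # against rarity across the collection.  document_frequency() is assumed
    # to report how many indexed pages contain a given word.
    import math
    import re
    from collections import Counter

    def lexical_signature(text, document_frequency, total_docs, k=5):
        words = re.findall(r"[a-z]+", text.lower())
        term_freq = Counter(words)
        def score(word):
            df = max(document_frequency(word), 1)
            idf = math.log(total_docs / df)    # rarer words score higher
            return term_freq[word] * idf       # frequent-in-page words score higher
        return sorted(term_freq, key=score, reverse=True)[:k]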

   But Broder warned that the procedure carries the risk that selected
   words may later be edited out of the document, rendering the
   identifier moot. For example, he said, in the Phelps and Wilensky
   paper, the authors used a misspelling, "peroperties," as an
   identifying term for their paper.

   He said the most promising element of the work was the fact that it is
   compatible with existing systems.

   "There is a chicken-and-egg problem involved," he said. "None of the
big players will adopt (this kind of system)
   until a lot of people start using it."

   Daniel, who said he has given up active research on the problem in
   part because of a lack of commercial interest in his work, said Phelps
   and Wilensky may have hit on a way to solve two parts of a three-part
   problem: determining an identifier and establishing how the identifier
   will be linked to a document over the long haul.

   But Daniel said they haven't figured out what to do with pages that
   are deleted from the Web altogether.

   "Storage is an interesting issue," he said, adding that intellectual
property concerns and rights management
   could become an issue down the road. "At some point perhaps libraries
will evolve into taking an active role in
   indexing pages. But that will depend on publishers giving out the
necessary licensing." Get the Story in "Big
   Picture"






Tim Bray <[EMAIL PROTECTED]> on 03/08/2000 10:30:31 PM

Please respond to Internet robots discussion <[EMAIL PROTECTED]>

To:   [EMAIL PROTECTED]
cc:    (bcc: Brian Ulicny/USER/US/LHS)
Fax to:
Subject:  Re: What happens once robots are barred?



At 11:59 AM 3/8/00 -0800, Mark Bennett wrote:

>* It should also keep track of "orphan" pages - pages that are still
>accessible via the direct URL, but are no longer linked-to by other pages on
>the site.
>
>I believe all 3 classes of pages should be removed from the index.
>
>The third item is an interesting one.  I know some spiders do NOT realize
>that pages are no longer "linked to" and keep indexing them.

When you're indexing a web *site* (i.e. you don't care about anything
outside the web site), this is sensible.  When you're trying to build a
large-scale index of the whole web, it gets more complex.  If a site has ever
been announced to the outside world, the assumption is that it may have been
linked to from elsewhere; the publishing of a page *should* represent a commitment
on the part of the publisher to maintain it.  If the page needs to be removed,
merely removing links to it is violently unsatisfactory since there is no way
an incoming link from outside can know that it's now an orphan.  So such pages
are a live part of the web until removed.  -T.
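
For the site-local case Mark describes, one rough way a spider could spot
orphans is to compare everything it has indexed against the link targets it
saw on the latest crawl. A sketch, assuming the crawl has already produced a
map from each page to its outgoing links:

    # Sketch: flag pages that are still in the index but that no crawled
    # page on the site links to anymore.  outgoing_links maps each crawled
    # URL to the set of URLs it links to; entry_points are pages reached
    # directly (the home page, sitemap entries, and so on).
    def find_orphans(indexed_urls, outgoing_links, entry_points):
        linked = set(entry_points)
        for targets in outgoing_links.values():
            linked.update(targets)
        return set(indexed_urls) - linked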
