[Robots] Re: Anti-thesaurus proposal

thomas.kay Wed, 30 Jan 2002 05:58:57 -0800


"Klaus Johannes Rusch" wrote:


>> patterns within the uniqueness of an ID assignment, a pattern so recognizable
<snip>
>> Is checking for "*[no]index-[no]follow*" patterns actually a simple enough
>> decision tree to actually work?

>This approach resolves the issue that IDs have to be unique, however it does 
>not address how to identify, without a DTD, which attributes actually are IDs.

I like the feedback, Klaus. Thanks.

The ID attribute name used in the original examples was inherited from the
generic attributes of HTML, where the resource's DTD was one of the
following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

See the generic attributes section of the DTD, where the ID attribute is
named "id".

I believe that many robots check for this DTD when they load a resource.
Now if a DTD does not exist for that resource, then the 'id' term that
identifies an ID is unknown, I agree.  What is the robot to do with this
resource?  Very likely the robot will make a decision similar to the one it
makes for any other resource without a DTD.  Robots routinely make decisions
due to a lack of specified DTD knowledge about a resource.  In practice
robots minimize their effort when faced with a complexity they can't confirm
(without a major clock cycle hit or a near-AI programming effort).  The DTD
lets the robot know the document-wide attribute name that holds the unique id
strings.  If the DTD is missing, it is OK for the robot to skip the extra
workload of reverse engineering the resource to match a known DTD and thereby
confirm the ID attribute names.  A robot's time is very important, overloaded
as it already is, so this amounts to the practical switch in the decision
tree that avoids low return for the work effort.  In less complex or less
complete robots, the robot could take a leap of faith (a programming
shortcut), work only on "id", and still achieve a majority of the end goal.
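
To make that decision tree concrete, here is a minimal sketch in Python.  It
is an illustration under assumptions, not a reference implementation: the
class and function names are made up for this message, the id value
"nav-noindex-follow" is a hypothetical instance of the
"*[no]index-[no]follow*" pattern, and the trust_id_without_dtd flag stands in
for the "leap of faith" shortcut.

```python
import re
from html.parser import HTMLParser

# Assumed form of the pattern: "[no]index-[no]follow" anywhere in the id value.
ID_PATTERN = re.compile(r"(no)?index-(no)?follow", re.IGNORECASE)

# Public identifiers of the two HTML 4.01 DTDs quoted above.
KNOWN_DTDS = (
    "-//W3C//DTD HTML 4.01 Transitional//EN",
    "-//W3C//DTD HTML 4.01//EN",
)


class IdPatternScanner(HTMLParser):
    """Collects (id value, index?, follow?) for ids that match the pattern."""

    def __init__(self, trust_id_without_dtd=False):
        super().__init__()
        self.trust_id_without_dtd = trust_id_without_dtd
        self.dtd_known = False
        self.directives = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>; decl is the text between "<!" and ">".
        if any(public_id in decl for public_id in KNOWN_DTDS):
            self.dtd_known = True

    def handle_starttag(self, tag, attrs):
        # Without a known DTD, skip the work unless the shortcut is enabled.
        if not (self.dtd_known or self.trust_id_without_dtd):
            return
        for name, value in attrs:
            if name == "id" and value:
                match = ID_PATTERN.search(value)
                if match:
                    index = match.group(1) is None   # no "no" before "index"
                    follow = match.group(2) is None  # no "no" before "follow"
                    self.directives.append((value, index, follow))


def scan(html, trust_id_without_dtd=False):
    scanner = IdPatternScanner(trust_id_without_dtd)
    scanner.feed(html)
    scanner.close()
    return scanner.directives
```

With the default setting a resource without one of the two DTDs yields no
directives at all; passing trust_id_without_dtd=True gives the cheaper
"only work on id" behaviour.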
 

>A "cleaner" way of implementing something like this might be using an
>attribute in a different namespace, like this:

><html xmlns:robots="http://www.robotstxt.org/2002/robots1">
>....
><div robots:robots="noindex,follow">
> header nav area
></div>     
<snip>
></html>
>
>or for easier processing, use one attribute per axis:
>
><html xmlns:robots="http://www.robotstxt.org/2002/robots2">
>...
><div robots:index="false" robots:follow="true">
>  header nav area
></div>
>...
></html>

Hmm... Exploring the possible "cleaner" namespace implementation:  What is
the effort to implement this quickly in a good percentage of existing
resource creation/handling tools?  (Any feelings on the magnitude of the
work?)  Is this new suggestion easier or harder overall than the "id" pattern
method, once the effort across all the resource handling tools is balanced
out?

The question helps find the path of least resistance to achieving the largest
part of the needed result sooner rather than later.

Given a namespace method and an "id" pattern method, which path can be taken
by all the tools involved, given their current state?  If the "id" pattern
method is selected, could the "id" pattern then be extracted later for use in
a much more rigorous namespace implementation, say as tools grow in support
of namespace methods?  On detection of the "id" pattern, the meaning is
extractable and can be converted to the namespace form.  So back we go to the
practicalities of achieving the largest part of the needed result sooner
rather than later.  Could the "id" pattern method serve as a catalyst, one
that triggers the reaction that allows element-level indexing to begin?
Where the final best form is not easily reached in a single step, two steps
can compensate for the drag that resists the change of state.
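
As a rough sketch of that later conversion step, the Python below maps an id
value carrying the pattern to the one-attribute-per-axis form Klaus showed.
The function name and the sample id value are hypothetical, and the regular
expression is only my reading of the "*[no]index-[no]follow*" pattern.

```python
import re

# Assumed form of the pattern: "[no]index-[no]follow" anywhere in the id value.
ID_PATTERN = re.compile(r"(no)?index-(no)?follow", re.IGNORECASE)


def id_to_namespace_attrs(id_value):
    """Return robots:index / robots:follow values for an id, or None."""
    match = ID_PATTERN.search(id_value)
    if match is None:
        return None  # the id carries no robot directive
    return {
        "robots:index": "false" if match.group(1) else "true",
        "robots:follow": "false" if match.group(2) else "true",
    }


# A hypothetical id value converted to the namespace form.
attrs = id_to_namespace_attrs("nav-noindex-follow")
print(" ".join(f'{name}="{value}"' for name, value in attrs.items()))
```

Because the meaning is recoverable from the id alone, a tool could run this
kind of conversion whenever namespace support becomes available, without the
author touching the resource again.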


>While syntactically correct, the semantics are still unclear, e.g.

>* what does "nofollow" mean for text

If a robot takes the time to determine whether the text includes URL or URI
information, then on detection that robot is strongly advised not to follow
them.  Where they serve as external references, those URIs may be best
ignored by the robot.  If "nofollow" is used with "index", then any example
URIs in plain (non-hypertext) text form remain retrievable.  An example would
be text such as
"http://ibm.com exists but is not a reference link in this context."  Yes/no?

>* how should a robot index and present something like
>  <span robots:robots="index">This is a <span robots:robots="noindex">top secret</span> word</span>
>
> As "This is a word"?
 
Yes.  For robots, the spanned words "top secret" are not an information
security concern but an indexing quality concern.  If that "top secret" span
were placed in the index, the quality of that index in use could be
compromised.
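
Here is a small Python sketch of how a robot might arrive at "This is a word"
for that nested example.  It assumes the comma-separated robots:robots
attribute from Klaus's first form, assumes that an element without the
attribute inherits the setting of its parent (nothing in this thread fixes
that rule), and uses made-up class and function names.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class IndexableText(HTMLParser):
    """Collects only the text that falls inside indexable elements."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # index state per open element; default is index
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        state = self.stack[-1]  # inherit from the enclosing element
        value = dict(attrs).get("robots:robots") or ""
        tokens = [token.strip().lower() for token in value.split(",")]
        if "noindex" in tokens:
            state = False
        elif "index" in tokens:
            state = True
        self.stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack[-1]:
            self.chunks.append(data)


def indexable_text(html):
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where noindex spans were dropped.
    return " ".join("".join(parser.chunks).split())
```

The sketch only handles well-nested markup where every non-void start tag has
a matching end tag; real-world tag soup would need a more forgiving stack.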

Any content security stays handled at the access level. 

-Thomas Kay
[EMAIL PROTECTED]


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
