[Robots] Re: Anti-thesaurus proposal

thomas.kay Tue, 29 Jan 2002 07:35:24 -0800


Klaus & Solon, thanks again for the feedback.

Recapping the main thought lines:   

Checking for "*[no]index-[no]follow*" patterns in the ID attribute of the 
elements in resource markup.

We all would like a way to help robot agents make informed decisions at the 
element levels of a resource.

This led to an investigation of whether the terms used in robots.txt can be 
used at the element level to serve a similar function.

As it applies to recognized patterns in robots.txt, I proposed building on 
the strengths of the character patterns "index", "noindex", "follow" and 
"nofollow", and on their meaning, in the element identification process that 
isolates every part of a resource.

I am involved with deploying web content management systems (WCMS), multiple 
index search and retrieval (ISR) engine systems, and metadata enhancement 
systems.  I am data modelling the integration paths of the deployed systems 
at their data layers to spotlight untapped building block areas that could be 
exploited. 

Question: 
   Is it viable to show WCMS the possible uses for inserted or detectable 
patterns within the uniqueness of an ID assignment, a pattern so recognizable 
as to allow all MISR robots a common element level decision process?       

From the anti-thesaurus proposal we see suggestions that it would be useful 
to have a way to allow robot decisions at the element level.  The patterns 
in robots.txt may provide a solution path that robot creators can agree to 
use at the element level.  Web content management systems (WCMS) and other 
assembly tools would then have a method to map out areas that are rich in 
meaningful content and meaningful links to related content.  Areas that are 
devoid of meaningful content or references can also be mapped out.  The 
remaining areas are not identified to external robot systems, so for those 
areas the decision handling defaults to whatever the robot's current decision 
process is for that element type.

Each markup element has an ID assigned for that resource's own unique use, 
yet those element IDs can also convey knowledge to an external resource by 
way of patterns.  This is recognition of patterns within the ID strings of 
markup elements.  It is recognition that is outside and independent of the 
primary local use of the unique ID itself.  It is an ID pattern that a robot 
can use to make a decision.  The decision concerns the element identified by 
the unique ID as a whole, yet it is based on the common pattern within the ID 
string.  The robot decision process is thereby extended to what each element 
actually marks up.  The robot checks the resource to see whether the 
resource's elements have been made friendly to external indexing.

Technical application logic:

The ID pattern to check for is "*[no]index-[no]follow*".
If the pattern is not found, just handle the element as you normally would.
If the pattern is found, then for each "[no]" position:
   -  "no" present ("noindex" or "nofollow"): act to exclude or demote that 
      element of the resource in that particular indexing or following 
      process.
   -  "no" absent ("index" or "follow"): act to enhance inclusion of, or 
      promote, that element in that particular indexing or following process.
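
As a rough illustration only, the decision logic above could be sketched in 
Python as follows.  The function name robot_decision and the exact regular 
expression are my assumptions, not part of the proposal: I read the "*" 
wildcards as "anything may surround the pattern", and I additionally assume 
the pattern must be set off by a non-letter (such as "-" or "_") or by the 
start or end of the ID, so that an ID like "reindex-follower" is not matched 
by accident.  IDs are assumed to be lowercase.

```python
import re

# Assumed reading of "*[no]index-[no]follow*": the pair may sit anywhere in
# the ID, bounded by the string edges or by a non-letter such as "-" or "_".
PATTERN = re.compile(r"(?<![a-z])(no)?index-(no)?follow(?![a-z])")


def robot_decision(element_id):
    """Return the index/follow decision implied by an element ID.

    Returns None when the pattern is absent, meaning the robot should fall
    back to its normal handling for that element type.
    """
    match = PATTERN.search(element_id)
    if match is None:
        return None
    # A captured "no" means exclude/demote; its absence means include/promote.
    return {
        "index": match.group(1) is None,
        "follow": match.group(2) is None,
    }


print(robot_decision("sb-global-noindex-follow"))  # {'index': False, 'follow': True}
print(robot_decision("plain-sidebar"))             # None
```

Returning None rather than a default decision keeps the "handle the element 
as you normally would" case in the hands of the robot's existing code.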

Examples:
The following are examples of unique IDs containing a pattern that could be 
exploited in robot work:

...
<div id="noindex-follow" >header nav area</div>
<div id="hd-local-index-follow" >more header nav area</div>
<div id="sb-global-noindex-follow" >side nav area</div>
<div id="sb-local-index-follow" >more side nav area</div>
<span id="ins_noindex-follow-3" >common one liner</span>
<div id="index-follow" >main text, and
<span id="ad-noindex-nofollow" >main text ad insert</span>
more high reference text
<span id="joke-noindex-nofollow"> insert the resource miner's morale booster 
of the day</span>
</div>
<span id="ad1_noindex-nofollow" >a common content sponsor one liner</span>
<div id="ft-index-follow-1" >footer nav area</div>
<div id="ft-noindex-follow-2" >more footer nav area</div>
<span id="ticker_noindex-nofollow" >a random news item</span>
<span id="ad2_noindex-nofollow" >a random service sponsor one liner</span>
...
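
To make the examples concrete, here is a minimal sketch of how a robot might 
walk such markup and keep only the text it is allowed to index.  It uses 
Python's standard html.parser module.  The class name IndexableText, the 
helper indexable_text, the regular expression, and the short list of void 
tags are all hypothetical choices of mine.  Two further assumptions are not 
stated in the proposal: an element without the pattern inherits the decision 
of its nearest marked ancestor, and the robot's normal behaviour for unmarked 
top-level content is to index it.

```python
import re
from html.parser import HTMLParser

# Assumed boundary rule: the pattern must not be glued to other letters.
PATTERN = re.compile(r"(?<![a-z])(no)?index-(no)?follow(?![a-z])")
# Tags without an end tag; skipped so the stack stays balanced (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class IndexableText(HTMLParser):
    """Collect text from elements whose ID pattern permits indexing."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # assumed default: unmarked content is indexed
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id") or ""
        match = PATTERN.search(element_id)
        if match is None:
            # No pattern: inherit the enclosing element's decision (assumed).
            self.stack.append(self.stack[-1])
        else:
            self.stack.append(match.group(1) is None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-1]:
            self.kept.append(text)


def indexable_text(markup):
    parser = IndexableText()
    parser.feed(markup)
    parser.close()
    return parser.kept
```

Fed the "index-follow" main text div from the examples, this sketch would 
keep the main text on either side of the "ad-noindex-nofollow" span and skip 
the ad insert itself.  Link following would need a second stack driven by the 
"[no]follow" half of the pattern, which is left out here for brevity.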

Assumption-check questions:

Is checking for "*[no]index-[no]follow*" patterns a simple enough decision 
tree to actually work? 

Can we build on it?  

Would anyone be willing to investigate whether the decision trees in their 
robot code libraries could be adapted to detect and use the pattern discussed? 

If you feel any of this has workable merit, is it worth an RFC-style 
submission for the benefit of all who wish to comment?

Future leverage:
Are we curious enough to seek comments from those creating the semantic web, 
given that element recall, as opposed to resource recall, is very 
interesting?  I am curious to find out how to attach trust to the ID patterns 
of elements: just enough trust to recall those elements and work with them in 
a "robots.txt" way.  Our curiosity may supply just enough method to help 
build bridges toward a semantic web.

-Thomas Kay

"Klaus Johannes Rusch" wrote:  

>The id attribute is defined as an ID, as the name implies, so it must be
>unique, so this cannot be used to mark areas of the page indexable.

"Solon Edmunds" wrote: 
> Use <span> instead of <div> because the use of div will affect page
>formatting, whilst span will not
---------- Original Text ----------

From: "Klaus Johannes Rusch" <[EMAIL PROTECTED]>, on 29/11/2001 11:41 AM:

Solon Edmunds wrote:

> So has anyone seen/done anything like....
> 
> <div id="robots-txt-noindex-follow" class="robots">
> {headers/footer/sidebars} </div>
> 
> <div id="robots-txt-noindex-nofollow" class="robots">
> {a banner area }
> </div>
> 
> <div id="robots-txt-index-nofollow" class="robots">
> { content for the index, but holds looping links or dynamically
> generated
> links which are best navigated via the statedataless sitemaps links. }
> </div>

The id attribute is defined as an ID, as the name implies, so it must be
unique, so this cannot be used to mark areas of the page indexable.

-- 
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/

--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of 
a message to "[EMAIL PROTECTED]".
