We could try this:

--- 

# robots.txt for http://uima.apache.org

User-agent: *
Disallow: /docs/d/
Allow: /docs/d/ruta-current/
Allow: /docs/d/uima-addons-current/
Allow: /docs/d/uima-as-current/
Allow: /docs/d/uima-ducc-current/
Allow: /docs/d/uimacpp-current/
Allow: /docs/d/uimafit-current/
Allow: /docs/d/uimaj-current/

---

Sources on the net say that "Allow" wasn't part of the original robots.txt
specification, so if we do the above, some search engines might stop indexing
the docs entirely. We might want to address a dedicated "googlebot"
user-agent group instead, since Google is known to support "Allow".
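
For example, a more defensive variant could address Googlebot explicitly and
give everything else a blanket rule (untested; same path layout as above):

---

# Googlebot is known to support "Allow", so it gets the fine-grained rules
User-agent: googlebot
Disallow: /docs/d/
Allow: /docs/d/uimaj-current/
# ... one "Allow" line per "*-current" folder, as above

# All other crawlers get a conservative blanket rule
User-agent: *
Disallow: /docs/d/

---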

Also, not all of the documentation sets use the "*-current" trick yet. But
that is easy to fix.

Cheers,

-- Richard

> On 07.04.2016, at 22:40, Richard Eckart de Castilho <[email protected]> wrote:
> 
> We can just disallow /d and then allow all the *-current folders
> under it explicitly. The only difference I see is that we'd have
> a couple more entries in the robots.txt.
> 
> -- Richard
> 
>> On 07.04.2016, at 22:36, Marshall Schor <[email protected]> wrote:
>> 
>> Hi,
>> 
>> This sounds like a good idea to me :-)
>> 
>> There's possibly one small issue with changing the folder structure. The
>> DocBook schemes have some fancy way to link between docbooks (olinks);
>> these require that the books be kept relative to one another in a fixed
>> file tree structure. As long as that's not changed, I think there will be
>> no problem.
>> 
>> If anyone's curious, the relevant bits of config info are in the
>> uima-docbook-olink project, in the various "site.xml" files. You can see
>> refs to the famous "d" folder there. There may be a dependency on the
>> "books" being just one directory layer under d/, so putting an extra layer
>> might break things (but I'm not sure...).
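>> 
>> (Just for illustration, not the actual UIMA files: an olink target
>> database mirrors the file tree with nested <dir> elements, roughly like
>> 
>>   <sitemap>
>>     <dir name="d">
>>       <document targetdoc="overview" baseuri="..."/>
>>     </dir>
>>   </sitemap>
>> 
>> so an extra directory layer would mean an extra nested <dir> in each of
>> the site.xml files.)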
>> 
>> Maybe there's a way to do this without introducing a new level in the directory?
>> 
>> -Marshall
>> 
>> On 4/6/2016 4:43 PM, Richard Eckart de Castilho wrote:
>>> Hi all,
>>> 
>>> I believe some time back we were talking about a strategy to avoid search
>>> engines pointing to ancient versions of the UIMA documentation.
>>> 
>>> I have read a bit on rel="canonical" and robots.txt.
>>> 
>>> 1) per webpage - Apparently, one can place a `link rel="canonical"` element
>>> in the head of any HTML page. Search engines seeing this tag will then not
>>> index the page itself, because it is considered a duplicate of whatever
>>> other page the link points to.
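>>> 
>>> For example, each page of an old release would carry something like this
>>> in its <head> (the href is only an illustration):
>>> 
>>>   <link rel="canonical" href="http://uima.apache.org/docs/d/uimaj-current/index.html"/>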
>>> 
>>> 2) via http header/htaccess - Since we probably don't want to patch up all
>>> our JavaDoc files, the information about a canonical source can also be
>>> sent as an HTTP header, e.g. via a suitable .htaccess file.
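>>> 
>>> For example, with mod_headers enabled, a .htaccess file in an old release
>>> folder could send something like this (untested; the target URL is only
>>> an illustration):
>>> 
>>>   <IfModule mod_headers.c>
>>>     Header set Link "<http://uima.apache.org/docs/d/uimaj-current/index.html>; rel=\"canonical\""
>>>   </IfModule>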
>>> 
>>> I guess the idea would be that for any old documentation page, we would 
>>> want it to point to its latest version as its canonical source. I mean for 
>>> every page, not only for the index page. This seems a bit tedious.
>>> 
>>> My suggestion would be an alternative that exploits the website folder 
>>> structure and uses robots.txt.
>>> 
>>> We disallow indexing of the "d" folder on the UIMA website.
>>> We place all the "*-current" folders (svn copies of the latest 
>>> documentation versions) under a dedicated folder (e.g. "d/current") and 
>>> allow indexing that.
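>>> 
>>> In robots.txt terms that would boil down to something like (assuming "d"
>>> sits directly under the site root):
>>> 
>>>   User-agent: *
>>>   Disallow: /d/
>>>   Allow: /d/current/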
>>> 
>>> In that way, the outdated versions of the documentation should be hidden 
>>> from the search engines and the respective latest versions should be 
>>> indexed.
>>> 
>>> Opinions? Does anybody have experience with SEO?
>>> 
>>> Cheers,
>>> 
>>> -- Richard
>>> 
>>> 
>> 
> 
