[
https://issues.apache.org/jira/browse/MAHOUT-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102502#comment-13102502
]
Dan Brickley commented on MAHOUT-804:
-------------------------------------
Found some Mahout backstory,
http://lucene.472066.n3.nabble.com/Old-Site-td1406298.html
This cites
https://cwiki.apache.org/CWIKI/#Index-Canweusetheautoexportsiteaspartofourmainwebsite%3F
which in turn suggests that auto-exported sites will be disabled around Nov
2011:
"Projects who do currently use Confluence as a CMS and then use the
AutoExported tool to create a website should make plans to convert to another
publishing tool, such as the new CMS system. It is estimated that crontab jobs
containing rsyncs of autoexport sites will be removed around November 2011
(Likely during ApacheCon)"
"Can we use the mirror of the autoexport site as our main web site?
Yes (until Around November 2011). You can either name the home page for your
site "index", so that it will load by default, or add a index.html to the root
website folder that will redirect to the export folder. BUT note the caveats
above, no more new autoexported websites effective 1/11/2011 and existing sites
to be phased out by November 2011."
Since mahout.apache.org links for example to
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms and
https://cwiki.apache.org/confluence/display/MAHOUT/Overview ... (rather than
the auto-exported versions), is that a problem?
The page also has a note at the top saying, "The first rule of CWIKI is don't
link to CWIKI! This Confluence site is autoexported to HTML. Please link only
to the exported pages. Do not link directly to the wiki! The autoexport
includes live links to allow easy editing of pages. By linking to the
autoexport, we can scale the site for everyone's benefit."
Note that if I google a distinctive phrase from Overview, e.g. "Scalable to
reasonably large data sets. Our core algorithms for clustering, classfication
and batch based collaborative filtering are implemented on top of Apache Hadoop
using the map/reduce paradigm."
... the search results point only to the auto-exported .html, not into the
Confluence wiki. This seems to be down to the robots exclusion protocol; see
https://cwiki.apache.org/robots.txt:
"User-agent: *
Disallow: /confluence/" [...]
If auto-exports are to be turned off, that rule might need reconsidering.
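The effect of the robots.txt rule quoted above on the two URL styles can be
checked with a minimal Python sketch. It assumes only the two quoted lines
(the "[...]" remainder of the file is not reproduced), and the helper name
is_crawlable is made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Only the two rules quoted from https://cwiki.apache.org/robots.txt;
# whatever followed in the "[...]" part is deliberately left out.
QUOTED_RULES = [
    "User-agent: *",
    "Disallow: /confluence/",
]


def is_crawlable(url: str, agent: str = "*") -> bool:
    """Return True if the quoted rules let `agent` fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(QUOTED_RULES)  # parse in-memory lines, no network access
    return parser.can_fetch(agent, url)


wiki_url = "https://cwiki.apache.org/confluence/display/MAHOUT/Overview"
export_url = "https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html"

print("wiki page crawlable:  ", is_crawlable(wiki_url))
print("export page crawlable:", is_crawlable(export_url))
```

Under these two rules the Confluence-style URL is blocked and the
auto-exported .html URL is not, which matches the search results described
above.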
So, in summary:
* Google only links to the auto-exported .html
* 'first rule of cwiki is don't link to cwiki'
* mahout.apache.org links to the cwiki (Confluence) version
* cwiki HTML uses rel=canonical to mark itself as the canonical version;
however, this is stripped from the auto-export
* the main cwiki site is excluded from search engine indexes
* notes in the cwiki admin wiki suggest auto-exports will soon be turned off
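
The rel=canonical point in the summary can be checked mechanically. A minimal
Python sketch, assuming the page HTML is already in hand: the two sample
strings are made-up stand-ins (the first reuses the <link> tag quoted in the
issue, the second has it stripped), not fetched pages, and find_canonical is a
hypothetical helper name:

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in `html`, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Stand-in for a Confluence page: carries the tag quoted in the issue.
wiki_html = (
    '<html><head><link rel="canonical" '
    'href="https://cwiki.apache.org/confluence/display/MAHOUT/'
    'SVD+-+Singular+Value+Decomposition"></head><body></body></html>'
)
# Stand-in for an auto-exported page: no canonical link at all.
export_html = "<html><head><title>SVD</title></head><body></body></html>"

print(find_canonical(wiki_html))
print(find_canonical(export_html))
```

Running this over real copies of both page styles would show whether the
export still drops the tag.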
> Each page in Mahout's Confluence Wiki has 2 URLs, with differing page styles
> and search behaviours
> --------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-804
> URL: https://issues.apache.org/jira/browse/MAHOUT-804
> Project: Mahout
> Issue Type: Improvement
> Components: Website
> Reporter: Dan Brickley
> Priority: Trivial
> Labels: atlassian, confluence, wiki
>
> There are two styles of URL in circulation for pages in Mahout's Wiki
> (presumably an Apache-wide configuration issue):
> https://cwiki.apache.org/MAHOUT/svd-singular-value-decomposition.html vs
> https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition
> They appear to be the self-same confluence 3.4.9 installation (or its raw
> filetree). Each has a different search box at the top of the page. The
> version with 'confluence/' in the path does a confluence search, and returns
> similar URLs as results. The one with '.html' suffixes does a
> domain-constrained Google search.
> Despite markup canonicalising the confluence variant, i.e. <link
> rel="canonical"
> href="https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition">
> appearing in the confluence pages, it seems the Google search results
> typically throw people into the other version of the Wiki site.
> This is all mildly confusing, mildly annoying but overall mostly harmless. It
> could be having some negative impact on google rank & suchlike, since
> incoming links will be split between the two styles. Maybe this could be
> passed along to the Wiki admins?
> Which version's URLs does the Mahout team consider canonical (for external
> links etc.)?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira