[ https://jira.duraspace.org/browse/DS-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27758#comment-27758 ]

Ivan Masár commented on DS-1138:
--------------------------------

As a post-mortem:
* I found that this still doesn't filter out URLs like 
http://example.com/handle/123456789/123/discover. I didn't find any syntax in 
the Robots Exclusion Standard that would allow us to block those (see the 
robots.txt sketch at the end of this comment).
* I verified with the operator of http://repositories.webometrics.info/ that 
this change will negatively affect the size parameter (a significant component 
of the rank) of all DSpace installations listed on that site. However, while 
he admitted that these various index pages artificially inflated the rank of 
all DSpace instances, he refused to consider any changes to their ranking 
algorithm. Instead, he proposed something else:

"So, the answer is not to blocking robots or any other strange action but to 
enrich the individual record page. How?

- Increasing the number of times the title (usually the entry point for any 
researcher) is mentioned in the record
- including the (full) title in the Google preferred tags: <TITLE> and URL"
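
For illustration, roughly what that enrichment could look like on an item 
page. The markup below is only a sketch with made-up values, not what the 
XMLUI themes actually emit; it just repeats the full item title in the places 
crawlers weigh most:

  <head>
    <!-- full item title in the <TITLE> tag, as the operator suggests -->
    <title>An Example Item Title - Example DSpace Repository</title>
    <!-- repeating the title in metadata reinforces it further (illustrative) -->
    <meta name="DC.title" content="An Example Item Title" />
  </head>
  <body>
    <!-- ...and again as the main page heading -->
    <h1>An Example Item Title</h1>
  </body>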

IMHO we can't do much better in this regard, and the robots.txt approach of 
removing the extra pages is the right one. I personally stopped considering 
the Ranking Web of Repositories a relevant ranking, because its algorithm 
depends strongly on quantity, and quantity can be artificially inflated, which 
is exactly what DSpace has been unintentionally doing with Discovery in 1.7-1.8.
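
For the record, a sketch of the robots.txt rules in question, plus a possible 
workaround for the handle-scoped discover pages. The "*" wildcard is a 
non-standard extension honoured by some major crawlers (e.g. Googlebot, 
Bingbot) but not part of the Robots Exclusion Standard, so it is left 
commented out here and coverage is not guaranteed:

  User-agent: *
  # standard prefix matching covers the site-wide pages
  Disallow: /discover
  Disallow: /search-filter
  # non-standard wildcard, honoured only by some crawlers; uncomment with care
  # Disallow: /handle/*/discover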
                
> robots.txt
> ----------
>
>                 Key: DS-1138
>                 URL: https://jira.duraspace.org/browse/DS-1138
>             Project: DSpace
>          Issue Type: Bug
>            Reporter: Ivan Masár
>            Assignee: Tim Donohue
>             Fix For: 3.0
>
>
> By default, robots.txt in XMLUI allows indexing of all content. This leads to 
> indexing of all browse, search and discovery pages. Search engines then mostly 
> return results pointing to these lists of results instead of the actual items. 
> I suggest disallowing the following pages by default:
> User-agent: *
> Disallow: /discover
> Disallow: /search-filter
> Note that the current robots.txt contains this message:
> # Uncomment the following line ONLY if sitemaps.org or HTML sitemaps are used
> # and you have verified that your site is being indexed correctly.
> # Disallow: /browse
> Since all items should be accessible via the browse pages in the 
> community/collection structure, /browse pages should remain allowed by default 
> so that spiders can explore the whole repository. But /discover and 
> /search-filter are clearly redundant and only clutter the search results.
