Re: [Dspace-tech] Issue about Google crawler

2010-10-05 Thread Kim Shepherd
Hi Panyarak,

It might be an idea to add /displaystats to your JSPUI's robots.txt and to
any Google Webmaster Tools robots.txt files or Page Removal Requests.
For Google to de-index pages, it generally likes to see a 404 (not found) or
a 410 (gone).
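
For example, a minimal robots.txt entry might look like the following (a sketch only; if your JSPUI is mounted under a context path such as /psukb, as in your error report, prefix the rule accordingly, since robots.txt rules are matched against the full request path):

    User-agent: *
    Disallow: /displaystats

Because Disallow rules are prefix matches, this also covers query-string URLs like /displaystats?handle=2553/929.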

Unfortunately, the servlet that handles statistics display for JSPUI throws
a NullPointerException when it is passed a handle that doesn't resolve to a
valid DSpace object. It *should* return a friendly 404 to help crawlers
like Google realise the page is gone.
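
For anyone patching by hand before DS-689 lands, the fix amounts to resolving the handle up front and sending a 404 when it doesn't map to a real object. Here's a rough sketch of that guard (illustrative only, not the literal patch; the helper class and method names are mine):

import java.io.IOException;
import java.sql.SQLException;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.dspace.content.DSpaceObject;
import org.dspace.core.Context;
import org.dspace.handle.HandleManager;

public class DisplayStatisticsGuard
{
    /**
     * Resolve the "handle" request parameter to a DSpace object.
     * Returns null after sending a 404 if the handle is missing or
     * doesn't resolve, so callers can simply bail out instead of
     * hitting a NullPointerException further down.
     */
    public static DSpaceObject resolveOr404(Context context,
            HttpServletRequest request, HttpServletResponse response)
            throws IOException, SQLException
    {
        String handle = request.getParameter("handle");
        DSpaceObject dso = (handle == null) ? null
                : HandleManager.resolveToObject(context, handle);

        if (dso == null)
        {
            // A clean 404 tells crawlers the page is gone, which is
            // exactly what Google needs in order to de-index it.
            response.sendError(HttpServletResponse.SC_NOT_FOUND);
        }

        return dso;
    }
}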

I've opened a JIRA issue for the NPE bug -
http://jira.dspace.org/jira/browse/DS-689 - and attached a patch for 1.6.2
(and trunk, and probably other 1.6.x versions) that makes sure that when
anyone (including Google) visits those pages, they see a 404 instead of an
Internal Server Error.

Hopefully this, along with /displaystats (and/or /displaystats*?) in your
robots.txt files or removal requests, will help convince Google to stop
crawling those pages.

Cheers,

Kim

On 4 October 2010 13:52, Panyarak Ngamsritragul pa...@me.psu.ac.th wrote:


 Dear all,

 A couple of weeks ago I posted questions about the Google crawler and
 sitemaps.  There was a response from Vinit, but I still have not found a
 solution to what I am experiencing.

 I am running 1.6.2 and have registered the site (kb.psu.ac.th) with
 Google's Webmaster Tools, and I have submitted the sitemaps.  Since then,
 I have repeatedly received Internal Server Error reports caused by the
 Google crawler trying to access some non-existent records.  Some of these
 records have been requested repeatedly by the crawler for more than a
 month now.

 Can anyone help me pinpoint the root cause of this problem?
 I have attached one of the error messages below.

 Thanks.

 Panyarak Ngamsritragul
 Prince of Songkla University.
 Thailand.

 -- Forwarded message --
 Date: Sun, 3 Oct 2010 18:50:06 +0700 (ICT)
 From: psukb-nore...@psu.ac.th
 To: psukb-h...@me.psu.ac.th
 Subject: PSUKB: Internal Server Error

 An internal server error occurred on http://kb.psu.ac.th/psukb:

 Date:   10/3/10 6:50 PM
 Session ID: D5E58233D9F2093B248C4CC5C65D96D1
 User:   Anonymous
 IP address: 66.249.69.1

 -- URL Was: http://kb.psu.ac.th:8080/psukb/displaystats?handle=2553/929
 -- Method: GET
 -- Parameters were:
 -- handle: 2553/929

 Exception:
 java.lang.NullPointerException
    at org.dspace.app.webui.servlet.DisplayStatisticsServlet.displayStatistics(DisplayStatisticsServlet.java:182)
    at org.dspace.app.webui.servlet.DisplayStatisticsServlet.doDSGet(DisplayStatisticsServlet.java:123)
    at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
    at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:112)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:465)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:619)


Re: [Dspace-tech] Issue about Google crawler

2010-10-05 Thread Kim Shepherd
I should point out that my robots.txt suggestions assume you don't want any
stats pages crawled at all... if that's not the case, it's probably best to
apply the patch for DS-689 and wait for Google to de-index the broken URLs
(and make the robots.txt entries more specific if there are only a few
invalid handles being requested, as in the example below).
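
For instance, a narrower rule might look like this (a sketch only, built from the handle in the forwarded error report; Googlebot honours query strings in Disallow paths as a prefix match, though that goes beyond the basic robots.txt standard, and the leading /psukb assumes your JSPUI is mounted under that context path):

    User-agent: Googlebot
    Disallow: /psukb/displaystats?handle=2553/929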

Cheers,

Kim
