Re: [Dspace-tech] Issue about Google crawler
Hi Panyarak,

It might be an idea to add /displaystats to your JSPUI's robots.txt and to any Google Webmaster Tools robots.txt files or Page Removal Requests. For Google to de-index pages, it generally likes to see a 404 (Not Found) or a 410 (Gone).

Unfortunately, the servlet that handles statistics display for JSPUI throws a NullPointerException when it is passed a handle that doesn't resolve to a valid DSpace object. It *should* return a friendly 404 to help crawlers like Google realise the page is gone.

I've opened a JIRA issue for the NPE bug - http://jira.dspace.org/jira/browse/DS-689 - and attached a patch for 1.6.2 (and trunk, and probably other 1.6.x versions) that makes sure that when anyone (including Google) visits those pages, they see a 404 instead of an Internal Server Error.

Hopefully this, along with /displaystats (and/or /displaystats* ?) in your robots.txt files or removal requests, will help convince Google to stop crawling those pages.

Cheers,
Kim

On 4 October 2010 13:52, Panyarak Ngamsritragul pa...@me.psu.ac.th wrote:

Dear all,

A couple of weeks ago I posted questions about the Google crawler and sitemaps. There was a response from Vinit, but I still could not find a solution to what I am experiencing.

I am running 1.6.2 and have registered the site (kb.psu.ac.th) with Google's Webmaster Tools, and I have submitted the sitemaps. Since then, I have repeatedly been receiving Internal Server Error reports caused by the Google crawler trying to access some non-existent records. Some of the records have been accessed repeatedly by the crawler for more than a month now.

Can anyone help me pinpoint the root cause of the problem? I have attached one of the error messages below.

Thanks.

Panyarak Ngamsritragul
Prince of Songkla University, Thailand.
-- Forwarded message --
Date: Sun, 3 Oct 2010 18:50:06 +0700 (ICT)
From: psukb-nore...@psu.ac.th
To: psukb-h...@me.psu.ac.th
Subject: PSUKB: Internal Server Error

An internal server error occurred on http://kb.psu.ac.th/psukb:

Date: 10/3/10 6:50 PM
Session ID: D5E58233D9F2093B248C4CC5C65D96D1
User: Anonymous
IP address: 66.249.69.1

-- URL Was: http://kb.psu.ac.th:8080/psukb/displaystats?handle=2553/929
-- Method: GET
-- Parameters were:
-- handle: 2553/929

Exception:
java.lang.NullPointerException
    at org.dspace.app.webui.servlet.DisplayStatisticsServlet.displayStatistics(DisplayStatisticsServlet.java:182)
    at org.dspace.app.webui.servlet.DisplayStatisticsServlet.doDSGet(DisplayStatisticsServlet.java:123)
    at org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
    at org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:112)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:465)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:619)
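[The DS-689 patch itself is not reproduced in this thread, but the shape of the fix can be sketched. The following is a minimal, self-contained Java illustration, not the actual DSpace code: the handle table, the statusFor method, and the constants are hypothetical stand-ins for HandleManager.resolveToObject and HttpServletResponse. The point is the null check: answer 404 for an unresolvable handle instead of dereferencing null, which surfaces as the HTTP 500 seen above.]

```java
import java.util.HashMap;
import java.util.Map;

public class HandleLookupDemo {
    static final int SC_OK = 200;
    static final int SC_NOT_FOUND = 404;

    // Stand-in for handle resolution: a null lookup result means the
    // handle does not correspond to a valid DSpace object.
    static final Map<String, String> HANDLES = new HashMap<>();
    static {
        HANDLES.put("2553/1", "a valid item");
    }

    // Mirrors the shape of the DS-689 fix: check the lookup result for
    // null and return 404, instead of using it unchecked (NPE -> HTTP 500).
    static int statusFor(String handle) {
        String dso = HANDLES.get(handle);
        if (dso == null) {
            return SC_NOT_FOUND; // previously: NullPointerException
        }
        return SC_OK;
    }

    public static void main(String[] args) {
        System.out.println(statusFor("2553/1"));   // valid handle
        System.out.println(statusFor("2553/929")); // the handle from the error report
    }
}
```

In the real servlet the equivalent move would be calling response.sendError with the 404 status and returning early, so crawlers receive a response they treat as "page gone".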
Re: [Dspace-tech] Issue about Google crawler
I should point out that my robots.txt suggestions assume you don't want any stats pages crawled at all. If that's not the case, it's probably best to apply the patch for DS-689 and wait for Google to de-index the pages (and to make the robots.txt entries more specific if only a few invalid handles are being requested).

Cheers,
Kim

On 6 October 2010 00:30, Kim Shepherd kim.sheph...@gmail.com wrote:

Hi Panyarak,

It might be an idea to add /displaystats to your JSPUI's robots.txt and to any Google Webmaster Tools robots.txt files or Page Removal Requests. [...]
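[As a concrete illustration of the robots.txt advice in this thread: a blanket rule versus a more specific one. The paths below are assumptions based on the /psukb context path visible in the error report URL; adjust them to your own deployment, and note that matching against query strings is honoured by Googlebot but not by every crawler.]

```
# Blanket rule: keep all crawlers away from every JSPUI stats page
User-agent: *
Disallow: /psukb/displaystats

# More specific alternative, if only a few invalid handles are being requested
# (hypothetical example using the handle from the error report)
User-agent: *
Disallow: /psukb/displaystats?handle=2553/929
```

Either way, robots.txt only stops future crawling; getting the already-indexed URLs dropped still relies on Google seeing a 404/410 or on a Page Removal Request.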