Christophe Dupriez created ANY23-115:
----------------------------------------

             Summary: Empty spans seem to break ANY23
                 Key: ANY23-115
                 URL: https://issues.apache.org/jira/browse/ANY23-115
             Project: Apache Any23
          Issue Type: Bug
          Components: html-scraper
         Environment: Any23.org public scraper
            Reporter: Christophe Dupriez


One of the roughly 2,000 URLs affected by the problem:
http://www.oceanexpert.net/viewMemberRecord.php?&memberID=20045

The piece of HTML creating the problem seems to be:
<h1>
    Details of<span itemprop="name"> <span itemprop="honorificPrefix"></span>&nbsp;<span itemprop="givenName">Laury</span>&nbsp; <span itemprop="familyName">Miller</span></span>
</h1>
(This markup may disappear from the page, as we may work around the problem on our side.)
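
A page-side workaround of the kind mentioned above could be to omit any itemprop span whose value is empty. The following is only a minimal sketch under stated assumptions: the class NameMarkup and its methods are hypothetical, it is written in Java for illustration (the affected site is served by a PHP script), and it is not the site's actual template code.

```java
public class NameMarkup {

    /** Escapes the characters that matter inside HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one itemprop span, or "" when the value is null or blank. */
    static String prop(String name, String value) {
        if (value == null || value.trim().isEmpty()) {
            return ""; // skip the span entirely instead of emitting an empty one
        }
        return "<span itemprop=\"" + name + "\">" + escape(value.trim()) + "</span>";
    }

    /** Renders the nested name markup, leaving out empty parts. */
    static String renderName(String prefix, String given, String family) {
        String[] parts = {
            prop("honorificPrefix", prefix),
            prop("givenName", given),
            prop("familyName", family)
        };
        StringBuilder inner = new StringBuilder();
        for (String part : parts) {
            if (part.isEmpty()) {
                continue;
            }
            if (inner.length() > 0) {
                inner.append(' ');
            }
            inner.append(part);
        }
        if (inner.length() == 0) {
            return ""; // nothing to describe, so emit no name span at all
        }
        return "<span itemprop=\"name\">" + inner + "</span>";
    }

    public static void main(String[] args) {
        // An empty honorific prefix no longer produces an empty span.
        System.out.println(renderName("", "Laury", "Miller"));
    }
}
```

With an empty prefix, the output contains only the givenName and familyName spans inside the name span.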

Error message:
Internal error.
================================================================
java.lang.IllegalArgumentException: Invalid content ''
        at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
        at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
        at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
        at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
        at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
        at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
        at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
        at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
        at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
        at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
        at org.apache.any23.Any23.extract(Any23.java:294)
        at org.apache.any23.Any23.extract(Any23.java:446)
        at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
        at org.apache.any23.servlet.Servlet.doGet(Servlet.java:74)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:662)
================================================================
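
The trace suggests that a value object rejects empty content in its constructor, so one empty span aborts extraction for the whole page. The sketch below illustrates that failure mode and one possible tolerant approach, which is to skip empty values before constructing the object. It is an assumption-laden illustration only: StrictValue and collect are hypothetical stand-ins, not the actual Any23 ItemPropValue or MicrodataParser code, and whether skipping is the right fix for Any23 is not established here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyPropertyGuard {

    /** Hypothetical value holder that, like the trace suggests, rejects empty content. */
    static final class StrictValue {
        final String content;

        StrictValue(String content) {
            if (content == null || content.trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid content '" + content + "'");
            }
            this.content = content;
        }
    }

    /** Builds values from raw text contents, skipping empty ones instead of failing. */
    static List<StrictValue> collect(List<String> rawContents) {
        List<StrictValue> values = new ArrayList<>();
        for (String raw : rawContents) {
            if (raw == null || raw.trim().isEmpty()) {
                continue; // tolerate empty properties such as <span itemprop="..."></span>
            }
            values.add(new StrictValue(raw));
        }
        return values;
    }

    public static void main(String[] args) {
        // Text contents as they would appear for the three nested spans in the report.
        List<String> contents = Arrays.asList("", "Laury", "Miller");

        try {
            new StrictValue(contents.get(0)); // unguarded construction fails
        } catch (IllegalArgumentException e) {
            System.out.println("Unguarded: " + e.getMessage());
        }

        System.out.println("Guarded: kept " + collect(contents).size() + " values");
    }
}
```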

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
