[ 
https://issues.apache.org/jira/browse/ANY23-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated ANY23-115:
---------------------------------------

    Attachment: 0001-ANY23-115-Empty-spans-seem-to-break-ANY23.patch

This patch adds several elements to the group of elements that carry the 'src' 
attribute, and it removes all whitespace and non-visible characters (tab, \n, 
etc.) from extracted microdata. The empty spans were the problem, which I hope 
this patch fixes.
I'll commit it and can revert if required. 
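
To illustrate the kind of cleanup described above, here is a minimal sketch. It is not the code from the attached patch: the class name MicrodataTextCleaner and its methods are hypothetical, and the choice to collapse runs of whitespace into a single space (rather than delete them) is an assumption. The idea is to normalise the extracted text first and then check whether anything is left, so that an empty span can be skipped rather than passed on as an empty value.

```java
// Hypothetical helper, not part of Any23: normalises text extracted from an
// itemprop element before it is used as a property value.
final class MicrodataTextCleaner {

    private MicrodataTextCleaner() {
        // static helpers only
    }

    /**
     * Collapses every run of whitespace, non-breaking spaces and control
     * characters into one space, then trims both ends.
     */
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        // \s matches space, tab, \n, \r, \f and \x0B; \u00A0 is the
        // non-breaking space produced by &nbsp;; \p{Cntrl} matches the
        // remaining ASCII control characters.
        return raw.replaceAll("[\\s\\u00A0\\p{Cntrl}]+", " ").trim();
    }

    /** Returns true when the cleaned text is non-empty. */
    static boolean hasContent(String raw) {
        return !clean(raw).isEmpty();
    }

    public static void main(String[] args) {
        // Text of the empty honorificPrefix span from the report.
        System.out.println(hasContent(""));
        // Text surrounded by invisible characters.
        System.out.println(clean("\tLaury\n"));
    }
}
```

A caller could test hasContent() on the element text and skip the property when it returns false, instead of constructing a value from an empty string.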
                
> Empty spans seem to break ANY23
> -------------------------------
>
>                 Key: ANY23-115
>                 URL: https://issues.apache.org/jira/browse/ANY23-115
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: html-scraper, microdata
>    Affects Versions: 0.7.0
>         Environment: Any23.org public scraper
>            Reporter: Christophe Dupriez
>             Fix For: 0.9.0
>
>         Attachments: 0001-ANY23-115-Empty-spans-seem-to-break-ANY23.patch, 
> json-pretty-printer.html
>
>
> One of the 2 thousand URLs with the problem:
> http://www.oceanexpert.net/viewMemberRecord.php?&memberID=20045
> The piece of HTML creating the problem seems to be:
> <h1>
>                               Details of<span itemprop="name"> <span 
> itemprop="honorificPrefix"></span>&nbsp;<span 
> itemprop="givenName">Laury</span>&nbsp; <span 
> itemprop="familyName">Miller</span></span>
>                                                       </h1>
> (this may disappear, as we may work around the problem)
> Error message:
> Internal error.
> ================================================================
> java.lang.IllegalArgumentException: Invalid content ''
>       at org.apache.any23.extractor.microdata.ItemPropValue.<init>(ItemPropValue.java:89)
>       at org.apache.any23.extractor.microdata.MicrodataParser.getPropertyValue(MicrodataParser.java:341)
>       at org.apache.any23.extractor.microdata.MicrodataParser.getItemProps(MicrodataParser.java:394)
>       at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:471)
>       at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:186)
>       at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:203)
>       at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:100)
>       at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:62)
>       at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:477)
>       at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:260)
>       at org.apache.any23.Any23.extract(Any23.java:294)
>       at org.apache.any23.Any23.extract(Any23.java:446)
>       at org.apache.any23.servlet.WebResponder.runExtraction(WebResponder.java:113)
>       at org.apache.any23.servlet.Servlet.doGet(Servlet.java:74)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
>       at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
>       at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
>       at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>       at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>       at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>       at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>       at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
>       at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>       at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>       at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>       at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
>       at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
>       at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>       at java.lang.Thread.run(Thread.java:662)
> ================================================================

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
