[ 
https://issues.apache.org/jira/browse/ANY23-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424176#comment-17424176
 ] 

ASF GitHub Bot commented on ANY23-504:
--------------------------------------

lewismc opened a new pull request #205:
URL: https://github.com/apache/any23/pull/205


   *Context*
   This PR is a WIP.
   The unit test attempts to perform a simple document extraction using the BBC Scotland HTML as input.
   
   *How to debug*
   To inspect the `TriXExtractor` issues, set a breakpoint at [org/apache/any23/extractor/SingleDocumentExtraction.java#L543](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L543), then evaluate the following expression:
   
   ```
   extractionResult.getIssues().toArray()[1]
   ```
   
   The result is the following fatal issue:
   
   ```
   FATAL:       'org.eclipse.rdf4j.rio.RDFParseException: The attribute name must be specified in the attribute-list declaration for element "charset". [line 181, column 45]
        at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportFatalError(RDFParserHelper.java:333)
        at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportFatalError(AbstractRDFParser.java:724)
        at org.eclipse.rdf4j.rio.trix.TriXParser.reportFatalError(TriXParser.java:253)
        at org.eclipse.rdf4j.rio.trix.TriXParser.fatalError(TriXParser.java:419)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
        at org.apache.xerces.impl.XMLDTDScannerImpl.scanAttlistDecl(Unknown Source)
        at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
        at org.apa...'  (-1,-1)
   ```




> Optionally disable remote HTTP connections when resolving XML entities
> ----------------------------------------------------------------------
>
>                 Key: ANY23-504
>                 URL: https://issues.apache.org/jira/browse/ANY23-504
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.6
>
>
> The Any23 parser should optionally avoid opening HTTP connections when 
> parsing XML.
> While testing Nutch's Any23 plugin with Any23 2.5 (NUTCH-2892) on the file 
> "BBC_News_Scotland.htm", the parser hung for about two minutes with an 
> open HTTP connection to "hans-moleman.w3.org" and the following stack:
> {noformat}
> "parse-0" #19 daemon prio=5 os_prio=0 cpu=1432.93ms elapsed=15.85s tid=0x00007efc713bd800 nid=0x16ff4 runnable  [0x00007efc29f2d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0([email protected]/Native Method)
>         at java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
>         at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
>         at java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
>         at java.io.BufferedInputStream.fill([email protected]/BufferedInputStream.java:252)
>         at java.io.BufferedInputStream.read1([email protected]/BufferedInputStream.java:292)
>         at java.io.BufferedInputStream.read([email protected]/BufferedInputStream.java:351)
>         - locked <0x000000071be1bb68> (a java.io.BufferedInputStream)
>         at sun.net.www.http.HttpClient.parseHTTPHeader([email protected]/HttpClient.java:754)
>         at sun.net.www.http.HttpClient.parseHTTP([email protected]/HttpClient.java:689)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0([email protected]/HttpURLConnection.java:1615)
>         - locked <0x000000071be11040> (a sun.net.www.protocol.http.HttpURLConnection)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream([email protected]/HttpURLConnection.java:1520)
>         - locked <0x000000071be11040> (a sun.net.www.protocol.http.HttpURLConnection)
>         at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
>         at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>         at org.eclipse.rdf4j.common.xml.SimpleSAXParser.parse(SimpleSAXParser.java:197)
>         - locked <0x000000071bfe6f28> (a org.eclipse.rdf4j.common.xml.SimpleSAXParser)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:177)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:134)
>         at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:86)
>         at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:39)
>         at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:523)
>         at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:265)
>         at org.apache.any23.Any23.extract(Any23.java:315)
>         at org.apache.any23.Any23.extract(Any23.java:483)
>         at org.apache.any23.Any23.extract(Any23.java:345)
>         at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:106)
>         at org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:81)
>         at org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:153)
>         at org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:55)
>         at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>         at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
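
The stack above shows the XML parser opening an `HttpURLConnection` while starting the DTD entity. For illustration only, one common way to keep a SAX parser from making such connections is to install an `EntityResolver` that answers every external entity with empty content. The following is a minimal sketch using only the JDK's JAXP/SAX API; it is not the change made in Any23 or RDF4J. The class name `OfflineXmlParsing`, the method `parseOffline` and the `example.invalid` URL are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Any23, RDF4J or Nutch.
public class OfflineXmlParsing {

    /**
     * Parses the given XML without fetching external entities.
     * Returns the system IDs the parser asked for and that were
     * answered with empty content instead.
     */
    static List<String> parseOffline(String xml) throws Exception {
        List<String> blocked = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler());

        // Answer every external entity (including the external DTD subset)
        // with an empty stream, so the parser has nothing to download.
        reader.setEntityResolver((publicId, systemId) -> {
            blocked.add(systemId);
            return new InputSource(new StringReader(""));
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return blocked;
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at a host that cannot resolve (assumed URL).
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://example.invalid/strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println("Entities answered locally: " + parseOffline(xml));
    }
}
```

A trade-off of this approach is that entities declared only in the skipped DTD (for example `&nbsp;` in XHTML) are no longer defined, so documents that use them may fail to parse. Making the behaviour optional, as the issue requests, lets callers choose.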


