[jira] [Comment Edited] (ANY23-504) Optionally disable remote HTTP connections when resolving XML entities

Lewis John McGibbney (Jira) Thu, 14 Oct 2021 10:36:22 -0700


    [ 
https://issues.apache.org/jira/browse/ANY23-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428474#comment-17428474
 ]


Lewis John McGibbney edited comment on ANY23-504 at 10/14/21, 5:35 PM:
-----------------------------------------------------------------------

Hi [~snagel], yes, it helps a lot. My next question was going to be which Any23 
extractors were activated via the 
[any23.extractors|https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1263-L1267]
 configuration setting. As you can see, by default it is set to 
_html-microdata_ only.
The behavior you are experiencing is directly in line with what I would expect 
if I activated the _*rdf-xml*_ extractor on an HTML document. 
This is confirmed by the media types defined in the [RDFXMLExtractorFactory 
constructor 
semantics|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/rdf/RDFXMLExtractorFactory.java#L42].

Let me state my thoughts. This is NOT a bug. It is, however, a problem. 

We could provide some sort of _break_ mechanism that would let us report to 
the client that an error has occurred because the configured extractor is 
incapable of processing the input data.
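
To make the _break_ idea concrete, here is a minimal sketch of a fail-fast check that compares the detected media type of the input with the media types an extractor declares. Everything in it is hypothetical: ExtractorGuard, IncompatibleExtractorException and check() do not exist in Any23, and the media type set is an illustrative assumption rather than the actual list in RDFXMLExtractorFactory.

```java
import java.util.Set;

/** Hypothetical fail-fast guard; none of these names exist in Any23. */
class ExtractorGuard {

    /** Hypothetical error reported to the client when an extractor cannot handle the input. */
    static class IncompatibleExtractorException extends Exception {
        IncompatibleExtractorException(String message) {
            super(message);
        }
    }

    /**
     * Throws if the detected media type is not one the extractor declares,
     * instead of letting the extractor attempt to parse unsuitable input.
     */
    static void check(String extractorName, Set<String> supportedTypes, String detectedType)
            throws IncompatibleExtractorException {
        if (!supportedTypes.contains(detectedType)) {
            throw new IncompatibleExtractorException(
                "Extractor '" + extractorName + "' cannot process input of type '"
                    + detectedType + "'; supported types: " + supportedTypes);
        }
    }

    public static void main(String[] args) {
        // Assumed media type set, for illustration only.
        Set<String> rdfXmlTypes = Set.of("application/rdf+xml");
        try {
            check("rdf-xml", rdfXmlTypes, "text/html");
        } catch (IncompatibleExtractorException e) {
            // The client gets an explicit error rather than a hanging parse.
            System.out.println(e.getMessage());
        }
    }
}
```

Whether such a check should abort the whole extraction or only skip the offending extractor is a design choice this sketch leaves open.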

Does that make sense? Thanks for sticking with me on this one...


was (Author: lewismc):
Hi [~snagel] yes it helps a lot. My next question was to ask what any23 
extractors were activated via the 
[any23.extractors|https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1263-L1267]
 configuration setting. As you can see, by default we only have it set to 
_html-microdata_.
The behavior you are experiencing is directly inline with what I would expect 
if I activated the _*rdf-xml*_ extractor on a HTML document. 
This is validated by the Media Types defined within the [RDFXMLExtractorFactory 
constructor 
semantics|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/rdf/RDFXMLExtractorFactory.java#L42].

Let me state my thoughts. This is NOT a bug. It is however a problem. Further I 
believe that we could provide some sort of _break_ mechanism which would allow 
us to report to the client that an error as a result of the extractor overrides 
not being suitable as extractor implementations for the given input data.

Does that make sense? Thanks for sticking with me on this one...

> Optionally disable remote HTTP connections when resolving XML entities
> ----------------------------------------------------------------------
>
>                 Key: ANY23-504
>                 URL: https://issues.apache.org/jira/browse/ANY23-504
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.6
>
>
> The Any23 parser should optionally avoid opening HTTP connections when 
> parsing XML.
> While testing Nutch's Any23 plugin with 2.5 (NUTCH-2892) on the file 
> "BBC_News_Scotland.htm", the parser hung for about two minutes with an 
> open HTTP connection to "hans-moleman.w3.org" and the following stack:
> {noformat}
> "parse-0" #19 daemon prio=5 os_prio=0 cpu=1432.93ms elapsed=15.85s 
> tid=0x00007efc713bd800 nid=0x16ff4 runnable  [0x00007efc29f2d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0([email protected]/Native 
> Method)
>         at 
> java.net.SocketInputStream.socketRead([email protected]/SocketInputStream.java:115)
>         at 
> java.net.SocketInputStream.read([email protected]/SocketInputStream.java:168)
>         at 
> java.net.SocketInputStream.read([email protected]/SocketInputStream.java:140)
>         at 
> java.io.BufferedInputStream.fill([email protected]/BufferedInputStream.java:252)
>         at 
> java.io.BufferedInputStream.read1([email protected]/BufferedInputStream.java:292)
>         at 
> java.io.BufferedInputStream.read([email protected]/BufferedInputStream.java:351)
>         - locked <0x000000071be1bb68> (a java.io.BufferedInputStream)
>         at 
> sun.net.www.http.HttpClient.parseHTTPHeader([email protected]/HttpClient.java:754)
>         at 
> sun.net.www.http.HttpClient.parseHTTP([email protected]/HttpClient.java:689)
>         at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0([email protected]/HttpURLConnection.java:1615)
>         - locked <0x000000071be11040> (a 
> sun.net.www.protocol.http.HttpURLConnection)
>         at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream([email protected]/HttpURLConnection.java:1520)
>         - locked <0x000000071be11040> (a 
> sun.net.www.protocol.http.HttpURLConnection)
>         at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown 
> Source)
>         at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown 
> Source)
>         at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown 
> Source)
>         at 
> org.eclipse.rdf4j.common.xml.SimpleSAXParser.parse(SimpleSAXParser.java:197)
>         - locked <0x000000071bfe6f28> (a 
> org.eclipse.rdf4j.common.xml.SimpleSAXParser)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:177)
>         at org.eclipse.rdf4j.rio.trix.TriXParser.parse(TriXParser.java:134)
>         at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:86)
>         at 
> org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:39)
>         at 
> org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:523)
>         at 
> org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:265)
>         at org.apache.any23.Any23.extract(Any23.java:315)
>         at org.apache.any23.Any23.extract(Any23.java:483)
>         at org.apache.any23.Any23.extract(Any23.java:345)
>         at 
> org.apache.nutch.any23.Any23ParseFilter$Any23Parser.parse(Any23ParseFilter.java:106)
>         at 
> org.apache.nutch.any23.Any23ParseFilter$Any23Parser.<init>(Any23ParseFilter.java:81)
>         at 
> org.apache.nutch.any23.Any23ParseFilter.filter(Any23ParseFilter.java:153)
>         at 
> org.apache.nutch.parse.HtmlParseFilters.filter(HtmlParseFilters.java:55)
>         at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:257)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
>         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
>         at 
> java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
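
The stack above shows the SAX parser fetching a DTD over HTTP. As a rough illustration of what "disable remote HTTP connections" could mean at the parser level, the sketch below uses only the standard JAXP/SAX API: an EntityResolver that resolves every external entity to empty content, so no connection is opened. It is not the Any23 or RDF4J configuration mechanism, and the class OfflineXmlParsing and its methods are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper showing SAX parsing without remote entity resolution. */
class OfflineXmlParsing {

    /** Builds an XMLReader that resolves all external entities to empty content. */
    static XMLReader newOfflineReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Returning an empty InputSource stops the parser from opening the system ID URL.
        reader.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }

    /** Parses the document offline and returns the local name of its root element. */
    static String rootElement(String xml) throws Exception {
        final String[] root = new String[1];
        XMLReader reader = newOfflineReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if (root[0] == null) {
                    root[0] = localName;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        System.out.println(rootElement(doc));
    }
}
```

One trade-off of this approach: entities that are declared only in the remote DTD are no longer defined, so documents that rely on them may fail to parse.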



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
