[ 
https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259032#comment-13259032
 ] 

Hudson commented on ANY23-76:
-----------------------------

Integrated in Any23-trunk #178 (See 
[https://builds.apache.org/job/Any23-trunk/178/])
    Improved HCardExtractor performance. Related to issue ANY23-76. 
(Revision 1328663)

     Result = UNSTABLE
mostarda : 
Files : 
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/DomUtils.java
* /incubator/any23/trunk/core/src/main/java/org/apache/any23/extractor/html/HCardExtractor.java
* /incubator/any23/trunk/core/src/test/java/org/apache/any23/extractor/html/HCardExtractorTest.java
* /incubator/any23/trunk/core/src/test/resources/microformats/hcard/performance.html

                
> Improve runtime of the Microformat extractor on documents with many relations.
> ------------------------------------------------------------------------------
>
>                 Key: ANY23-76
>                 URL: https://issues.apache.org/jira/browse/ANY23-76
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Assignee: Michele Mostarda
>            Priority: Trivial
>         Attachments: MicroformatSpeed.patch
>
>
> For some large documents with many Microformat tuples, the extensive use of 
> XPath in the DomUtils class causes Microformat extraction to be slow. I've 
> marked this as trivial as it's a corner case. 
> To reproduce the problem the patch addresses, run the Microformat extractor 
> on the following URL:
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
> I include a patch that improves performance at the cost of code simplicity. 
> I hope someone who is more involved in the project can decide whether it's a 
> good idea to use the patch or to address this issue in another way. 
> The patch replaces commonly used XPath queries with DOM tree traversals, e.g. 
> getting all nodes with 'class' attributes. On my machine the time to parse 
> the given document is reduced from around 105 seconds to 14 seconds. 
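
For illustration only, the kind of replacement described above (an XPath query swapped for a plain DOM walk) might look roughly like the sketch below. This is not the code from MicroformatSpeed.patch or from DomUtils; the class name ClassAttributeScan and the methods byXPath, byTraversal and parse are hypothetical, and only the standard JDK XML APIs are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Any23: compares two ways of finding
// every element that carries a 'class' attribute.
public class ClassAttributeScan {

    /** XPath version: evaluates //*[@class] against the whole document. */
    public static List<Node> byXPath(Document doc) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@class]", doc, XPathConstants.NODESET);
        List<Node> result = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            result.add(hits.item(i));
        }
        return result;
    }

    /** Traversal version: a single depth-first walk, visiting each node once. */
    public static List<Node> byTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    // Recursive for brevity; very deep trees might call for an explicit stack.
    private static void collect(Node node, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute("class")) {
            out.add(node);
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    /** Parses a well-formed XML/XHTML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<div class='vcard'><span class='fn'>A</span><p>x</p></div>");
        System.out.println("XPath hits:     " + byXPath(doc).size());
        System.out.println("Traversal hits: " + byTraversal(doc).size());
    }
}
```

Both methods should return the same set of elements. Whether the traversal is actually faster depends on the XPath implementation and on how often the query is repeated, so the timings quoted above should be taken as the reporter's measurement rather than something this sketch demonstrates.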

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
