[ 
https://issues.apache.org/jira/browse/ANY23-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258856#comment-13258856
 ] 

Michele Mostarda commented on ANY23-76:
---------------------------------------

Hi Tim,
  I applied your patch and measured its performance. On my Mac (2.8 GHz Intel 
Core 2 Duo, 8 GB 1067 MHz DDR3 memory) with the default JVM configuration (no 
heap size specified), I observed a roughly 2x performance improvement (from 
21 sec to 9 sec) on the same input page you reported, which in my opinion is 
enough to integrate the patch. 
Thanks a lot.
All the best.
                
> Improve runtime of the Microformat extractor on documents with many relations.
> ------------------------------------------------------------------------------
>
>                 Key: ANY23-76
>                 URL: https://issues.apache.org/jira/browse/ANY23-76
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Timothy Potter
>            Assignee: Michele Mostarda
>            Priority: Trivial
>         Attachments: MicroformatSpeed.patch
>
>
> For some large documents with many Microformat tuples, the extensive use of 
> XPath in the DomUtils class causes Microformat extraction to be slow. I've 
> marked this as trivial because it's a corner case. 
> To reproduce the problem the patch addresses, run the Microformat extractor 
> on the following URL:
> http://en.wikipedia.org/wiki/List_of_Nike_missile_locations
> I include a patch that improves performance at the cost of code simplicity. 
> I hope someone who is more involved in the project can decide whether it's a 
> good idea to use the patch, or maybe address this issue in another way. 
> The patch replaces commonly used XPath queries with DOM tree traversals, e.g. 
> getting all nodes with 'class' attributes. On my machine, the time to parse 
> the given document is reduced from around 105 seconds to 14 seconds. 
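
To illustrate the kind of change described above (an XPath query replaced by a 
plain DOM traversal), here is a minimal sketch. It is not the code from 
MicroformatSpeed.patch: the class and method names (ClassAttributeScan, 
findWithXPath, findWithTraversal, parse) are hypothetical and use only the 
standard org.w3c.dom and javax.xml.xpath APIs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not taken from Any23's DomUtils.
final class ClassAttributeScan {

    // XPath variant: concise, but every call goes through the XPath engine.
    static List<Node> findWithXPath(Node root) throws Exception {
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("descendant-or-self::*[@class]", root, XPathConstants.NODESET);
        List<Node> result = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            result.add(hits.item(i));
        }
        return result;
    }

    // Traversal variant: a depth-first walk that keeps every element
    // carrying a 'class' attribute, in document order.
    static List<Node> findWithTraversal(Node root) {
        List<Node> result = new ArrayList<>();
        collect(root, result);
        return result;
    }

    private static void collect(Node node, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getAttributes().getNamedItem("class") != null) {
            out.add(node);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Helper for trying the two variants on a small well-formed snippet.
    static Node parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Both methods should return the same nodes for the same input; the traversal 
variant is longer, which matches the trade-off against code simplicity 
mentioned in the description. The recursive walk is kept for brevity; a very 
deeply nested document might call for an explicit stack instead.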

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to