[ 
https://issues.apache.org/jira/browse/ANY23-75?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254621#comment-13254621
 ] 

Timothy Potter commented on ANY23-75:
-------------------------------------

Thanks Michele.

I'll explain the change anyway:

  The current implementation is slow on pages with lots of items because it 
does nested iterations over all itemscope and itemprop nodes under the given 
scopeNode. In the inner loop it builds an XPath string for each node to test 
whether the nodes are related. That relationship is already inherent in the 
tree structure of the DOM, so the patch changes the code to do a traversal of 
the DOM tree, limited to nodes under the given scopeNode. Reading the code 
itself is probably the best way to understand the change.
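
For readers without the attachment at hand, the idea can be sketched roughly 
as follows. This is not the code from MicrodataParser.diff; it is a minimal 
illustration that assumes plain org.w3c.dom nodes, and the class and method 
names (ItemPropTraversal, collectItemProps, itemPropNames) are hypothetical. 
It also ignores itemref, which a real Microdata parser would need to handle. 
The point is only that a recursive walk which stops at nested itemscope 
elements finds the related itemprop nodes without comparing XPath strings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ItemPropTraversal {

    /**
     * Collects the itemprop elements that belong to scopeNode by walking
     * only the subtree under it (hypothetical helper, not the Any23 API).
     */
    static List<Element> collectItemProps(Node scopeNode) {
        List<Element> result = new ArrayList<>();
        for (Node c = scopeNode.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, result);
        }
        return result;
    }

    private static void visit(Node node, List<Element> result) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text, comments, etc.
        }
        Element el = (Element) node;
        if (el.hasAttribute("itemprop")) {
            result.add(el);
        }
        if (el.hasAttribute("itemscope")) {
            return; // a nested item owns its own properties; do not descend
        }
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, result);
        }
    }

    /** Convenience wrapper for the demo: parses XML and returns itemprop values. */
    static List<String> itemPropNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> names = new ArrayList<>();
        for (Element el : collectItemProps(doc.getDocumentElement())) {
            names.add(el.getAttribute("itemprop"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div itemscope=''>"
                + "<span itemprop='name'>A</span>"
                + "<div itemprop='address' itemscope=''>"
                + "<span itemprop='street'>S</span>"
                + "</div></div>";
        System.out.println(itemPropNames(xml));
    }
}
```

Each node under scopeNode is visited at most once, so the cost grows with 
the size of the subtree rather than with the number of itemscope/itemprop 
pairs.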
                
> Improve runtime of the Microdata extractor on documents with many relations.
> ----------------------------------------------------------------------------
>
>                 Key: ANY23-75
>                 URL: https://issues.apache.org/jira/browse/ANY23-75
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Timothy Potter
>             Fix For: 0.7.0
>
>         Attachments: MicrodataParser.diff
>
>
> I've been running Any23 on a big web crawler dump. I found that for certain 
> documents with a lot of Microdata relations, the method 
> MicrodataParser.getItemProps() becomes very slow. As a result, processing one 
> document can take several minutes. An example of a problematic page can be 
> seen here: http://dreamtime.fftunes.com/
> I'll attach a patch that greatly improves the performance of this method. 
> I was wondering if someone could have a look at it and include it in the 
> next release if possible.


        
