[ 
https://issues.apache.org/jira/browse/CLEREZZA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599772#comment-13599772
 ] 

Rupert Westenthaler commented on CLEREZZA-683:
----------------------------------------------

Status Update:

For a long time this implementation was not really stable, as there were 
occasional unit test failures related to filtered iterations (the unit test 
creates a randomized graph and compares iteration results of the SimpleMGraph 
with the IndexedMGraph implementation).

The cause of those failures was unclear for a long time, because they could not 
be replicated, even when running the same test twice with the exact same graph. 
In addition, those failures never happened on Mac OS.

Because of that, the conclusion was that the failures are related to 
java.lang.Object#hashCode() conflicts of bNodes. Such conflicts would 
corrupt the natural order of Triples in the SPO, POS, and OSP indexes and 
could therefore cause the reported failures. This explanation is supported by 
the fact that Stanbol was not affected, because in Stanbol we do not use 
bNodes. 

With revision 1437077 [1] a workaround for this problem was introduced. The 
Comparators used to ensure the natural order now check for hashCode() 
conflicts. If such a conflict is detected, the affected bNodes are stored in a 
map using the integer hash code as key and the list of conflicting bNodes as 
value. The order in this list is then used to decide the natural order of 
the bNodes in the SPO, POS, and OSP indexes.
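
The workaround described above can be illustrated with a minimal sketch. This 
is not the code from revision 1437077; the class CollisionAwareOrder, the Node 
stand-in for a bNode, and the helper position(..) are hypothetical names chosen 
for illustration. It only shows the idea of a hash-keyed map of conflict lists 
whose list position acts as the tie-breaker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class CollisionAwareOrder {

    /** Hypothetical stand-in for a bNode: identity equality, caller-chosen hash. */
    static final class Node {
        private final int hash;
        Node(int hash) { this.hash = hash; }
        @Override public int hashCode() { return hash; }
    }

    // hash code -> distinct nodes seen with that hash, in order of detection
    private final Map<Integer, List<Node>> conflicts = new HashMap<>();

    /** Orders by hash code; distinct nodes with equal hashes are ordered by list position. */
    final Comparator<Node> comparator = (a, b) -> {
        if (a == b) {
            return 0;
        }
        int c = Integer.compare(a.hashCode(), b.hashCode());
        if (c != 0) {
            return c;
        }
        // Same hash but different nodes: fall back to the conflict list.
        synchronized (conflicts) {
            List<Node> list = conflicts.computeIfAbsent(a.hashCode(), k -> new ArrayList<>());
            return Integer.compare(position(list, a), position(list, b));
        }
    };

    /** Returns the index of n in the list (by identity), appending it if absent. */
    private static int position(List<Node> list, Node n) {
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i) == n) {
                return i;
            }
        }
        list.add(n);
        return list.size() - 1;
    }

    public static void main(String[] args) {
        Node a = new Node(42);
        Node b = new Node(42); // forced hash conflict with a

        // A hash-only order treats b as a duplicate of a and silently drops it.
        Comparator<Node> hashOnlyOrder = Comparator.comparingInt(Node::hashCode);
        TreeSet<Node> hashOnly = new TreeSet<>(hashOnlyOrder);
        hashOnly.add(a);
        hashOnly.add(b);

        // The conflict-aware order keeps both nodes.
        TreeSet<Node> aware = new TreeSet<>(new CollisionAwareOrder().comparator);
        aware.add(a);
        aware.add(b);

        System.out.println("hash-only: " + hashOnly.size()
                + ", conflict-aware: " + aware.size());
    }
}
```

Note that in such a sketch the relative order of two conflicting nodes depends 
on which one is registered first, so it is stable within one JVM run but not 
across runs.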

With this addition the implementation seems to be stable. Therefore the 
suggestion is to move it from the Apache Stanbol codebase over to Apache 
Clerezza. 

[1] http://svn.apache.org/viewvc?rev=1437077&view=rev
                
> Indexed in-memory graph
> -----------------------
>
>                 Key: CLEREZZA-683
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-683
>             Project: Clerezza
>          Issue Type: New Feature
>          Components: rdf.core
>            Reporter: Rupert Westenthaler
>
> # Indexed in-memory graph
> Implementation of a TripleCollection that internally manages SPO, POS, OSP 
> indexes for fast filtered iterators. The current state of development is 
> hosted at 
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/.
> However, the intention is that this module becomes a direct part of Clerezza. 
> ## Background:
> For Apache Stanbol, having fast filtered iterators over in-memory graphs is 
> really important, because Stanbol uses in-memory graphs to store extracted 
> metadata for parsed ContentItems.
> When enhancing longer texts with EnhancementChain configurations that produce 
> a lot of enhancements (e.g. keyword extraction based on dbpedia) such 
> in-memory graphs can get bigger than 100k triples. Especially if also triples 
> for suggested entities are included within the result.
> ## Implementation:
> Because of that I started to implement a TripleCollection that uses TreeMaps 
> to manage SPO, POS, OSP indexes. 
> For fast sorting (comparator) I use the same Resource#hashCode / 
> Resource#toString based solution as used in the rdf.rdfjson serializer. I 
> hope this is also sufficient for Literals (someone should check that).
> The implementation of the "filter(..)" method is purely based on 
> "NavigableSet.subSet(..).iterator()". I only need to wrap the iterator to 
> ensure that on calls to Iterator.remove():
> 1) Triples are removed from all three indexes
> 2) GraphEvents are dispatched correctly
> Note also the trick with the two static fields UriRef MIN and UriRef MAX used 
> to generate lower/upper bound triples as passed to "NavigableSet.subSet(..)".
> The implementation is currently hosted on 
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/
> It has no dependencies to Apache Stanbol. However users that do not want to 
> check-out Stanbol as a whole will need to edit the pom.xml file and provide 
> information usually imported from the parent poms.
> ## Tests:
> This implementation passes all MGraphTest UnitTests.
> In addition I have copied the tests defined for SimpleTripleCollection.
> To compare the performance I also implemented code that
> * creates a random Graph with n Triples
> * creates a TestCase with configurable numbers of Subjects, Predicates and 
> Objects
> * then performs m calls to #filter(...)
> This performance test also runs as a UnitTest
> 1. by using the SimpleMGraph implementation
> 2. by using the IndexedMGraph implementation
> NOTE: While implementing this I noticed that the 
> SimpleTripleCollectionTest does not extend MGraphTest and therefore the 
> SimpleTripleCollection class is not checked against the tests defined by 
> MGraphTest. This might actually be an issue!
> ## Performance
> This is a copy of the output from a run of the PerformanceTest described above:
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> Filter Performance Test (graph size 100000 triples, iterations 1000)
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  
> --- TEST SimpleMGraph with 100000 triples ---
> 10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,P,O] in 8321ms with 2 results
> 18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,P,n] in 7358ms with 734 results
> 25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,n,O] in 7266ms with 100 results
> 31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,P,O] in 6519ms with 232 results
> 39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,n,n] in 7398ms with 8030 results
> 45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,P,n] in 5934ms with 8318000 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,n,O] in 10666ms with 2260 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  
> --- TEST completed in 53463ms
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  
> --- TEST IndexedMGraph 100000 triples ---
> 55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,P,O] in 20ms with 2 results
> 55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,P,n] in 19ms with 734 results
> 55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,n,O] in 33ms with 100 results
> 55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,P,O] in 28ms with 232 results
> 55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [S,n,n] in 21ms with 8030 results
> 57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,P,n] in 1065ms with 8318000 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest - 
> ... run [n,n,O] in 8ms with 2260 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -  
> --- TEST completed in 1194ms
> best
> Rupert
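
The "filter(..)" approach from the quoted description (a sorted index plus 
lower/upper bound triples passed to NavigableSet.subSet(..)) can be sketched 
roughly as follows. This is a simplified illustration, not the IndexedMGraph 
code: it uses plain Strings instead of Resources, orders them 
lexicographically instead of by hashCode/toString, and covers only the SPO 
index for a subject-only pattern. The names SpoIndexSketch, Triple, cmp and 
filterBySubject are hypothetical.

```java
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SpoIndexSketch {

    // Sentinels recognized by identity; they sort below/above every real value.
    static final String MIN = new String("MIN");
    static final String MAX = new String("MAX");

    /** Hypothetical simplified triple; real triples hold Resources, not Strings. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    /** Compares two components, treating MIN and MAX as lower and upper bounds. */
    static int cmp(String a, String b) {
        if (a == b) return 0;
        if (a == MIN || b == MAX) return -1;
        if (a == MAX || b == MIN) return 1;
        return a.compareTo(b);
    }

    /** Subject, then predicate, then object: the order of the SPO index. */
    static final Comparator<Triple> SPO = (x, y) -> {
        int c = cmp(x.s, y.s);
        if (c == 0) c = cmp(x.p, y.p);
        if (c == 0) c = cmp(x.o, y.o);
        return c;
    };

    /** Pattern [S,n,n]: all triples with subject s, as a live view of the index. */
    static NavigableSet<Triple> filterBySubject(NavigableSet<Triple> spo, String s) {
        Triple lower = new Triple(s, MIN, MIN);
        Triple upper = new Triple(s, MAX, MAX);
        return spo.subSet(lower, true, upper, true);
    }

    public static void main(String[] args) {
        NavigableSet<Triple> spo = new TreeSet<>(SPO);
        spo.add(new Triple("a", "p1", "x"));
        spo.add(new Triple("b", "p1", "x"));
        spo.add(new Triple("b", "p2", "y"));
        spo.add(new Triple("c", "p1", "z"));

        for (Triple t : filterBySubject(spo, "b")) {
            System.out.println(t.s + " " + t.p + " " + t.o);
        }
    }
}
```

Because subSet(..) returns a view, removing through its iterator only affects 
this one index. That is why the description above wraps the iterator so that a 
removal is applied to all three indexes and the GraphEvents are dispatched.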

