Hi,

For TarMK, none of this is an issue, as TarMK is all in memory in 1 JVM with local disk. Scaling up by throwing RAM and IO at the problem is a viable option, as far as it's safe/sensible to do so. But TarMK doesn't cluster, and if it did, this would probably be an issue.
I think, but could easily be wrong, that in the case of MongoDB all modifications to indexes generated by a commit are persisted in a single batch request (i.e. 1 MongoDB statement). The time taken to process that request depends on the size of the request; large requests can take seconds on large databases. It's not the distance between Oak and the database that matters, since only 1 MongoDB statement is used, but the processing time of that statement in MongoDB. With MongoDB set up correctly to not lose data, this statement must be written to a majority of replicas before processing can continue, and MongoDB replication is sequential.

Also, IIRC, the root document is not persisted on every commit, but synchronised periodically (once every second), similar to fsync on a disk. So the indexes (in fact all Oak documents) are synchronous on the local Oak instance and synchronous on remote Oak instances, but with a minimum data latency of the root document sync rate (1s). IIUC the 1 second sync period is a performance optimisation, as the root document must be updated by every commit and is hence a global singleton in an Oak cluster, already hot as you point out in 3.

I have been involved on the periphery of OAK-4638 and OAK-4412. For me, the main benefit is reducing the number of documents stored in the database. While it is true that the number of documents doesn't matter at small scale, every document is counted inside Oak and every document has an impact on database performance. Having around 66% of the documents not contributing to repository content storage reduces the ultimate capacity limit of an Oak repository by the same amount: 2/3rds. With many applications built on top of Oak exploiting the deep content structure that Oak encourages and makes so easy, this limit rapidly becomes a reality. What limit? A limit at which one of the components ceases to work.
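To make the single-batched-statement point concrete, here is a toy cost model of my own (an illustration, not Oak code, and the numbers are made up): when all index modifications for a commit travel in one request, the network round trip is paid once, so server-side processing time dominates; issuing one statement per index entry would instead multiply the round-trip latency.

```python
# Toy model: commit cost in milliseconds. "rtt_ms" is the network round
# trip to MongoDB, "per_entry_ms" the assumed server-side cost per index
# entry. All figures are illustrative, not measured Oak/MongoDB numbers.
def commit_time_ms(entries, rtt_ms, per_entry_ms, batched):
    if batched:
        # One statement: pay the round trip once, then server processing.
        return rtt_ms + entries * per_entry_ms
    # One statement per entry: pay the round trip for every entry.
    return entries * (rtt_ms + per_entry_ms)

print(commit_time_ms(1000, 5.0, 0.5, batched=True))   # 505.0
print(commit_time_ms(1000, 5.0, 0.5, batched=False))  # 5500.0
```

With batching, halving the network latency barely changes the total; without it, latency dominates, which is why only the server-side processing time of the single statement matters.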
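A back-of-the-envelope sketch of the capacity argument (the document limit figure below is purely an assumed, illustrative number, not an Oak or MongoDB constant): if roughly 2/3 of all documents in the store are synchronous-index documents, only about 1/3 of whatever practical document limit exists remains available for actual repository content.

```python
# Illustrative arithmetic only; the 3 billion "practical limit" is an
# assumption for the example, not a documented Oak or MongoDB figure.
def content_capacity(doc_limit, index_fraction):
    """Documents left for repository content after index overhead."""
    return doc_limit * (1 - index_fraction)

limit = 3_000_000_000
print(round(content_capacity(limit, 2 / 3)))  # 1000000000
```

In other words, whatever the real limit turns out to be, the index overhead shrinks the usable content capacity by the same 2/3rds factor.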
I don't know which one and when, but it's there. A repository containing 100M content items may need 1E10 documents due to both the application implementation and synchronous indexing. Perhaps the application should fix itself, but so should Oak. Quite apart from all that, it is embarrassingly wasteful to be using Oak documents in this way for non-TarMK repos, rather like implementing Lucene in SQL.

To recap: addressing 1 and 2 is a requirement to reduce waste, increase the performance of update operations and increase data scalability. 3 is not an issue; the pressure is already there without any indexes, since every write has to update the root document for that update to become visible, by design.

I am not a core Oak developer, just an observer, so if I got anything wrong, please someone correct me and I will learn from the experience.

Best Regards
Ian

On 5 August 2016 at 18:04, Michael Marth <[email protected]> wrote:

> Hi,
>
> I have noticed OAK-4638 and OAK-4412 – which both deal with particular
> problematic aspects of property indexes. I realise that both issues deal
> with slightly different problems and hence come to different suggested
> solutions.
> But still I felt it would be good to take a holistic view on the different
> problems with property indexes. Maybe there is a unified approach we can
> take.
>
> To my knowledge there are 3 areas where property indexes are problematic
> or not ideal:
>
> 1. Number of nodes: Property indexes can create a large number of nodes.
> For properties that are very common the number of index nodes can be almost
> as large as the number of the content nodes. A large number of nodes is not
> necessarily a problem in itself, but if the underlying persistence is e.g.
> MongoDB then those index nodes (i.e. MongoDB documents) cause pressure on
> MongoDB’s mmap architecture which in turn affects reading content nodes.
>
> 2. Write performance: when the persistence (i.e. MongoDB) and Oak are “far
> away from each other” (i.e. high network latency or low throughput) then
> synchronous property indexes affect the write throughput as they may cause
> the payload to double in size.
>
> 3. I have no data on this one – but think it might be a topic: property
> index updates usually cause commits to have / as the commit root. This
> results in pressure on the root document.
>
> Please correct me if I got anything wrong or inaccurate in the above.
>
> My point is, however, that at the very least we should have clarity which
> one of the items above we intend to tackle with Oak improvements. Ideally
> we would have a unified approach.
> (I realize that property indexes come in various flavours like unique
> index or not, which makes the discussion more complex)
>
> my2c
> Michael
