Hi,
For TarMK, none of this is an issue as TarMK is all in memory on 1 JVM with
local disk. Scaling up by throwing RAM and IO at the problem is a viable
option, as far as it's safe/sensible to do so. But TarMK doesn't cluster,
and if it did cluster, this would probably be an issue.

I think, but could easily be wrong, that in the case of MongoDB all
modifications to indexes generated by a commit are persisted in a single
batch request (i.e. one MongoDB statement). The time taken to process
that request depends on the size of the request; large requests can take
seconds on large databases. It's not the distance between Oak and the
database that matters, since only one MongoDB statement is used, but the
processing time of that statement inside MongoDB. With MongoDB set up
correctly to not lose data, that statement must be written to a majority
of replicas before processing can continue, and MongoDB replication is
sequential.
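
As a rough illustration of why batch size, not round-trip distance, dominates, here is a toy latency model (my own sketch with made-up numbers, not Oak's or MongoDB's actual implementation): one batched statement whose processing time grows with the number of modified index documents, plus a sequential majority-replication wait.

```python
# Toy model of a single batched index-update statement.
# All numbers are illustrative assumptions, not measured values.

def commit_latency_ms(num_index_updates: int,
                      docs_per_ms: int = 20,
                      replica_latency_ms: float = 2.0,
                      replicas_for_majority: int = 2) -> float:
    """Latency of one batch request: processing scales with the number
    of modified index documents, then the write must reach a majority
    of replicas; replication is sequential, so those waits stack."""
    processing = num_index_updates / docs_per_ms
    replication = replicas_for_majority * replica_latency_ms
    return processing + replication

# A small commit is fast, a large one can take seconds:
print(commit_latency_ms(100))      # 9.0 ms
print(commit_latency_ms(100_000))  # 5004.0 ms
```

The point of the sketch: the fixed replication cost is paid once per commit, while the processing cost is paid per index document touched, so halving the number of index updates roughly halves large-commit latency.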

Also, IIRC, the root document is not persisted on every commit, but
synchronized periodically (once every second), similar to fsync on a disk.
So the indexes (in fact all Oak documents) are synchronous on the local Oak
instance, and synchronous on remote Oak instances only with a minimum data
latency of the root document sync interval (1s). IIUC the 1-second sync
period is a performance optimisation: the root document must be updated by
every commit, hence it is a global singleton in an Oak cluster, and already
hot as you point out in 3.
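
To make the local-vs-remote visibility distinction concrete, here is a minimal sketch of a periodic root sync (class and method names are mine, purely illustrative, not Oak's API): commits are visible locally at once, but are only published for remote instances in one batched root update per interval.

```python
import threading

# Illustrative sketch of a periodic "root document" sync, assuming a
# 1-second interval, analogous to fsync. Not Oak's actual code.

class RootSync:
    def __init__(self, interval_s: float = 1.0):
        self.interval_s = interval_s
        self.pending = []    # commits visible locally but not yet remotely
        self.published = []  # state visible to remote instances
        self.lock = threading.Lock()

    def commit(self, change: str) -> None:
        # Local reads see the change immediately (synchronous locally).
        with self.lock:
            self.pending.append(change)

    def sync_once(self) -> None:
        # Runs once per interval: folds all pending commits into a single
        # root-document update, so the hot root document is written once
        # per second instead of once per commit.
        with self.lock:
            self.published.extend(self.pending)
            self.pending.clear()

sync = RootSync()
sync.commit("add /content/a")
sync.commit("add /content/b")
print(sync.published)  # [] -- remote instances see nothing yet
sync.sync_once()
print(sync.published)  # ['add /content/a', 'add /content/b']
```

The design trade-off it illustrates: batching root updates caps write pressure on the single hot document at one update per interval, at the price of up to one interval of remote visibility lag.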

I have been involved on the periphery of OAK-4638 and OAK-4412. For me, the
main benefit is reducing the number of documents stored in the database.
While it is true that the number of documents doesn't matter at small
scale, every document is counted inside Oak and every document has an
impact on database performance. If around 66% of the documents contribute
nothing to repository content storage, the ultimate capacity limit of an
Oak repository is reduced by the same amount: two thirds. With many
applications built on top of Oak exploiting the deep content structure
that Oak encourages and makes so easy, this limit rapidly becomes a
reality. What limit? A limit at which one of the components ceases to
work. I don't know which one, or when, but it's there. A repository
containing 100M content items may need 1E10 documents due to both the
application implementation and synchronous indexing. Perhaps the
application should fix itself, but so should Oak.
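
The capacity argument above is simple arithmetic; here it is spelled out with a hypothetical ceiling (the real limit and which component hits it first are unknown, as said above):

```python
# Back-of-envelope: if ~2/3 of documents are index rather than content,
# only a third of whatever the hard limit is can hold content.
# The limit below is a made-up illustrative number.

total_docs_limit = 3_000_000_000        # hypothetical hard document ceiling
index_docs = total_docs_limit * 2 // 3  # ~66% spent on index documents
content_docs = total_docs_limit - index_docs

print(content_docs)  # 1000000000 -- content capacity is cut to a third
```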

Quite apart from all that, it is embarrassingly wasteful to use Oak
documents in this way for non-TarMK repos, rather like implementing Lucene
in SQL.

To recap.
Addressing 1 and 2 is a requirement to reduce waste, increase the
performance of update operations and increase data scalability.
3 is not an issue: the pressure is already there without any indexes.
Every write has to update the root document for that update to become
visible, by design.

I am not a core Oak developer, just an observer, so if I got anything
wrong, please someone correct me and I will learn from the experience.

Best Regards
Ian

On 5 August 2016 at 18:04, Michael Marth <[email protected]> wrote:

> Hi,
>
> I have noticed OAK-4638 and OAK-4412 – which both deal with particular
> problematic aspects of property indexes. I realise that both issues deal
> with slightly different problems and hence come to different suggested
> solutions.
> But still I felt it would be good to take a holistic view on the different
> problems with property indexes. Maybe there is a unified approach we can
> take.
>
> To my knowledge there are 3 areas where property indexes are problematic
> or not ideal:
>
> 1. Number of nodes: Property indexes can create a large number of nodes.
> For properties that are very common the number of index nodes can be almost
> as large as the number of the content nodes. A large number of nodes is not
> necessarily a problem in itself, but if the underlying persistence is e.g.
> MongoDB then those index nodes (i.e. MongoDB documents) cause pressure on
> MongoDB’s mmap architecture which in turn affects reading content nodes.
>
> 2. Write performance: when the persistence (i.e. MongoDB) and Oak are “far
> away from each other” (i.e. high network latency or low throughput) then
> synchronous property indexes affect the write throughput as they may cause
> the payload to double in size.
>
> 3. I have no data on this one – but think it might be a topic: property
> index updates usually cause commits to have / as the commit root. This
> results on pressure on the root document.
>
> Please correct me if I got anything wrong  or inaccurate in the above.
>
> My point is, however, that at the very least we should have clarity which
> one go the items above we intend to tackle with Oak improvements. Ideally
> we would have a unified approach.
> (I realize that property indexes come in various flavours like unique
> index or not, which makes the discussion more complex)
>
> my2c
> Michael
>
