Re: [DISCUSS] Manage millions of identities

Francesco Chicchiriccò Mon, 29 Oct 2018 03:27:34 -0700

Hi Guido,

On 24/10/18 20:51, Guido Wimmel wrote:

Hi,
On 16.10.18 at 10:54, Francesco Chicchiriccò wrote:
Hi all,
I think it's time to discuss how we want to prepare for scenarios where the number of identities to manage (users, for the vast majority) is considerably high - from 1 million upwards; the typical case being CIAM (Customer IAM).
In the IdM deliveries I've been involved in so far, scaling Apache Syncope up to hundreds of thousands of identities is not trivial, but doable: naturally, most of the optimization work has to be done at the DBMS level, as that is obviously the component under the most stress.
I think we can agree that, in such scenarios, the most critical data are the ones bound to the actual identities (hence no connectors, resources, tasks, reports or any other configuration): consider that with 1 million users and 10 attributes for each user, we have the following table sizing to deal with:
* SyncopeUser: 1M rows
* UPlainAttr: 10M rows
* UPlainAttrValue: 10M rows
Moreover, the search views [1] are all of the same order of magnitude (although one can enable the Elasticsearch extension in such cases, to improve performance).
I think this is what we need to change in order to get better results.
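The table sizing above is plain multiplication. As a minimal sketch (the function name is hypothetical, and it assumes every attribute holds exactly one value, which is what makes UPlainAttr and UPlainAttrValue the same size):

```python
def eav_row_counts(users: int, attrs_per_user: int, values_per_attr: int = 1) -> dict:
    """Estimate row counts for the three identity tables listed above.

    Assumption: every user has the same number of attributes, and every
    attribute has `values_per_attr` values (1 by default).
    """
    attr_rows = users * attrs_per_user
    return {
        "SyncopeUser": users,
        "UPlainAttr": attr_rows,
        "UPlainAttrValue": attr_rows * values_per_attr,
    }


counts = eav_row_counts(users=1_000_000, attrs_per_user=10)
total = sum(counts.values())  # rows across the three tables, search views excluded
```

With multivalue attributes, UPlainAttrValue grows by a further factor, so the figures above are a lower bound for that table.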
So far, I have been able to think of a couple of possibilities:
1. Leverage the JSON column support provided by PostgreSQL [2], MySQL [3], SQL Server [4] and Oracle DB [5] to extend the current OpenJPA-based persistence layer
Pros:
* reduce the sizing problems by removing the need for UPlainAttr and UPlainAttrValue tables, search views and joins
* limited implementation effort, as most of the current JPA layer can be retained
* keep enjoying the benefits of referential integrity and other constraints enforced by DBMS (including UNIQUE)
Cons:
* each DBMS provides JSON support in its own fashion: implementation wouldn't be trivial (while we can make it incremental, and add support for one DBMS at a time)
* scaling capabilities and performance might be overrated - even though there seem to be very nice references, at least for PostgreSQL [6][7]
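To make option 1 concrete, here is a small sketch of how one-row-per-value attribute data could be folded into a single JSON document per user. The row layout, schema names and sample values are all hypothetical, and this is plain Python rather than anything from the Syncope code base:

```python
import json
from collections import defaultdict

# Hypothetical attribute rows: (user_key, schema_name, value), one row per value
eav_rows = [
    ("u1", "email", "rossini@example.org"),
    ("u1", "surname", "Rossini"),
    ("u2", "email", "verdi@example.org"),
    ("u2", "loginDate", "2018-10-29"),
    ("u2", "loginDate", "2018-10-30"),
]


def to_json_documents(rows):
    """Fold attribute rows into one JSON document per user."""
    docs = defaultdict(lambda: defaultdict(list))
    for user_key, schema, value in rows:
        docs[user_key][schema].append(value)
    # One serialized document per user: this is what a JSON column would hold
    return {user: json.dumps(attrs, sort_keys=True) for user, attrs in docs.items()}


def find_users(documents, schema, value):
    """Search without joins: look inside each user's own document."""
    return sorted(
        user for user, doc in documents.items()
        if value in json.loads(doc).get(schema, [])
    )


documents = to_json_documents(eav_rows)
matches = find_users(documents, "email", "verdi@example.org")
```

Five attribute rows become two documents, one per user. On the database side the lookup in find_users would be pushed down to the DBMS's own JSON operators (in PostgreSQL, for instance, the JSONB containment operator `@>`), which is where the per-DBMS differences listed under Cons come in.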
2. Implement a new persistence layer based on a different technology - I have done some experiments with Apache Cassandra [8] and the Datastax Java Driver [9]
Pros:
* built natively for scalability and high availability
* proven and widespread adoption
* Object Mapper [10] allows semi-transparent conversion between query results and domain objects, somewhat similar to JPA's EntityManager
Cons:
* relations are obviously not available, only custom types [11]: the persistence model shall be redesigned to cope with such situation
* constraints are not available - more specifically UNIQUE, which will require additional code handling
* implementation effort: all the persistence layer shall be redone, not only identity-related entities such as User, UPlainAttr, UPlainAttrValue...
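As an illustration of the "additional code handling" that a missing UNIQUE constraint implies, one common pattern is a separate lookup keyed by the unique value, written with "insert if not exists" semantics (Cassandra offers this through `INSERT ... IF NOT EXISTS`). The sketch below only simulates that idea in memory; the class and method names are hypothetical and no Cassandra driver is involved:

```python
class UniqueIndex:
    """In-memory stand-in for a lookup table keyed by the unique value."""

    def __init__(self):
        self._owners = {}

    def claim(self, value: str, owner: str) -> bool:
        # setdefault stores `owner` only when `value` is still unclaimed,
        # mirroring "insert if not exists"; the claim succeeds if we own it
        return self._owners.setdefault(value, owner) == owner


usernames = UniqueIndex()
first = usernames.claim("rossini", "user-1")   # value is free: claim succeeds
second = usernames.claim("rossini", "user-2")  # already owned: claim is rejected
```

The application would have to claim the value before writing the user itself, and release it on delete or rename: exactly the kind of logic that the DBMS handles for free in option 1.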
Besides the two above, there are of course other options in the NoSQL world (Neo4j, MongoDB, ...), but I am afraid they all present challenges similar to Cassandra's.
WDYT?
Regards.
I'd expect it should be possible to make the current relational model work for hundreds of thousands - millions of identities. This should not be too much data for enterprise-grade databases like Oracle or PostgreSQL.
We have a deployment with approx. a million identities (however, we mostly use basic features of Syncope, and had to do some tweaking on the search queries).
Maybe one could document the required optimizations / partially integrate them into Syncope? (possibly additional indexes / optimized queries / ...)

First of all, this is an interesting confirmation: (1) the current model can handle (in your experience) "hundreds of thousands - millions of identities" and (2) you have a deployment with approx. a million identities.

Documenting the required optimizations, or integrating something into Syncope, is definitely worthwhile: could you share these somehow, even as a description in an improvement issue on JIRA?

For even larger numbers, I'd find both suggestions interesting. I think one would have to do spikes in order to evaluate the performance gain for large numbers of identities for different functionalities. Maybe one could even support both, so that users could choose according to their requirements / risk tolerance.

I am currently in the middle of a spike which leverages PostgreSQL's JSONB data type to replace *PlainAttr / *PlainAttrValue, and I am around 90% complete feature-wise. After that, I would also like to add a new module to the sources, with the purpose of running performance tests with JMeter support: in this way we will be able to effectively check the numbers of the available implementations.

Anyway, given the way the code is structured from Syncope 2.0 onwards, we are simply providing different implementations of the interfaces in syncope-core-persistence-api:


* syncope-core-persistence-jpa is the current implementation

* syncope-core-persistence-jpa-pgjsonb could be the name of the one I am working on (which is actually an extension of the former)

* syncope-core-persistence-jpa-mysqljson could be the same approach as above, but for MySQL's JSON data type

* syncope-core-persistence-cassandra, syncope-core-persistence-mongodb, syncope-core-persistence-whatever could be provided at any point in time

This is to confirm that, in response to your suggestion and to concerns raised in other e-mails of this thread, there is always room to provide new implementations to support virtually any persistence technology; and that users will be free to choose one based on their needs by simply selecting the correct Maven dependency to include in their own projects.
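The idea of one API with interchangeable implementations can be sketched as follows. This is a deliberately simplified illustration in Python, not Syncope code: the UserStore interface, both classes and the lookup table are hypothetical, and only the two module names used as keys come from the list above.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserStore(ABC):
    """Hypothetical, heavily simplified persistence interface."""

    @abstractmethod
    def save(self, key: str, attrs: Dict[str, str]) -> None: ...

    @abstractmethod
    def find(self, key: str) -> Optional[Dict[str, str]]: ...


class EavUserStore(UserStore):
    """Stores one row per attribute, as in a separate attribute table."""

    def __init__(self) -> None:
        self._rows = []  # (user_key, schema, value)

    def save(self, key, attrs):
        # Drop previous rows for this user, then write one row per attribute
        self._rows = [row for row in self._rows if row[0] != key]
        self._rows.extend((key, schema, value) for schema, value in attrs.items())

    def find(self, key):
        found = {schema: value for k, schema, value in self._rows if k == key}
        return found or None


class JsonUserStore(UserStore):
    """Stores one JSON document per user, as in a JSON column."""

    def __init__(self) -> None:
        self._docs = {}

    def save(self, key, attrs):
        self._docs[key] = json.dumps(attrs)

    def find(self, key):
        doc = self._docs.get(key)
        return json.loads(doc) if doc is not None else None


# Stand-in for "selecting the correct Maven dependency"
IMPLEMENTATIONS = {
    "syncope-core-persistence-jpa": EavUserStore,
    "syncope-core-persistence-jpa-pgjsonb": JsonUserStore,
}


def create_store(module: str) -> UserStore:
    return IMPLEMENTATIONS[module]()
```

Calling code depends only on UserStore, so switching from one storage layout to the other is a matter of which implementation gets selected, not of changing the callers.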


Regards.

[1] https://github.com/apache/syncope/blob/master/core/persistence-jpa/src/main/resources/views.xml#L50-L94
[2] https://www.postgresql.org/docs/10/static/functions-json.html
[3] https://dev.mysql.com/doc/refman/8.0/en/json.html
[4] https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
[5] https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246
[6] https://www.postgresql.eu/events/fosdem2018/sessions/session/1691/slides/63/High-Performance%20JSON_%20PostgreSQL%20Vs.%20MongoDB.pdf
[7] http://coussej.github.io/2016/01/14/Replacing-EAV-with-JSONB-in-PostgreSQL/
[8] http://cassandra.apache.org/
[9] https://github.com/datastax/java-driver
[10] https://docs.datastax.com/en/developer/java-driver/3.5/manual/object_mapper/
[11] http://cassandra.apache.org/doc/latest/cql/types.html?highlight=user%20defined%20types#user-defined-types


--
Francesco Chicchiriccò

Tirasa - Open Source Excellence
http://www.tirasa.net/

Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/
