Re: [DISCUSS] Manage millions of identities

Andrea Patricelli Fri, 19 Oct 2018 01:08:08 -0700

Hi,

On 16/10/2018 10:54, Francesco Chicchiriccò wrote:

Hi all,
I think it's time to discuss how we want to prepare for scenarios where the number of identities to manage (mostly users, by a vast majority) is considerably high, from 1 million upwards; the typical case being CIAM (Customer IAM).
In the IdM deliveries I've been involved in so far, scaling Apache Syncope up to hundreds of thousands of identities is not trivial, but doable: naturally, most of the optimization work has to be done at the DBMS level, as that is obviously the component under the most stress.
I think we can agree that, in such scenarios, the most critical data are the ones bound to the actual identities (hence no connectors, resources, tasks, reports or any other configuration): consider that with 1 million users and 10 attributes for each user, we have the following table sizing to deal with:
* SyncopeUser: 1M rows
* UPlainAttr: 10M rows
* UPlainAttrValue: 10M rows
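As a back-of-the-envelope check of the figures above, here is a minimal Python sketch. It assumes every attribute is single-valued (one UPlainAttrValue row per UPlainAttr row); the function name and parameters are purely illustrative and not part of Syncope.

```python
def eav_row_counts(users: int, attrs_per_user: int, values_per_attr: int = 1) -> dict:
    """Estimate row counts for the three identity-related tables.

    Assumption: each user has the same number of attributes and each
    attribute has the same number of values.
    """
    attr_rows = users * attrs_per_user
    return {
        "SyncopeUser": users,
        "UPlainAttr": attr_rows,
        "UPlainAttrValue": attr_rows * values_per_attr,
    }


sizing = eav_row_counts(users=1_000_000, attrs_per_user=10)
print(sizing)
```

With multi-valued attributes (values_per_attr greater than 1), only the UPlainAttrValue estimate grows.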
Moreover, the search views [1] are all of the same order of magnitude (although one can enable the Elasticsearch extension in such cases, to improve performance).
I think this is what we need to change in order to get better results.
So far, I have been able to think of a couple of possibilities:
1. Leverage the JSON column support provided by PostgreSQL [2], MySQL [3], SQL Server [4] and Oracle DB [5] to extend the current OpenJPA-based persistence layer
Pros:
* reduce the sizing problems by removing the need for the UPlainAttr and UPlainAttrValue tables, search views and joins
* limited implementation effort, as most of the current JPA layer can be retained
* keep enjoying the benefits of referential integrity and other constraints enforced by the DBMS (including UNIQUE)
Cons:
* each DBMS provides JSON support in its own fashion: implementation wouldn't be trivial (although we can make it incremental, and add support for one DBMS at a time)
* scaling capabilities and performance might be overrated, even though there seem to be very nice references, at least for PostgreSQL [6][7]
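To illustrate how option 1 shrinks the sizing problem, here is a minimal Python sketch that folds attribute rows into one JSON document per user, which is what a JSON column would hold. The row layout (user key, schema name, value) and all names are assumptions made for illustration; this is not the actual Syncope schema nor any DBMS-specific JSON syntax.

```python
import json
from collections import defaultdict

# Hypothetical attribute rows: (user_key, schema, value)
eav_rows = [
    ("u1", "email", "a@example.org"),
    ("u1", "surname", "Rossi"),
    ("u2", "email", "b@example.org"),
]


def to_json_documents(rows):
    """Collapse attribute rows into one JSON string per user."""
    docs = defaultdict(lambda: defaultdict(list))
    for user_key, schema, value in rows:
        # Values are kept in a list so multi-valued attributes still fit.
        docs[user_key][schema].append(value)
    return {key: json.dumps(attrs, sort_keys=True) for key, attrs in docs.items()}


json_column = to_json_documents(eav_rows)
for user_key, document in json_column.items():
    print(user_key, document)
```

Three attribute rows become two user rows here; at scale, the attribute tables and the joins against them would no longer be needed, at the price of DBMS-specific JSON queries.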
2. Implement a new persistence layer based on a different technology: I have done some experiments with Apache Cassandra [8] and the DataStax Java Driver [9]
Pros:
* natively built for scalability and high availability
* proven and widespread adoption
* Object Mapper [10] allows semi-transparent conversion between query results and domain objects, somewhat like JPA's EntityManager
Cons:
* relations are obviously not available, only custom types [11]: the persistence model would have to be redesigned to cope with this
* constraints are not available, most notably UNIQUE, which will require additional handling code
* implementation effort: the whole persistence layer would have to be redone, not only identity-related entities such as User, UPlainAttr, UPlainAttrValue...
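As an illustration of the "additional handling code" for UNIQUE, here is a minimal Python sketch of a claim-then-write pattern. It is only an in-memory stand-in: with Cassandra, one possible counterpart would be a dedicated table written with a conditional `INSERT ... IF NOT EXISTS`, but the class and method names below are hypothetical and nothing here talks to a real database.

```python
class UniqueValueRegistry:
    """In-memory stand-in for a 'claims' table keyed by the unique value."""

    def __init__(self):
        self._claims = {}

    def claim(self, value: str, owner: str) -> bool:
        # setdefault stores the owner only when the value is not yet claimed,
        # mimicking an insert-if-not-exists; it returns the stored owner.
        return self._claims.setdefault(value, owner) == owner


registry = UniqueValueRegistry()
first = registry.claim("rossi", "u1")   # value is free, claim succeeds
second = registry.claim("rossi", "u2")  # value already taken by u1
print(first, second)
```

The user row would be written only after a successful claim, and the claim released if that write fails; this is bookkeeping that a relational DBMS does for free.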
Besides the two above, there are of course other options in the NoSQL world (Neo4j, MongoDB, ...), but I am afraid they all present challenges similar to Cassandra's.
WDYT?

I would lean towards *solution 1*: since the relational and SQL paradigm is still widespread, I think it is necessary to support millions of entities on a relational database as well.

Nevertheless, I would also put some effort into (at least) some advanced spikes with the most widely used NoSQL technologies, like Apache Cassandra, MongoDB, Apache CouchDB(?), Neo4j. They could, maybe, be the best solution for larger environments. About relations and constraints: in my (very limited) experience with NoSQL technologies (mainly Elasticsearch), I found that these very nice features of the relational paradigm are often supported or enhanced because they are highly requested by users. I'm referring to [12] [13] [14] [15]. That said, I share your doubts about moving to a new (and less tested) NoSQL persistence layer.


[12] https://docs.mongodb.com/manual/applications/data-models-relationships/

[13] https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html

[14] https://docs.mongodb.com/manual/core/index-unique/

[15] https://medium.com/@mustwin/cassandra-from-a-relational-world-7bbdb0a9f1d

Regards.
[1] https://github.com/apache/syncope/blob/master/core/persistence-jpa/src/main/resources/views.xml#L50-L94
[2] https://www.postgresql.org/docs/10/static/functions-json.html
[3] https://dev.mysql.com/doc/refman/8.0/en/json.html
[4] https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
[5] https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246
[6] https://www.postgresql.eu/events/fosdem2018/sessions/session/1691/slides/63/High-Performance%20JSON_%20PostgreSQL%20Vs.%20MongoDB.pdf
[7] http://coussej.github.io/2016/01/14/Replacing-EAV-with-JSONB-in-PostgreSQL/
[8] http://cassandra.apache.org/
[9] https://github.com/datastax/java-driver
[10] https://docs.datastax.com/en/developer/java-driver/3.5/manual/object_mapper/
[11] http://cassandra.apache.org/doc/latest/cql/types.html?highlight=user%20defined%20types#user-defined-types

--
Dott. Andrea Patricelli
Tel. +39 3204524292

Engineer @ Tirasa S.r.l.
Viale Vittoria Colonna 97 - 65127 Pescara
Tel +39 0859116307 / FAX +39 0859111173
http://www.tirasa.net

Apache Syncope PMC Member
