I'm certainly intrigued by Spark, I just don't know much about it. Hazelcast is attractive because it lets me think in comfortable abstractions like the Java Collections Framework. But maybe that is limiting... time to take a look at:
https://spark.apache.org/docs/latest/graphx-programming-guide.html

ajs6f

> On Jan 23, 2017, at 1:08 PM, Andy Seaborne <[email protected]> wrote:
>
> Hazelcast offers scan+filter (like the original MySQL!), which can be a
> great help, pushing filters over to the storage. Combined with sort, then
> merge joins in the client (bonus: with skipping sections of index) would
> be interesting.
>
> If going this route, what about Apache Spark? More powerful operators.
>
> Andy
>
> On 22/01/17 16:54, A. Soroka wrote:
>> Another idea with which I have been playing is to try to scale
>> horizontally, but only in memory:
>>
>> I could take the one-writable-graph-per-transaction dataset code I've
>> written and replace the ConcurrentHashMap that currently holds the graphs
>> with a Hazelcast [1] distributed map. Naive union-graph performance would
>> be awful, but if the workload chiefly addressed individual graphs and the
>> graphs were large enough, the parallelism might be really worthwhile.
>>
>> Hazelcast offers per-entry locks [2], so those could be used instead of
>> the lockable graphs I'm using now. It also offers optimistic locking via
>> Map.replace(key, oldValue, newValue), so I could even imagine offering a
>> switch between a "strict mode", in which locks are used, and a
>> "read-heavy mode", in which it is assumed that the application will
>> prevent contention on individual graphs, but an update could fail if that
>> isn't so.
>>
>> Hazelcast also offers some support for remote computation at the entries
>> of its distributed maps [3], so it might be possible to distribute
>> findInSpecificNamedGraph() executions (maybe eventually some of the ARQ
>> execution as well?). It also supports a kind of query language [4] that
>> might be used to gain more efficiency, perhaps by using Bloom filters for
>> graphs, as Claude has discussed before.
>>
>> All just food for thought, for now.
>>
>> ---
>> A. Soroka
>>
>> [1] https://hazelcast.org/
>> [2] http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#locking-maps
>> [3] http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#entry-processor
>> [4] http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#distributed-query
>>
>>> On Jan 20, 2017, at 8:38 PM, De Gyves <[email protected]> wrote:
>>>
>>> I'd like to participate in the storage portion of Jena, maybe TDB.
>>> Having worked for many years developing with RDBMSs, I'd like to explore
>>> new horizons of persistence, and graph-based ones seem very promising
>>> for my next projects, so I'd like to use SPARQL and RDF with Jena/TDB
>>> and see how far I can go.
>>>
>>> I've spent the last two days exploring the jena-dev mail archives from
>>> August 2015 to January of this year and found some interesting threads,
>>> such as the development of TDB2, tests with 100M of BSBM data, a
>>> question about horizontal scaling, and the fact that anything
>>> implementing DatasetGraph can be used as a triple store. My reading of
>>> the Jena docs so far includes: SPARQL, the RDF API, Txn, and TDB
>>> transactions.
>>>
>>> What I am looking for is a clear perspective on some requirements that
>>> are taken for granted on a traditional RDBMS:
>>>
>>> 1. Atomicity, consistency, isolation, and durability of a transaction
>>> on a single TDB database: apart from the limitations noted in the
>>> documentation of TDB Transactions and Txn, are there current issues, or
>>> edge cases detected but not yet covered?
>>> 2. Are there currently available strategies for a horizontally scaled
>>> TDB database?
>>> 3. What do you think of trying to implement horizontal scalability with
>>> DatasetGraph or something else, backed by, say, CockroachDB, VoltDB,
>>> PostgreSQL, etc.?
>>> 4. If there are stress tests available (e.g. I read about a 100M BSBM
>>> test), are they included in the source, or may I have a copy? I'd like
>>> to see what the limits of the current TDB are, and maybe of TDB2:
>>> maximum size on disk of a dataset; maximum number of nodes in a
>>> dataset, or of models or graphs in a dataset; the limiting behavior of
>>> a typical read/write transaction vs. the number of nodes, datasets,
>>> etcetera. Or some guidelines, so I can start to write this stress code.
>>> Would it be useful to you as well?
>>>
>>> --
>>> Víctor-Polo de Gyvés Montero.
>>> +52 (55) 4926 9478 (Cellphone in Mexico City)
>>> Address: Daniel Delgadillo 7 6A, Agricultura neighborhood, Miguel
>>> Hidalgo borough
>>> ZIP: 11360, México City.
>>>
>>> http://degyves.googlepages.com
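A minimal sketch of the "read-heavy mode" Soroka describes, using java.util.concurrent.ConcurrentHashMap to stand in for a Hazelcast IMap (both implement ConcurrentMap, so replace(key, oldValue, newValue) is available on either); the class name and the Set<String>-of-triples graph representation are invented for illustration, not actual Jena types:

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.UnaryOperator;

class OptimisticGraphStore {
    // ConcurrentHashMap stands in here for a Hazelcast distributed IMap;
    // with Hazelcast, the same ConcurrentMap calls go over the wire.
    private final ConcurrentMap<String, Set<String>> graphs = new ConcurrentHashMap<>();

    void putGraph(String name, Set<String> graph) {
        graphs.put(name, graph);
    }

    /**
     * "Read-heavy mode": apply an update optimistically. Read the current
     * graph, compute a new version on a copy, and install it only if no
     * other writer intervened between get() and replace(); otherwise retry.
     * This pays off only when contention on individual graphs is rare.
     * Assumes the named graph already exists.
     */
    Set<String> updateGraph(String name, UnaryOperator<Set<String>> update) {
        while (true) {
            Set<String> oldGraph = graphs.get(name);
            Set<String> newGraph = update.apply(new TreeSet<>(oldGraph));
            if (graphs.replace(name, oldGraph, newGraph)) {
                return newGraph;
            }
            // replace() returned false: another writer changed the entry;
            // loop and try the update again against the fresh value.
        }
    }
}
```

The "strict mode" alternative would take a per-entry lock around the read-modify-write instead of retrying.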

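Andy's sort-then-merge-join suggestion, in miniature: given two runs of (key, value) pairs already sorted on the join key (e.g. produced server-side by scan+filter+sort), a single client-side pass joins them. The String[] pairs are placeholders for bindings, and the "skipping sections of index" bonus would replace the linear advances below with index seeks:

```java
import java.util.ArrayList;
import java.util.List;

class MergeJoin {
    /**
     * Join two lists of (key, value) pairs, each sorted by key, in one pass.
     * Emits one (key, leftValue, rightValue) row per matching pair of
     * entries, including the cross product when a key repeats on both sides.
     */
    static List<String[]> join(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;            // left key too small: advance left
            } else if (cmp > 0) {
                j++;            // right key too small: advance right
            } else {
                // Equal keys: emit the cross product of the matching runs.
                String key = left.get(i)[0];
                int jStart = j;
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (j = jStart; j < right.size() && right.get(j)[0].equals(key); j++) {
                        out.add(new String[] { key, left.get(i)[1], right.get(j)[1] });
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```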