> Probably a lot of codesign is possible if the index type is fixed,
> leading to simpler and better-performing code, but it would be nice to
> have a pluggable approach so that you can pick the structures that
> work for your workload.

This is certainly very true. I am trying to develop some good
abstractions to this end. So far the only sticking point is
DatasetGraph::listGraphNodes. For the moment, I have added a
listGraphNodes method to the index abstraction (alongside a find method
that returns Iterator<Quad>, to support DatasetGraph::find and
::findNG), because a map-based approach gives O(1) access to the start
of iteration, and I do want to be able to take advantage of that
performance. But I will continue to look for a better approach.
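For concreteness, here is the rough shape of the abstraction I am
experimenting with. This is a sketch only -- none of these names or
signatures are settled, and nothing below is committed code:

import java.util.Iterator;

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

/**
 * Provisional shape for a pluggable index: implementations might be
 * map-based, bloom-filter-based, etc.
 */
interface QuadIndex {

    // Mutators keep the index in step with the dataset.
    void add(Quad q);
    void delete(Quad q);

    /**
     * All quads matching a pattern; any slot may be Node.ANY.
     * Backs DatasetGraph::find and ::findNG.
     */
    Iterator<Quad> find(Node g, Node s, Node p, Node o);

    /**
     * The sticking point: a map-based implementation can answer this
     * in O(1)-into-iteration, so for now it lives on the abstraction
     * rather than being synthesized from find().
     */
    Iterator<Node> listGraphNodes();
}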
---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 11:20 AM, Paul Houle <[email protected]> wrote:

> The thing about bloom filters is that they are (i) highly dense,
> because they are bit-packed, and (ii) accessed in sequential order, so
> that modern processors can fetch ahead and hide RAM latency.
>
> Most kinds of indexes, on the other hand, involve pointer chasing over
> an area that does not fit in the cache, so your brand-new processor is
> performing like a Pentium II. Even if the O() part is better for some
> other algorithm, the constant in front is awesome for the bloom
> filter.
>
> Probably a lot of codesign is possible if the index type is fixed,
> leading to simpler and better-performing code, but it would be nice to
> have a pluggable approach so that you can pick the structures that
> work for your workload. Also, from a research perspective, we could
> get at real questions.
>
> There are a lot of patents from around 2000 on systems that profile
> the queries and automatically build indexes. For instance,
> Salesforce.com has a patent, and they also have an implementation
> where the "master copy" of data is stored in an Oracle table that
> looks a lot like a triple store, and then it automatically creates
> materialized views and functional indexes in Oracle to support client
> workloads.
>
> (I remember writing a query against one of the larger Salesforce orgs
> and being frustrated when it timed out, only to run the same query two
> minutes later and have it work perfectly.)
>
> Patents do expire, and it won't be too long before this technology
> could make it into open-source tools.
>
> On Mon, Aug 31, 2015 at 10:04 AM, Claude Warren <[email protected]> wrote:
>
>> Step 3 is relatively expensive if there are a lot of quads with the G
>> and S values (e.g. *,G,*,S) but few with G,S,*,*.
>>
>> Step 3 is about removing the false positives from the bloom filter.
>> It does not require an index; it requires checking the values to
>> ensure a match.
>>
>> There is a time-for-space trade-off here, but the time cost is not as
>> much as one would think. Keep in mind that a bloom filter check is
>> very fast. On the other hand, the traversal of a B-tree or other
>> index is much more computationally complex.
>>
>> For example, suppose you index bloom filters under a two-segment
>> compound key, and you need to check every key greater than the one
>> you are looking for. Even then, it is often faster to do the linear
>> search through the data, even when dealing with 1 million entries.
>> There are special cases where the index is faster, but for the most
>> part the linear search is faster.
>>
>> Claude
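To make the cost model here concrete for myself, I sketched the
scan-and-verify loop Claude describes. Everything below -- the hashing,
the 128-bit filter size, the Strings standing in for Nodes -- is
invented for illustration, not taken from his POC:

import java.util.Arrays;
import java.util.List;

/** Toy model: a quad is a String[4]; its filter is a 128-bit mask. */
final class BloomScan {

    static final int BITS = 128;  // 16 bytes/quad, near the ~15 cited below
    static final int HASHES = 3;  // bit positions set per term

    /** Build a filter over the non-wildcard terms (null = wildcard). */
    static long[] filter(String... terms) {
        long[] bits = new long[BITS / 64];
        for (String t : terms) {
            if (t == null) continue;          // wildcards contribute no bits
            for (int i = 0; i < HASHES; i++) {
                int bit = Math.floorMod(t.hashCode() + i * 0x9E3779B9, BITS);
                bits[bit >> 6] |= 1L << (bit & 63);
            }
        }
        return bits;
    }

    /** Cheap check: every bit of the pattern filter set in the quad's. */
    static boolean mightMatch(long[] quad, long[] pattern) {
        for (int i = 0; i < quad.length; i++)
            if ((quad[i] & pattern[i]) != pattern[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        List<String[]> quads = Arrays.asList(
                new String[] { "g1", "s1", "p1", "o1" },
                new String[] { "g1", "s1", "p1", "o2" },
                new String[] { "g2", "s2", "p1", "o1" });

        long[] pattern = filter("g1", "s1", null, null);  // find(G,S,*,*)

        for (String[] q : quads) {
            if (!mightMatch(filter(q[0], q[1], q[2], q[3]), pattern))
                continue;                      // rejected by two word ANDs
            if ("g1".equals(q[0]) && "s1".equals(q[1]))  // step 3: verify
                System.out.println(Arrays.toString(q));
        }
    }
}

In a real implementation the per-quad filters would be precomputed and
stored alongside the quads, so the scan really is just sequential
word-wise AND operations over packed bits.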
>> On Mon, Aug 31, 2015 at 1:53 PM, [email protected] <[email protected]> wrote:
>>
>>> I'm still a bit confused as to why you don't regard step 3 as being
>>> potentially very expensive. In order to verify a match, we will have
>>> to examine an "exact" index, and that (as Andy remarked) is likely
>>> to require traversal, or else we throw away all the space gains.
>>>
>>> Is this technique a way to pay a lot of time for a lot of space
>>> savings? Perhaps it is appropriate for an alternative implementation
>>> for very large datasets?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>> On Aug 31, 2015, at 6:48 AM, Claude Warren <[email protected]> wrote:
>>>
>>>> To do find(G,S,*,*) with bloom filters and return an iterator, you:
>>>>
>>>> 1. construct a bloom filter with G and S
>>>> 2. scan the list of quads, checking for matches
>>>> 3. for each result that matches, verify that it has G and S (I have
>>>> done this with an extended iterator in Jena)
>>>>
>>>> The result is an iterator that returns all (G,S,*,*) quads.
>>>>
>>>> Similar tests can be performed for any pattern -- the same code is
>>>> used.
>>>>
>>>> Step 2 is the expensive one. But the bloom filter check is so
>>>> efficient that it becomes very difficult to perform search
>>>> operations in less time than it takes to scan the list.
>>>>
>>>> Claude
>>>>
>>>> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>>> On 29/08/15 14:55, Claude Warren wrote:
>>>>>
>>>>>> Something I have been thinking about....
>>>>>>
>>>>>> You could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a
>>>>>> single bloom filter implementation. It means a 2-step process to
>>>>>> find matches, but it might be fast enough and reduce the overhead
>>>>>> significantly.
>>>>>>
>>>>>> I did an in-memory and a relational-DB-based version recently,
>>>>>> but it was just a quick POC.
>>>>>>
>>>>>> Claude
>>>>>
>>>>> So we're talking about in-memory, where the items are Java
>>>>> classes. A quad is 2 slots of Java overhead + 4 slots for the G,
>>>>> S, P, O pointers. That's 48 bytes if the heap is >32G and 24 bytes
>>>>> otherwise (compressed pointers or 32 bit).
>>>>>
>>>>> For storage, the key test is "contains", to maintain the "set of"
>>>>> semantics. Something to stop index traversal for each insert would
>>>>> be great, but the quad is still stored, and storing it 1 time
>>>>> rather than up to 6 would be good. (Note that most data is unique
>>>>> quads.)
>>>>>
>>>>> The important retrieval operation is find(G,S,P,O), where any of
>>>>> those can be a wildcard, returning (ideally as a stream) all quads
>>>>> matching the prefix. The multiple indexes exist to support
>>>>> find-by-prefix.
>>>>>
>>>>> How would that work for, say, find(G,S,*,*) with bloom filters and
>>>>> 1B quads? How does the code go from returning G,S,P1,O1 to the
>>>>> next G,S,P1,O2 without trying every value for the O slot?
>>>>>
>>>>> For a hash-map-based hierarchical index G->S->P->O, it's O(1) to
>>>>> find the start of the scan, then datastructure iteration. A
>>>>> hash-based index is not necessarily the best choice [*], but it's
>>>>> a baseline to discuss.
>>>>>
>>>>> And in memory, will a bloom-filter-based system be faster? Because
>>>>> of false positives, isn't a definitive index still needed? If one
>>>>> is kept, not 6, there could be great space gains, but every quad
>>>>> returned costs a top-to-bottom traversal of that index (which is
>>>>> now not a range index).
>>>>>
>>>>> The design should work for 1+ billion in-memory quads -- that's
>>>>> the way the world is going.
>>>>>
>>>>>> So each quad is reduced to a single bloom filter comprising 4
>>>>>> items (15 bytes).
>>>>>>
>>>>>
>>>>> Andy
>>>>>
>>>>> [*] Even in memory, it might be worth allocating internal ids and
>>>>> working in longs, like a disk-based system, because it is more
>>>>> compact -- naive hashmaps take a lot of space when storing small
>>>>> items like quads. Tradeoffs, tradeoffs, ...
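Andy's hash-map baseline is easy to picture. A minimal illustration,
with Strings standing in for Nodes, only the G,S prefix case shown, and
every name here invented for the example:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** A hierarchical hash index G -> S -> P -> {O}. */
final class NestedIndex {

    final Map<String, Map<String, Map<String, Set<String>>>> gspo =
            new HashMap<>();

    void add(String g, String s, String p, String o) {
        gspo.computeIfAbsent(g, k -> new HashMap<>())
            .computeIfAbsent(s, k -> new HashMap<>())
            .computeIfAbsent(p, k -> new HashSet<>())
            .add(o);                  // the Set supplies the "contains" test
    }

    /** find(G,S,*,*): two O(1) hops to the start, then pure iteration. */
    void findGS(String g, String s) {
        gspo.getOrDefault(g, Collections.emptyMap())
            .getOrDefault(s, Collections.emptyMap())
            .forEach((p, os) -> os.forEach(
                o -> System.out.println(g + " " + s + " " + p + " " + o)));
    }

    public static void main(String[] args) {
        NestedIndex index = new NestedIndex();
        index.add("g1", "s1", "p1", "o1");
        index.add("g1", "s1", "p1", "o2");
        index.add("g2", "s1", "p1", "o1");
        index.findGS("g1", "s1");     // prints exactly the two g1/s1 quads
    }
}

This answers Andy's iteration question for the map case: moving from
G,S,P1,O1 to G,S,P1,O2 is just advancing an iterator, with no per-quad
checks and no false positives. Each wildcard shape wants a different
nesting order, hence the six indexes.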
>>>>>
>>>>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <[email protected]> wrote:
>>>>>>
>>>>>>> Hey, folks--
>>>>>>>
>>>>>>> There hasn't been too much feedback on my proposal for a
>>>>>>> journaling DatasetGraph:
>>>>>>>
>>>>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>>>>
>>>>>>> which was and is to be a step towards JENA-624: Develop a new
>>>>>>> in-memory RDF Dataset implementation. So I'm moving on to look
>>>>>>> at the real problem: an in-memory DatasetGraph with high
>>>>>>> concurrency, for use with modern hardware running many, many
>>>>>>> threads in large core memory.
>>>>>>>
>>>>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>>>>> design decisions past the list to get criticism/advice/horrified
>>>>>>> warnings/whatever needs to be said.
>>>>>>>
>>>>>>> 1) All-transactional operation, i.e. no non-transactional
>>>>>>> operation. This is obviously a great thing for simplifying my
>>>>>>> work, but I hope it won't be out of line with the expected uses
>>>>>>> for this stuff.
>>>>>>>
>>>>>>> 2) Six covering indexes, in the forms GSPO, GOPS, SPOG, OSGP,
>>>>>>> PGSO, OPSG. I figure to play to the strength of in-core-memory
>>>>>>> operation: raw speed. But obviously this is going to cost space.
>>>>>>>
>>>>>>> 3) At least for now, all commits succeed.
>>>>>>>
>>>>>>> 4) The use of persistent datastructures to avoid complex and
>>>>>>> error-prone fine-grained locking regimes. I'm using
>>>>>>> http://pcollections.org/ for now, but I am in no way committed
>>>>>>> to it, nor do I claim to have thoroughly vetted it. It's simple
>>>>>>> but enough to get started, and that's all I need to bring the
>>>>>>> real design questions into focus.
>>>>>>>
>>>>>>> 5) Snapshot isolation. Transactions do not see commits that
>>>>>>> occur during their lifetimes. Each works entirely from the state
>>>>>>> of the DatasetGraph at the start of its life.
>>>>>>>
>>>>>>> 6) At most one transaction per thread, for now. Transactions are
>>>>>>> not thread-safe. These are simplifying assumptions that could be
>>>>>>> relaxed later.
>>>>>>>
>>>>>>> My current design operates as follows:
>>>>>>>
>>>>>>> At the start of a transaction, a fresh in-transaction reference
>>>>>>> is taken atomically from the AtomicReference that points to the
>>>>>>> index block. As operations are performed in the transaction,
>>>>>>> that in-transaction reference is progressed (in the sense in
>>>>>>> which any persistent datastructure is progressed) while the
>>>>>>> operations are recorded. Upon an abort, the in-transaction
>>>>>>> reference and the record are just thrown away. Upon a commit,
>>>>>>> the in-transaction reference is thrown away and the operation
>>>>>>> record is re-run against the main reference (the one that was
>>>>>>> copied at the beginning of the transaction). That re-run happens
>>>>>>> inside an atomic update (hence the use of AtomicReference). This
>>>>>>> all should avoid the need for explicit locking in Jena and
>>>>>>> should confine any blocking against the indexes to the actual
>>>>>>> duration of a commit.
>>>>>>>
>>>>>>> What do you guys think?
>>>>>>>
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
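Since that design keeps coming up in this thread, here is a
stripped-down model of the commit scheme described above. Strings stand
in for quads, deletes are elided, and only pcollections' PSet is
assumed; it is an illustration of the idea, not the actual code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

import org.pcollections.HashTreePSet;
import org.pcollections.PSet;

final class SnapshotStore {

    // The "main reference" to the persistent (hence immutable) state.
    private final AtomicReference<PSet<String>> master =
            new AtomicReference<>(HashTreePSet.empty());

    final class Transaction {
        // Snapshot isolation: reads come from the state at txn start.
        private PSet<String> view = master.get();
        private final List<String> record = new ArrayList<>();

        void add(String quad) {
            view = view.plus(quad);  // progresses the persistent structure
            record.add(quad);        // ...while the operation is recorded
        }

        boolean contains(String quad) {
            return view.contains(quad);
        }

        void commit() {
            // Re-run the record against the main reference, atomically.
            // updateAndGet may retry the replay if another commit races
            // in, confining blocking to the duration of the commit.
            master.updateAndGet(current -> {
                PSet<String> next = current;
                for (String quad : record)
                    next = next.plus(quad);
                return next;
            });
        }

        void abort() {
            // Just throw the in-transaction reference and record away.
            view = null;
            record.clear();
        }
    }

    Transaction begin() {
        return new Transaction();
    }
}

One consequence worth noting: because the replay is a pure function of
the record, retrying it under contention is safe, which is what lets
"all commits succeed" (point 3) without explicit locking.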
>>>>
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> --
> Paul Houle
>
> *Applying Schemas for Natural Language Processing, Distributed
> Systems, Classification and Text Mining and Data Lakes*
>
> (607) 539 6254 paul.houle on Skype [email protected]
>
> :BaseKB -- Query Freebase Data With SPARQL
> http://basekb.com/gold/
>
> Legal Entity Identifier Lookup
> https://legalentityidentifier.info/lei/lookup/
>
> Join our Data Lakes group on LinkedIn
> https://www.linkedin.com/grp/home?gid=8267275
