Hi,

I guess you're talking about Amazon Glacier. Did you know about "Expedited retrievals", by the way? https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/ - it looks like there's more than just "slow" + "fast".
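Per that announcement, Glacier offers three retrieval tiers (Expedited, Standard, Bulk) with different latency/cost trade-offs. A minimal sketch of picking the cheapest tier that still meets a retrieval deadline; the latency ceilings follow the announced ranges, but the cheapest-first ordering here is illustrative, not real pricing:

```python
# Worst-case latency per Glacier retrieval tier, in minutes
# (Expedited: 1-5 min, Standard: 3-5 h, Bulk: 5-12 h per the announcement).
# Ordered cheapest-first; treat the ordering as an assumption, not pricing data.
TIERS = [
    ("Bulk",      12 * 60),  # cheapest, slowest
    ("Standard",   5 * 60),
    ("Expedited",       5),  # most expensive, fastest
]

def pick_tier(deadline_minutes):
    """Return the cheapest tier whose worst-case latency meets the deadline,
    or None if even Expedited cannot make it."""
    for name, worst_case in TIERS:
        if worst_case <= deadline_minutes:
            return name
    return None

# pick_tier(8 * 60) -> "Standard"; pick_tier(30) -> "Expedited"; pick_tier(2) -> None
```

The same shape would work for any storage backend that exposes more than one restore speed.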
About deciding which binaries to move to slow storage: it would be good if that were automatic. Couldn't it be based on access frequency plus recency? If a binary has not been accessed for some time, it is moved to slow storage. I would add: if it has not been accessed for some time, _plus_ it was rarely accessed before. Reason: for caching, it is well known that not only recency but also frequency is important for predicting whether an entry will be needed in the near future.

To do that, we could maintain a log that tells you when, and how many times, a binary was read. Maybe Amazon / Azure keep some info about that, but let's assume not (or not in a way we want or can use). For example, each client appends the blob ids that it reads to a file. Multiple such files could be merged. To save space in such files (probably not needed, but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB, for example), or care less about them. So, for example, only move files larger than 1 MB; then there is no need to log an entry for the small ones.
* A Bloom filter or similar could be used (so you would retain x% too many entries). Or even simpler: only write the first x characters of the binary id. That way we retain somewhat too much in fast storage, but save time, space, and memory for maintenance.

Regards,
Thomas

On 26.06.17, 18:10, "Matt Ryan" <[email protected]> wrote:

Hi,

With respect to Oak data stores, this is something I am hoping to support later this year, after the implementation of the CompositeDataStore (which I'm still working on). First, the assumption is that there would be a working CompositeDataStore that can manage multiple data stores, and can select a data store for a blob based on something like a JCR property (I'm still figuring this part out).
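The recency-plus-frequency policy Thomas proposes above could be sketched roughly like this. This is a toy illustration, not Oak code; the 1 MB floor comes from the mail, but the idle window, read-count threshold, and all names are made up:

```python
MIN_SIZE = 1 << 20  # only consider binaries larger than 1 MB, as suggested

class AccessLog:
    """In-memory stand-in for the merged per-client read logs."""
    def __init__(self):
        self.reads = {}  # blob id -> list of read timestamps (seconds)

    def record(self, blob_id, when):
        self.reads.setdefault(blob_id, []).append(when)

def should_move_to_slow(blob_id, size, log, now,
                        idle_seconds=30 * 86400, max_reads=3):
    """Move a binary only when it is large, has been idle for a while,
    *and* was rarely read before (recency + frequency combined)."""
    if size < MIN_SIZE:
        return False                 # don't bother with small binaries
    reads = log.reads.get(blob_id, [])
    if not reads:
        return True                  # never read since logging started
    idle = (now - reads[-1]) >= idle_seconds
    rare = len(reads) <= max_reads
    return idle and rare
```

The Bloom-filter or truncated-id variants from the list above would change only how `AccessLog` stores ids, not the decision rule.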
In such a case, it would be possible to add a property to blobs that can be archived, and then the CompositeDataStore could store them in a different location - think AWS Glacier, if there were a Glacier-compatible data store. Of course this would require that we also support an access pattern in Oak where Oak knows that a blob can be retrieved but cannot reply to a request with the requested blob immediately. Instead, Oak would have to give a response indicating "I can get it, but it will take a while" and suggest when it might be available.

That's just one example. I believe once I figure out the CompositeDataStore it will be able to support a lot of neat scenarios, on the blob store side of things anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella <[email protected]> wrote:

> On 26/06/2017 09:00, Michael Dürig wrote:
> >
> > I agree we should have a better look at access patterns, not only for
> > indexing. I recently came across a repository with about 65% of its
> > content in the version store. That content is pretty much archived and
> > never accessed. Yet it fragments the index and thus impacts general
> > access times.
>
> I may say something stupid as usual, but here I can see, for example,
> that such content could be "moved to a slower repository". So, speaking
> of segments, it could be stored in a compressed segment (rather than a
> plain tar) and the repository could either automatically configure the
> indexes to skip that part and/or additionally create an ad-hoc index,
> which would be async by definition, running every, let's say, 10 seconds.
>
> We would gain on repository size and indexing speed.
>
> Just a couple of ideas off the top of my head.
>
> Davide
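The "I can get it, but it will take a while" access pattern Matt describes could be modelled roughly like this. All names here are hypothetical, not actual Oak or CompositeDataStore interfaces; routing by membership in a plain dict stands in for the JCR-property-based selection he mentions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlobResponse:
    """Either the blob bytes, or a hint about when the blob will be available."""
    data: Optional[bytes] = None
    available_in_seconds: Optional[int] = None

    @property
    def immediate(self) -> bool:
        return self.data is not None

class CompositeStore:
    """Toy dispatcher: routes reads to a fast or an archival delegate."""
    def __init__(self, fast, archive, archive_latency=4 * 3600):
        self.fast = fast                      # blob id -> bytes (hot tier)
        self.archive = archive                # blob id -> bytes (slow tier)
        self.archive_latency = archive_latency  # assumed restore time, seconds

    def get(self, blob_id) -> BlobResponse:
        if blob_id in self.fast:
            return BlobResponse(data=self.fast[blob_id])
        if blob_id in self.archive:
            # The blob exists but must first be restored from the archival
            # tier; answer with an availability estimate instead of bytes.
            return BlobResponse(available_in_seconds=self.archive_latency)
        raise KeyError(blob_id)
```

A caller then checks `response.immediate` and either consumes the bytes or schedules a retry, which is essentially the two-phase response Matt sketches.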
