Hi,

I guess you're talking about Amazon Glacier. Did you know about "Expedited retrievals", by the way? https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/ - it looks like there's more than just "slow" + "fast".
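Per that announcement, Glacier offers three retrieval tiers (Expedited, Standard, Bulk) with different latency/cost trade-offs. A minimal sketch of picking the cheapest tier that still meets a retrieval deadline; the latency ceilings follow the announced ranges, but the cheapest-first ordering here is illustrative, not real pricing:

```python
# Worst-case latency per Glacier retrieval tier, in minutes
# (Expedited: 1-5 min, Standard: 3-5 h, Bulk: 5-12 h per the announcement).
# Ordered cheapest-first; treat the ordering as an assumption, not pricing data.
TIERS = [
    ("Bulk",      12 * 60),  # cheapest, slowest
    ("Standard",   5 * 60),
    ("Expedited",       5),  # most expensive, fastest
]

def pick_tier(deadline_minutes):
    """Return the cheapest tier whose worst-case latency meets the deadline,
    or None if even Expedited cannot make it."""
    for name, worst_case in TIERS:
        if worst_case <= deadline_minutes:
            return name
    return None

# pick_tier(8 * 60) -> "Standard"; pick_tier(30) -> "Expedited"; pick_tier(2) -> None
```

The same shape would work for any storage backend that exposes more than one restore speed.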
About deciding which binaries to move to slow storage: it would be good if that were automatic. Couldn't it be based on access frequency plus recency? If a binary has not been accessed for some time, it is moved to slow storage. I would add: if it has not been accessed for some time, _plus_ it was rarely accessed before. Reason: for caching, it is well known that not only recency but also frequency is important for predicting whether an entry will be needed in the near future.

To do that, we could maintain a log that tells you when, and how many times, a binary was read. Maybe Amazon / Azure keep some info about that, but let's assume not (or not in a way we want or can use). For example, each client appends the blob ids that it reads to a file. Multiple such files could be merged. To save space in such files (probably not needed, but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB, for example), or care less about them. So, for example, only move files larger than 1 MB; then there is no need to log an entry for the small ones.
* A Bloom filter or similar could be used (so you would retain x% too many entries). Or even simpler: only write the first x characters of the binary id. That way we retain somewhat too much in fast storage, but save time, space, and memory for maintenance.

Regards,
Thomas

On 26.06.17, 18:10, "Matt Ryan" <[email protected]> wrote:

Hi,

With respect to Oak data stores, this is something I am hoping to support later this year, after the implementation of the CompositeDataStore (which I'm still working on). First, the assumption is that there would be a working CompositeDataStore that can manage multiple data stores, and can select a data store for a blob based on something like a JCR property (I'm still figuring this part out).
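The recency-plus-frequency policy Thomas proposes above could be sketched roughly like this. This is a toy illustration, not Oak code; the 1 MB floor comes from the mail, but the idle window, read-count threshold, and all names are made up:

```python
MIN_SIZE = 1 << 20  # only consider binaries larger than 1 MB, as suggested

class AccessLog:
    """In-memory stand-in for the merged per-client read logs."""
    def __init__(self):
        self.reads = {}  # blob id -> list of read timestamps (seconds)

    def record(self, blob_id, when):
        self.reads.setdefault(blob_id, []).append(when)

def should_move_to_slow(blob_id, size, log, now,
                        idle_seconds=30 * 86400, max_reads=3):
    """Move a binary only when it is large, has been idle for a while,
    *and* was rarely read before (recency + frequency combined)."""
    if size < MIN_SIZE:
        return False                 # don't bother with small binaries
    reads = log.reads.get(blob_id, [])
    if not reads:
        return True                  # never read since logging started
    idle = (now - reads[-1]) >= idle_seconds
    rare = len(reads) <= max_reads
    return idle and rare
```

The Bloom-filter or truncated-id variants from the list above would change only how `AccessLog` stores ids, not the decision rule.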
In such a case, it would be possible to add a property to blobs that can be archived, and then the CompositeDataStore could store them in a different location - think AWS Glacier, if there were a Glacier-compatible data store. Of course this would require that we also support an access pattern in Oak where Oak knows that a blob can be retrieved but cannot reply to a request with the requested blob immediately. Instead, Oak would have to give a response indicating "I can get it, but it will take a while" and suggest when it might be available.

That's just one example. I believe once I figure out the CompositeDataStore it will be able to support a lot of neat scenarios, on the blob store side of things anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella <[email protected]> wrote:

> On 26/06/2017 09:00, Michael Dürig wrote:
> >
> > I agree we should have a better look at access patterns, not only for
> > indexing. I recently came across a repository with about 65% of its
> > content in the version store. That content is pretty much archived and
> > never accessed. Yet it fragments the index and thus impacts general
> > access times.
>
> I may say something stupid as usual, but here I can see, for example,
> that such content could be "moved to a slower repository". So, speaking
> of segments, it could be stored in a compressed segment (rather than a
> plain tar) and the repository could either automatically configure the
> indexes to skip that part and/or additionally create an ad-hoc index,
> which would be async by definition, running every, let's say, 10 seconds.
>
> We would gain on repository size and indexing speed.
>
> Just a couple of ideas off the top of my head.
>
> Davide
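The "I can get it, but it will take a while" access pattern Matt describes could be modelled roughly like this. All names here are hypothetical, not actual Oak or CompositeDataStore interfaces; routing by membership in a plain dict stands in for the JCR-property-based selection he mentions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlobResponse:
    """Either the blob bytes, or a hint about when the blob will be available."""
    data: Optional[bytes] = None
    available_in_seconds: Optional[int] = None

    @property
    def immediate(self) -> bool:
        return self.data is not None

class CompositeStore:
    """Toy dispatcher: routes reads to a fast or an archival delegate."""
    def __init__(self, fast, archive, archive_latency=4 * 3600):
        self.fast = fast                      # blob id -> bytes (hot tier)
        self.archive = archive                # blob id -> bytes (slow tier)
        self.archive_latency = archive_latency  # assumed restore time, seconds

    def get(self, blob_id) -> BlobResponse:
        if blob_id in self.fast:
            return BlobResponse(data=self.fast[blob_id])
        if blob_id in self.archive:
            # The blob exists but must first be restored from the archival
            # tier; answer with an availability estimate instead of bytes.
            return BlobResponse(available_in_seconds=self.archive_latency)
        raise KeyError(blob_id)
```

A caller then checks `response.immediate` and either consumes the bytes or schedules a retry, which is essentially the two-phase response Matt sketches.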
