Hi all, I was recently at a conference [1] where I attended an interesting keynote about data management [2] (I believe it refers to this 2016 paper [3]).
Apart from the approaches proposed to solve the data management problem (e.g. getting rid of DBMSs!), I got interested in the discussion about how we deal with the increasing amount of data we have to manage (also because of some issues we have [4]).

In many systems only a very small subset of the data is actually used, because the information users really need refers mostly to the most recently ingested data (e.g. social networks). While that doesn't always apply to content repositories in general (e.g. if you build a CMS on top of one), I think it's worth discussing whether we can optimize our persistence layer to work better with highly used (e.g. more recent) data and spend less space/CPU on data that is used more rarely.

For example, putting this together with the incremental indexing section of the paper [3], I was thinking (though that's already a solution rather than "just" a discussion) that perhaps we could simply avoid indexing *some* content until it's needed: e.g. the first time a query falls back to traversal, index that content so that the next query over the same data is faster. But that's just an example.

What do others think?

Regards,
Tommaso

[1] : http://www.iccs-meeting.org/iccs2017/
[2] : http://www.iccs-meeting.org/iccs2017/keynote-lectures/#Ailamaki
[3] : https://infoscience.epfl.ch/record/219993/files/p12-pavlovic.pdf
[4] : https://issues.apache.org/jira/browse/OAK-5192
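P.S. To make the "index on first traversal" idea concrete, here's a minimal, purely illustrative sketch (not Oak's actual indexing API; all names are hypothetical): nothing is indexed at ingest time, the first query pays the traversal cost and populates an inverted index as it goes, and subsequent queries over the same data hit the index.

```java
import java.util.*;

// Hypothetical "index on first traversal" sketch. The Map-based content
// store stands in for the repository; none of this is real Oak code.
public class LazyIndex {
    // content store: path -> property value
    private final Map<String, String> content = new HashMap<>();
    // lazily built inverted index: property value -> matching paths
    private final Map<String, Set<String>> index = new HashMap<>();
    private boolean indexed = false;
    public int traversals = 0; // how many full traversals we paid for

    public void ingest(String path, String value) {
        content.put(path, value);
        indexed = false; // newly ingested data invalidates the lazy index
    }

    public Set<String> query(String value) {
        if (!indexed) {
            // first query after ingest: traverse everything once,
            // indexing each node as we visit it
            traversals++;
            index.clear();
            for (Map.Entry<String, String> e : content.entrySet()) {
                index.computeIfAbsent(e.getValue(), k -> new HashSet<>())
                     .add(e.getKey());
            }
            indexed = true;
        }
        // every later query is an index lookup, no traversal
        return index.getOrDefault(value, Collections.emptySet());
    }
}
```

The trade-off is exactly the one discussed above: rarely queried content costs nothing to index, at the price of one slow first query (and a policy for invalidation/partial re-indexing, which this sketch handles crudely by rebuilding everything).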
