Probably this should be exposed through the public API; in fact it is the same as manual rebalancing.
Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <alexey.scherbak...@gmail.com>:

> The poor man's solution for the problem would be stopping the fragmented
> node and removing its partition data, then starting it again, allowing a
> full state transfer already without deletes.
> Rinse and repeat for all owners.
>
> Anton Vinogradov, would this work for you as a workaround?
>
> Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <a...@apache.org>:
>
>> Alexey,
>>
>> Let's combine your and Ivan's proposals.
>>
>> > vacuum command, which acquires exclusive table lock, so no concurrent
>> > activities on the table are possible.
>> and
>> > Could the problem be solved by stopping a node which needs to be
>> > defragmented, clearing persistence files and restarting the node?
>> > After rebalancing the node will receive all data back without
>> > fragmentation.
>>
>> How about having a special partition state SHRINKING?
>> This state should mean that the partition is unavailable for reads and
>> updates but keeps its update counters and is not marked as lost, renting
>> or evicted.
>> In this state we are able to iterate over the partition and apply its
>> entries to another file in a compact way.
>> Indices should be updated during the copy-on-shrink procedure or at
>> shrink completion.
>> Once the shrunk file is ready, we should replace the original partition
>> file with it and mark the partition as MOVING, which will start the
>> historical rebalance.
>> Shrinking should be performed during low-activity periods, but even if we
>> find that activity was high and historical rebalance is not suitable, we
>> may just remove the file and use regular rebalance to restore the
>> partition (this will also lead to a shrink).
>>
>> BTW, it seems we can implement partition shrink in a cheap way.
>> We may just use the rebalancing code to apply the fat partition's entries
>> to the new file.
>> So, 3 stages here: local rebalance, indices update and global historical
>> rebalance.
>>
>> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
>> alexey.goncha...@gmail.com> wrote:
>>
>> > Anton,
>> >
>> > > >> The solution which Anton suggested does not look easy because it
>> > > >> will most likely significantly hurt performance
>> > > Mostly agree here, but what drop do we expect? What price are we
>> > > ready to pay?
>> > > Not sure, but it seems some vendors are ready to pay, for example, a
>> > > 5% drop for this.
>> >
>> > 5% may be a big drop for some use cases, so I think we should look at
>> > how to improve performance, not how to make it worse.
>> >
>> > > >> it is hard to maintain a data structure to choose "page from
>> > > >> free-list with enough space closest to the beginning of the file".
>> > > We can just split each free-list bucket into a pair and use the first
>> > > for pages in the first half of the file and the second for the last.
>> > > Only two buckets are required here since, during the file shrink, the
>> > > first bucket's window will shrink too.
>> > > It seems this gives us the same price on put: just use the first
>> > > bucket in case it's not empty.
>> > > The remove price (with merge) will be increased, of course.
>> > >
>> > > The compromise solution is to have a priority put (to the first part
>> > > of the file), while keeping removal as is, and a schedulable per-page
>> > > migration for the rest of the data during the low-activity period.
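To make the two-bucket idea quoted above concrete, a rough sketch of the
selection logic (class and method names are hypothetical, not taken from the
actual free-list code):

import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of one free-list bucket split by file halves. */
class SplitFreeListBucket {
    /** Pages located in the first half of the partition file. */
    private final Deque<Long> headPages = new ArrayDeque<>();

    /** Pages located in the second half of the partition file. */
    private final Deque<Long> tailPages = new ArrayDeque<>();

    /** On put: prefer the first half, so new data gravitates to the head of the file. */
    Long takeForPut() {
        Long pageId = headPages.poll();

        return pageId != null ? pageId : tailPages.poll();
    }

    /** When a page's free space changes, route it by its position relative to the middle of the file. */
    void release(long pageId, long pageIdx, long midPageIdx) {
        if (pageIdx < midPageIdx)
            headPages.push(pageId);
        else
            tailPages.push(pageId);
    }
}

Put stays about as cheap as before (one extra empty check), while remove has
to route the page into the correct half, which matches the "remove price will
be increased" note above.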
>> >
>> > Free lists are large and slow by themselves, and it is expensive to
>> > checkpoint and read them on start, so as a long-term solution I would
>> > look into removing them. Moreover, I am not sure that adding yet
>> > another background process will improve the codebase's reliability and
>> > simplicity.
>> >
>> > If we want to go the hard path, I would look at a free page tracking
>> > bitmap - a special bitmask page where each page in an adjacent block is
>> > marked as 0 (free) if it has free space above a certain configurable
>> > threshold (say, 80%), and as 1 (full) if less. Some vendors have
>> > successfully implemented this approach, which looks much more
>> > promising, but is harder to implement.
>> >
>> > --AG
>> >
>
> --
>
> Best regards,
> Alexei Scherbakov

--
Best regards,
Alexei Scherbakov
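P.S. A rough sketch of how the free page tracking bitmap described above
might look, just to illustrate the idea (class, names and threshold handling
are hypothetical, not taken from any existing page-memory code):

import java.util.BitSet;

/** Hypothetical free-page tracking bitmap for one block of adjacent pages. */
class FreePageBitmap {
    /** Bit is 1 when the page is considered full, 0 when it still has enough free space. */
    private final BitSet full;

    private final int pagesInBlock;

    /** Fraction of the page that must be free for it to count as "free", e.g. 0.8. */
    private final double freeSpaceThreshold;

    FreePageBitmap(int pagesInBlock, double freeSpaceThreshold) {
        this.full = new BitSet(pagesInBlock);
        this.pagesInBlock = pagesInBlock;
        this.freeSpaceThreshold = freeSpaceThreshold;
    }

    /** Refresh the bit after an insert or remove changed the page's free space. */
    void onFreeSpaceChanged(int pageIdx, int freeBytes, int pageSize) {
        full.set(pageIdx, freeBytes < pageSize * freeSpaceThreshold);
    }

    /** Index of the first page in the block that still has enough free space, or -1 if none. */
    int firstFreePage() {
        int idx = full.nextClearBit(0);

        return idx < pagesInBlock ? idx : -1;
    }
}

A lookup then becomes a scan for the first clear bit instead of a free-list
access, and one bit per page keeps the structure compact, which is part of
why it looks more promising than the free lists.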