I have the GPG key, and I have *some* backup, but the original deletion
event was a while ago. I don't know much about the indexer, but I think
this means the backup is not old enough:

camlistore=# select * from rows where k LIKE
'%sha1-0009743ea7ead6126510ad334cfe4199c2c383f7%';
 k | v
---+---
(0 rows)

camlistore=# select * from rows where v LIKE
'%sha1-0009743ea7ead6126510ad334cfe4199c2c383f7%';
 k | v
---+---
(0 rows)
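(For what it's worth, since a blobref is just the named hash of the blob's raw bytes, if candidate chunk data does turn up somewhere — a cache, that old drive — it could be matched against the missing refs without a running Perkeep. A quick sketch, with a made-up candidate chunk:)

```python
import hashlib

def blobref(data: bytes) -> str:
    """Compute the sha1-era blobref: the hash of the blob's raw bytes."""
    return "sha1-" + hashlib.sha1(data).hexdigest()

# One of the refs the reindexer complained about:
missing = {"sha1-0009743ea7ead6126510ad334cfe4199c2c383f7"}

# Any recovered chunk can be checked against the missing set directly:
candidate = b"...candidate chunk bytes..."
print(blobref(candidate) in missing)
```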


... but now you've got me thinking about it, and there might be an older
copy of my data on a hard drive that's just been collecting dust since I
moved in the fall. If it does have data, it may not have 100% of the
deleted blobs, but it could have most of them. I don't think it'd be a
Postgres index, but it might have an on-disk one. I'll try to dig that out
tomorrow.
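In case it helps when I dig it out: if that old drive was a plain localdisk blob store, I'm guessing the surviving blobs in the deleted range could be enumerated with something like this (the mount point is made up, and the `sha1-<hex>.dat` naming in sharded directories is my understanding of the localdisk layout, so treat it as a sketch):

```shell
# Hypothetical mount point; localdisk storage (as I understand it) keeps
# loose blobs as sha1-<hex>.dat files in sharded directories.
# The deleted range was roughly sha1-000 through sha1-004.
find /mnt/old-drive -type f -name 'sha1-00[0-4]*.dat' | sort > recovered.txt
wc -l < recovered.txt
```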

That --keep-going flag sounds like it'd be very useful at this point. Let
me know if you're doing this, otherwise I'll try to tackle it this weekend.

Thanks for the help!

On Sun, May 6, 2018 at 10:28 PM Brad Fitzpatrick <[email protected]> wrote:

> Ouch! Sorry to hear. :-(
>
> We are aware that the re-indexing errors aren't useful enough on
> failure. I filed https://github.com/perkeep/perkeep/issues/1122
> recently. I plan to address that when I redo the reindexer to make it
> faster.
>
> But it's easy enough for us to add some --keep-going sort of flag
> to perkeepd so you can at least get your server back up, even if the
> index & blobs are incomplete. At least searches would work, if you're
> trying to find your unique perkeep-only content.
>
> Really it should give you a path to each missing blob so you could
> have some context for what it might be.
>
> Also, do you have a backup of an old index anywhere? We might be able
> to recover a bunch of the blobs just from the index, as long as you
> still have your signing GPG keyring.
>
> And if a bunch of those 0.2% blobs are JPEG data, we could trace back
> from missing chunk to file to permanode to imported Google Photos
> item, and just re-download the file. Of course, it'd make sha224 blobs
> now, but from the original photo you could find the range of the file
> that's the sha1 blob you're looking for.
>
> Which step would be most helpful first? A flag to the reindexer to
> --keep-going?
>
>
> On Sun, May 6, 2018 at 11:50 AM, stephen.searles
> <[email protected]> wrote:
> > So, a couple months ago, I made a mistake and deleted some data... I'm
> > going to share the experience here, and outline some of the plans I've
> > got to move on. Help would be welcome, but mostly my aim here is to
> > provide some insight as to how things go when they go wrong, for
> > considerations on improvements.
> >
> > I was working on testing out Digital Ocean's new Spaces product as a
> > backing store for Perkeep (well, a version that was still Camlistore).
> > At some point during the process, I must have deleted some blobpacked
> > files on my server. I know I was making odd rsync commands at one point
> > that day, but I don't know what did it. Whatever it was, it deleted an
> > early segment of blobs starting with "sha1-000" up through about
> > "sha1-004". At the time, I didn't realize it. Perkeep must have had the
> > blobs cached, so things looked fine in the UI. I eventually set aside
> > work on that Spaces implementation because of DO's performance
> > problems. Then, the other day, I went to update to a more recent
> > version (to Perkeep!). After a little config updating, I eventually got
> > it running, but the UI wasn't showing my content properly. It just
> > looked like a huge list of folders, with no names and no meaningful
> > contents (judging from the few I poked at). Searches I used to run in
> > the UI don't return any results. I ran reindexing and eventually tried
> > both recovery modes. That's when I started seeing the errors about
> > missing blobs, all in that early range of sha1s.
> >
> >> May  6 10:51:22 new perkeepd[2237]: 2018/05/06 10:51:22 Error reindexing
> >> sha1-0009743ea7ead6126510ad334cfe4199c2c383f7: index: failed to fetch
> >> sha1-0009743ea7ead6126510ad334cfe4199c2c383f7 for reindexing: file does
> >> not exist
> >
> > So my first observation here: problems on the backend storage can
> > easily go unseen. Earlier detection of the problem might have allowed
> > me to recover from cache. The second observation: recovery/reindex
> > errors cause the instance to fail to start, with limited info for
> > repair (what's telling it to reindex something that doesn't exist?).
> >
> > So the situation I find myself in now: I deleted about 0.2% of my
> > data, but the instance is more or less hosed. (And the data is just
> > gone: this is how I realized my backups weren't covering the block
> > storage I moved my Perkeep data off to... oops.) I have a few years of
> > data in that instance. I have a few importers: Google Photos and RSS
> > feeds. I have a few attempts at syncing my music and whole home
> > directories up to it. Then there's just some odds and ends uploaded
> > via the UI. Those are the important bits that would be great to
> > recover, mostly because I'm not sure what might be there; the rest is
> > recoverable elsewhere.
> >
> > The crossroads is between cleaning up the surviving data to repair the
> > instance, or searching for the non-re-importable data and starting
> > with a fresh new instance. I'm not sure what the process for repairing
> > the instance would be... when the index complains about blobs being
> > missing, what causes it to expect those blobs? Is it possible to look
> > up blobs which reference other specific (but nonexistent) blobs? I'm
> > not sure which schema relations to interrogate, so searching for the
> > non-re-importable data sounds easier. I can build up a search query
> > full of exclusions until I've gotten rid of all the imported data.
> > That's what I intend to try first.
> >
> > That said, my third observation: there doesn't seem to be a good way
> > to analyze a Perkeep instance's data in aggregate without a lot of
> > manual labor. (Or have I just not seen it yet?)
> >
> > I'm not sure what, if any, good improvements we could make to Perkeep
> > based on this information, but I'm happy to keep discussing or share
> > more as my discovery/recovery process continues.
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Perkeep" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> > For more options, visit https://groups.google.com/d/optout.
>
