[okular] [Bug 436738] docdata duplicated each time pdf is edited

2023-02-14 Thread yan12125
https://bugs.kde.org/show_bug.cgi?id=436738

yan12125  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|yu3actxt2tttf...@chyen.cc   |

-- 
You are receiving this mail because:
You are the assignee for the bug.

[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-09-23 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

yan12...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yan12...@gmail.com


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-19 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #18 from Albert Astals Cid  ---
(In reply to pbs3141 from comment #17)
> > So I open a file, add an annotation, save it as filea.pdf then move the
> > annotation around and save it as fileb.pdf. Your algorithm would think it's
> > the same file.
> 
> Yes, except in the rare case that the annotation coordinates lie in a 4kB
> chunk. This indeterminacy alone is enough to make me not like my partial
> hashing scheme, unless it can be guaranteed that the annotation data will
> always be included.
> 
> Maybe the hash of the whole file is the way to go. Possibly falling back to
> partial hashing only for huge files.
> 
> By the way, if I add and save an annotation on a huge PDF file (> 1GB), does
> the whole PDF get rewritten out, or just the annotation?

Normal behaviour is append at the end.


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-18 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #17 from pbs3...@googlemail.com ---
> So I open a file, add an annotation, save it as filea.pdf then move the
> annotation around and save it as fileb.pdf. Your algorithm would think it's
> the same file.

Yes, except in the rare case that the annotation coordinates lie in a 4kB
chunk. This indeterminacy alone is enough to make me not like my partial
hashing scheme, unless it can be guaranteed that the annotation data will
always be included.

Maybe the hash of the whole file is the way to go. Possibly falling back to
partial hashing only for huge files.

By the way, if I add and save an annotation on a huge PDF file (> 1GB), does
the whole PDF get rewritten out, or just the annotation?


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-18 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #16 from Albert Astals Cid  ---
> The SHA algorithms don’t make collisions that we should care about. :)

I don't understand what you mean by this.

> For PDF, the hash of filesize + a couple of 4kB chunks throughout the file 
> would surely be good enough. For some formats I can imagine users might want 
> to change small bits of the file in a way this can't detect, but PDF isn't 
> one of them.

So I open a file, add an annotation, save it as filea.pdf then move the
annotation around and save it as fileb.pdf. Your algorithm would think it's the
same file
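The objection above can be illustrated with a small sketch. It assumes a hypothetical sampled_hash() that hashes the file size plus three 4kB chunks (start, middle, end); the chunk positions and the 100,000-byte test data are illustrative choices, not anything taken from Okular. Two buffers of the same size that differ only outside the sampled chunks get the same partial hash, while a hash of the whole content tells them apart.

```python
import hashlib

CHUNK = 4096  # the "4kB" chunk size discussed in the thread


def sampled_hash(data: bytes) -> str:
    """Hash the size plus three 4 kB chunks (hypothetical scheme)."""
    n = len(data)
    h = hashlib.sha256(str(n).encode("ascii") + b"\0")
    # Sample the start, the middle and the end of the data.
    for off in (0, max(0, n // 2 - CHUNK // 2), max(0, n - CHUNK)):
        h.update(data[off:off + CHUNK])
    return h.hexdigest()


# Two "files" of identical size that differ in one byte at offset 20000,
# which lies outside all three sampled chunks.
file_a = bytes(100_000)
file_b = bytearray(file_a)
file_b[20_000] = 1
file_b = bytes(file_b)

same_partial = sampled_hash(file_a) == sampled_hash(file_b)
same_full = hashlib.sha256(file_a).digest() == hashlib.sha256(file_b).digest()
print(same_partial, same_full)
```

Whether two saved PDFs would really end up with the same size is a separate question; the sketch only shows what happens when they do.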


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-17 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #15 from pbs3...@googlemail.com ---
> My idea would be to store docdata (maybe including thumbnails) hashed by the 
> file name/path/content, and encrypted with a hash of the file content, so 
> they can only be read with read access to the document file (or a copy of it).

So, you're suggesting the encryption to address the privacy issue mentioned in
that thread. Would it not be simpler to make docdata only user-readable?

> I fear that what you're suggesting would create too much I/O. Each time i 
> open a PDF i have never opened before i would have to read all the filenames 
> in the docdata folder in case some of them has a matching sha.
>
> Doesn't sound like it would work fine at scale.

No, I only suggested testing the existence of docdata/$HASH and
docdata/$FULLPATH, which takes constant io. (I think David already said the
same thing in the next comment.)

The only potential source of large io in my suggestion was the amortised
deletion of stale files. The difficulty is in randomly selecting k files from a
directory containing N files, where say k ~ 5 and N ~ 5000. I think this can
still be done quickly, in O(k) io not O(N), by walking the linked list returned
by opendir, but only reading a random selection of k files. But I'll need to
benchmark / read up on disk formats to be sure.
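One way to pick the k files could look like the sketch below. It uses reservoir sampling over os.scandir(), which is a swapped-in technique and not exactly the "walk the list and read k files" idea above: it still enumerates all N directory entry names, but only the k selected files would later need to be opened. The function name sample_entries is hypothetical, and whether enumerating N names is cheap enough is exactly the part that would still need benchmarking.

```python
import os
import random


def sample_entries(directory: str, k: int, rng=random) -> list:
    """Return up to k entry paths chosen uniformly at random (reservoir sampling)."""
    sample = []
    with os.scandir(directory) as entries:
        for i, entry in enumerate(entries):
            if i < k:
                # Fill the reservoir with the first k entries.
                sample.append(entry.path)
            else:
                # Replace a reservoir slot with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = entry.path
    return sample
```

Only the returned paths would then be opened and checked for staleness on a given startup.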

> I don't like the idea of identifying the docdata exclusively by hash.

It's good enough for git! Surely it should be good enough here?

> Hashes by definition will have collisions, so will have filenames+filesize, 
> but it's much easier to explain that two documents "share" their docdata 
> because of that (and if the user actually has two files with the same 
> filename and size and are not the same, she can rename one of the files) than 
> the fact that if they share the hash of the first N bytes, which is something 
> that no one "normal" can really understand and if even they understand they 
> can't fix it.

I don't like hashing the whole file. Users may want to open some pretty large
PDFs. I've personally needed to view a PDF of a long slideshow with many large
pictures that was over 1GB. I shouldn't have to hash the whole lot just to view
a small part of it.

For PDF, the hash of filesize + a couple of 4kB chunks throughout the file
would surely be good enough. For some formats I can imagine users might want to
change small bits of the file in a way this can't detect, but PDF isn't one of
them.

> Ok, now I understand. pbs3141 and me suggested mostly the same, just that my 
> suggestion does not use any filepaths, and so does not need to process 
> docdata files to decide whether to delete them.

The thing that is lacking with an implementation that doesn't use filepaths is
that if you overwrite a PDF in-place, then you will lose the viewing data if it
is not currently open in Okular. (I encounter this problem frequently when
using LyX.)


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-15 Thread David Hurka
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #14 from David Hurka  ---
> My answer was to pbs3141 that was suggesting something 
> different as far as I understand.
Ok, now I understand. pbs3141 and I suggested mostly the same, just that my
suggestion does not use any filepaths, and so does not need to process docdata
files to decide whether to delete them.

The fear of collisions is probably real for .txt files and similar, where two
different documents can easily have the same first 4kB. (The text on the first
two pages is the same.) The SHA algorithms don’t make collisions that we
should care about. :)


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-15 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #13 from Albert Astals Cid  ---
(In reply to David Hurka from comment #12)
> Every time I open a document, Okular checks whether the file
> “<filesize>,<filename>.xml” exists. Is that different to checking whether the
> file “<hash>.xml” exists?
> 
> Or do I understand something wrong?

My answer was to pbs3141 that was suggesting something different as far as I
understand.

I don't like the idea of identifying the docdata exclusively by hash.

Hashes will by definition have collisions, and so will filename+filesize. But
it's much easier to explain that two documents "share" their docdata because
they have the same filename and size (and if the user actually has two
different files with the same filename and size, she can rename one of them)
than to explain that they share the hash of the first N bytes, which is
something that no one "normal" can really understand, and which they couldn't
fix even if they did understand it.


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-15 Thread David Hurka
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #12 from David Hurka  ---
Every time I open a document, Okular checks whether the file
“<filesize>,<filename>.xml” exists. Is that different to checking whether the
file “<hash>.xml” exists?

Or do I understand something wrong?


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-13 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #11 from Albert Astals Cid  ---
I fear that what you're suggesting would create too much I/O. Each time I open
a PDF I have never opened before, I would have to read all the filenames in the
docdata folder in case one of them has a matching sha.

Doesn't sound like it would work fine at scale.


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-13 Thread David Hurka
https://bugs.kde.org/show_bug.cgi?id=436738

David Hurka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugs.kde.org/show_bug.cgi?id=317856

--- Comment #10 from David Hurka  ---
In Bug 317856, it was requested to store the file name only as a hash.

Related, in the merge request
https://invent.kde.org/graphics/okular/-/merge_requests/422#note_238154 (Create
“Welcome screen” that replaces window where nearly all widgets are in disabled
state) Jiří Wolker writes:

> And there is also privacy risk – users sometimes do not want to store
> thumbnails of their documents. (Example: Home directory incl. config
> files is not encrypted, Documents is. When user opens file from Documents,
> the thumbnail gets stored in the home directory. This makes unencrypted
> image of part of the file.)

My idea would be to store docdata (maybe including thumbnails) hashed by the
file name/path/content, and encrypted with a hash of the file content, so they
can only be read with read access to the document file (or a copy of it).

To delete old docdata files, there could be a list of the last 5000 used
docdata files. This list is updated every time a file is opened/closed, and
docdata files that are no longer in the list are deleted. Docdata files old
enough to have left the “Open Recent” list are stripped of their thumbnail.

But this way, there wouldn’t be a reliable way to detect duplicate docdata
files.
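The "list of the last 5000 used docdata files" could be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions: the class name RecentDocdata and its touch() method are hypothetical, the list is not persisted between sessions, and thumbnail stripping is left out.

```python
import os
from collections import OrderedDict


class RecentDocdata:
    """Track the most recently used docdata files and delete the rest."""

    def __init__(self, limit: int = 5000):
        self.limit = limit
        self._entries = OrderedDict()  # oldest entry first

    def touch(self, docdata_path: str) -> list:
        """Mark a docdata file as used; return the paths evicted and deleted."""
        # Move the entry to the "most recent" end of the list.
        self._entries.pop(docdata_path, None)
        self._entries[docdata_path] = True
        evicted = []
        while len(self._entries) > self.limit:
            old, _ = self._entries.popitem(last=False)
            if os.path.exists(old):
                os.remove(old)
            evicted.append(old)
        return evicted
```

Calling touch() on every open/close keeps the cost per document constant, since no directory scan is involved.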


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-12 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #9 from pbs3...@googlemail.com ---
> Now we are talking a bit more general. Should we split this into two bug 
> reports?

If you're considering enlarging the scope of the bug report, then this may
provide an opportunity to fix both this issue and the file-rename issue all at
once.

How about storing a docdata file whose name is the hash of the file size and
the first 4kB (or middle 4kB, or whatever). Assume this is as good as a hash of
the whole file, though obviously less expensive to compute.
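A minimal sketch of such a name, assuming SHA-256 and the first 4kB (the helper name partial_hash and the choice of SHA-256 are assumptions for illustration only):

```python
import hashlib
import os

CHUNK = 4096  # "the first 4kB"


def partial_hash(path: str) -> str:
    """Hash the file size plus the first 4 kB of the file."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    # Include the size so that files of different length never share a name
    # merely because their first 4 kB are identical.
    h.update(str(size).encode("ascii") + b"\0")
    with open(path, "rb") as f:
        h.update(f.read(CHUNK))
    return h.hexdigest()
```

The cost is one stat and one 4kB read per document, regardless of file size.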

Store in the docdata file all full filepaths where this document has been
opened from. (This is already done, according to Comment 7.)

Purge from each docdata file any filepaths that have been deleted, and purge
any docdata files that have had no filepaths for 6 months (or some configurable
expiration period). Do this in an amortised / randomised fashion, only checking
a few files on each startup, to keep the io negligible.

That fixes file rename. To deal with modifications, create soft links "full
path" -> "docdata file" in docdata directory. If a file is opened with no
matching docdata file for its hash, search instead by filename, and if one is
found, use that. (And write out a new docdata file named by the hash.) Purge
old links where the path no longer exists in an amortised manner similar to
before.
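The lookup order described above (by hash first, then by path) might look like this sketch. Everything in it is an assumption made for illustration: the function name find_docdata, the "<hash>.xml" naming, and encoding the full path as "path-" plus a SHA-256 of the path so that it can serve as a link name inside the docdata directory. It uses symbolic links, so it assumes a platform where os.symlink is available.

```python
import hashlib
import os
import shutil
from typing import Optional


def _link_path(docdata_dir: str, doc_path: str) -> str:
    # A full path cannot be used as a file name directly, so hash it.
    digest = hashlib.sha256(os.path.abspath(doc_path).encode("utf-8")).hexdigest()
    return os.path.join(docdata_dir, "path-" + digest)


def find_docdata(docdata_dir: str, doc_path: str, content_hash: str) -> Optional[str]:
    """Return the docdata file for a document, or None if there is none."""
    by_hash = os.path.join(docdata_dir, content_hash + ".xml")
    link = _link_path(docdata_dir, doc_path)
    found = None
    if os.path.exists(by_hash):
        found = by_hash
    elif os.path.exists(link):  # False for a dangling link
        # Same path, new content: carry the old docdata over to the new hash.
        shutil.copyfile(link, by_hash)
        found = by_hash
    if found is not None:
        # (Re)point the path link at the docdata file now in use.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(os.path.basename(found), link)
    return found
```

Each lookup touches at most two known names, so the I/O does not grow with the number of docdata files.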

Has something like this been previously considered and ruled out?


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-09 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #8 from Albert Astals Cid  ---
> So it is not just (filesize,filename), but actually stores the full file path.

Is that full file path used anywhere for anything? (Yes, yes, I know: why are
we storing the url if we don't use it? Good question.)

> The workflow originally addressed by this bug report was that Okular 
> automatically reloads the document, and so actively creates these duplicates. 
> Now we are talking a bit more general. Should we split this into two bug 
> reports?

My question is how Okular can know that the user wants to remove the file. For
humans, who are intelligent beings, it's quite easy to see, but for an app I
don't see how it can figure out that the scenario is one in which the data
doesn't have to be stored.


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-08 Thread David Hurka
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #7 from David Hurka  ---
This is the content of my 13572671.pgfplots.pdf.xml : (If formatting doesn’t
break here)

[XML markup not preserved in this archive copy; only the elision "[...]" survives]

So it is not just (filesize,filename), but actually stores the full file path.

When I overwrite one instance of (filesize,path), that means that there can’t
be another instance of (filesize,path) somewhere in the system. Except when the
user actively restores an old version of the file.

Is the url attribute ignored when docdata files for an opened document are
searched? In that case it is true that there may be another file which fits
this docdata file.

The workflow originally addressed by this bug report was that Okular
automatically reloads the document, and so actively creates these duplicates.
Now we are talking a bit more general. Should we split this into two bug
reports?


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-08 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #6 from pbs3...@googlemail.com ---
Albert's point is that I could make many copies of my pdf file, then overwrite
it. Under my proposal, all the copies would have their viewing positions reset.

(That is, assuming all the copies have been made in different directories
sharing the same filename. And also assuming that the new version is viewed
immediately after updating it.)

Given that Okular already resets the viewing position in far more everyday
situations like file rename, I don't see how resetting the viewing position in
this exotic situation is such a big deal.

Still, my proposal would probably annoy that one user who takes regular
snapshots of their system, and regularly looks back at old versions of their
pdf documents in old snapshots, and who would now find that their viewing
position keeps resetting.

It's a question of balancing the effect of breaking one person's workflow by
changing something they shouldn't be relying on anyway (https://xkcd.com/1172),
given that Okular doesn't preserve viewing position that well in general, vs
littering everyone else's systems with thousands of tiny harmless files which,
while not taking up very much space, is certainly far from optimal.

I leave it up to you!


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-08 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

Albert Astals Cid  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CONFIRMED                   |REPORTED
     Ever confirmed|1                           |0

--- Comment #5 from Albert Astals Cid  ---
We only store filesize and filename so you can move files around and your
settings are kept.

The fact that you overwrote this instance of (filesize,filename) doesn't mean
you don't have other copies of (filesize,filename) in your filesystem, so no, I
don't see why we should assume that (filesize,filename) is not useful anymore.
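To make the (filesize,filename) key concrete, here is a small sketch. The naming pattern is inferred from the example file names quoted in comment #4 (such as 13572671.pgfplots.pdf.xml); the helper docdata_name is hypothetical and is not Okular's actual code.

```python
import os


def docdata_name(path: str) -> str:
    """Build a docdata file name from the file size and the file name only."""
    return f"{os.path.getsize(path)}.{os.path.basename(path)}.xml"
```

Under this assumption, moving a file to another directory leaves the name unchanged, so the settings follow the file, while any edit that changes the file size yields a new name, which is how a fresh docdata file appears after each edit.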


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-07 Thread David Hurka
https://bugs.kde.org/show_bug.cgi?id=436738

David Hurka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|WAITINGFORINFO              |---
     Ever confirmed|0                           |1
             Status|NEEDSINFO                   |CONFIRMED

--- Comment #4 from David Hurka  ---
To me this appears clear. There is no point in storing old versions of these
files.

My suggestion is to delete old files that describe the same document, and also
delete the old file when the document is reloaded automatically.

In case we don’t want to save file names, but hashes of the file content, only
my second suggestion would apply. Delete/migrate old descriptions when the
document is reloaded.

Question: What do these numbers mean? For example, I have these two files:

13107254.pgfplots.pdf.xml
13572671.pgfplots.pdf.xml

They point to the same file under same paths, but do not store any timestamp.
If I remove all but one of them, nothing at all will happen, right?


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-07 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #3 from pbs3...@googlemail.com ---
Here is another way to put it. The only way the old viewing parameters could
ever come in useful again is if I secretly kept a copy of the old version, and
overwrote the new version with it at some later date. Then Okular could say
"aha, let me take you back to where you were". It would then reset the viewing
position to how it was when I was last looking at it, which might be different
to the current viewing position. But what does the user want in this case?
They'd rather stay where they are. So the saved information proves to be of no
use.


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-07 Thread bugzilla_noreply
https://bugs.kde.org/show_bug.cgi?id=436738

--- Comment #2 from pbs3...@googlemail.com ---
I would expect the old xml file to be overwritten with the new one. After all,
Okular clearly already knows that the new pdf file has overwritten the old one;
that's why it reloads the pdf and keeps the same viewing position in the way it
does. Given this, it doesn't make sense to keep the old xml file around. It's
like Okular's saying "I know you just overwrote your file, but I'm going to
keep the viewing parameters around for the old version just in case you want to
have another look at it."


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-07 Thread Nate Graham
https://bugs.kde.org/show_bug.cgi?id=436738

Nate Graham  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |n...@kde.org


[okular] [Bug 436738] docdata duplicated each time pdf is edited

2021-05-07 Thread Albert Astals Cid
https://bugs.kde.org/show_bug.cgi?id=436738

Albert Astals Cid  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WAITINGFORINFO
                 CC|                            |aa...@kde.org
             Status|REPORTED                    |NEEDSINFO

--- Comment #1 from Albert Astals Cid  ---
I don't see how this is a bug. We need to save the settings; what would you
expect us to do?
