https://bugs.kde.org/show_bug.cgi?id=444520

--- Comment #7 from Adam Fontenot <adam.m.fontenot+...@gmail.com> ---
(In reply to tagwerk19 from comment #6)
> I'm happy to check behaviour if you can generate a test PDF/SVG and
> upload/attach it
Here's the original file that caused the problem:
https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

Note that it's likely to cause problems if you download it on a system with
Baloo's content indexing enabled. So be cautious. And to be clear, I don't have
any direct evidence to suggest that this file was responsible for Baloo's
ballooning index.

> > If the index for a file grows to be larger than the original file, kill the
> > extraction process, add the file to a list of failed files, delete the index
> > for it, and don't try indexing the content of the file again. 
> I don't think there's an easy relation between the size of the source and
> the size of the index. The index contains "lookups", you type a search term
> and a list of hits gets pulled off disc. The design decision was for speed;
> you get a refined list of hits in Dolphin as you type more letters into the
> search box or view your files in folders based on the tags you've given them.
That's a fair point. Let me put it a different way. 

The laptop in question has a 128 GB SSD. That's not an uncommon size for
inexpensive laptops that come with an SSD. A user might reserve 50 GB or so for
their home partition and have, let's say, 35 GB of files on it. My point is just
that it's *understandable* for such a user to be upset about an automatically
enabled system component randomly deciding to use 5+ GB of the remaining free
space: that's a third of the 15 GB they have left. SSDs mean that storage is
now often at a premium again, and many users will not be willing to trade a
large percentage of their free space for slightly faster / better file searches.

So while I can't speak to Baloo's internal architecture or tradeoffs, I can say
from a user's perspective that an index consuming more than 10% of the total
size of the files it covers feels really bad. If there's a good reason the
index for a file can't be guaranteed to stay smaller than the file itself, let
me suggest an alternative. Is the algorithm Baloo uses to decide whether to
create a content index for a file tunable at all? Perhaps an option to limit
the size of the Baloo cache could be provided: either X GB or X% of free space.
Given that budget, Baloo could manage its storage by dropping the files that
are least usefully indexed. For example, if a single file is 20 MB but using
2 GB of index space, it should be the first to go.
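
To make that concrete, here's a rough sketch in Python of what a
budget-plus-eviction policy might look like. Everything in it is hypothetical:
I don't know whether Baloo can currently report a per-file breakdown of index
space, so the (path, file size, index size) input it assumes is my invention,
not any existing API.

    # Illustrative only: assumes the index can report how much space
    # each file's content index occupies. All names are hypothetical.
    def choose_evictions(entries, budget_bytes):
        """entries: list of (path, file_size, index_size) tuples.
        Returns the paths whose content index should be dropped so
        the total index size fits within budget_bytes."""
        total = sum(index_size for _, _, index_size in entries)
        if total <= budget_bytes:
            return []
        # Worst offenders first: files whose index is largest relative
        # to the file itself. A 20 MB file holding 2 GB of index
        # (ratio ~100) is evicted before anything reasonable.
        ranked = sorted(entries,
                        key=lambda e: e[2] / max(e[1], 1),
                        reverse=True)
        dropped = []
        for path, _file_size, index_size in ranked:
            if total <= budget_bytes:
                break
            dropped.append(path)
            total -= index_size
        return dropped

The "X% of free space" variant would just compute budget_bytes from the
filesystem's free-space figure before calling this.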

At any rate, my reasoning for capping the index size relative to the original
file was that it's a pretty good heuristic for filtering out files that don't
contain indexable content. For example, biologists frequently work with plain
text "SAM" files, which contain long strings of meaningful but not indexable
text representing bits of DNA plus metadata, e.g.
"ATAGCACTCAAGCAATCAAATCAAATAGCCAACTCCTTATCTCAACTCTCC". These files might be
under 10 MB, and they might have a .sam extension, a .txt extension, or no
extension at all. Obviously such files should not be content-indexed, but it's
difficult for a user to be sure they've excluded them all. This goes back to
the "just works" principle: insofar as possible, content indexing should
quietly make searches better without ever significantly impacting system
resources. That implies the need for heuristics that prevent indexing files
like these.
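
For what it's worth, a heuristic like that could be quite cheap. Here's a
sketch (again Python, again purely illustrative: the thresholds are guesses
and I don't know what hooks Baloo's extractors actually expose) of a "does
this look like prose?" check based on average token length, which a file full
of runs like the one above would fail:

    # Sketch of a pre-indexing check for plain-text files.
    # Thresholds are made up; a real version would need tuning.
    def looks_indexable(path, sample_bytes=65536, max_avg_token_len=20):
        try:
            with open(path, 'rb') as f:
                data = f.read(sample_bytes)
        except OSError:
            return False
        try:
            # A real version would tolerate a sample cut mid-character.
            text = data.decode('utf-8')
        except UnicodeDecodeError:
            return False  # not text at all; skip content indexing
        tokens = text.split()
        if not tokens:
            return False
        # A 51-character DNA read pushes the average token length far
        # past anything normal prose produces.
        avg_len = sum(len(t) for t in tokens) / len(tokens)
        return avg_len <= max_avg_token_len

Because the sample is size-capped, the check is constant-time per file, so it
shouldn't noticeably slow the indexer down.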
