[gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread Alan McKinnon
Hi all,

Haven't had much luck finding this info:

If I reboot this machine and start KDE, Nepomuk starts a rather long-lived 
index of my home directory. It takes up about 30-40% cpu and lasts as much as 
15 minutes sometimes. This is annoying because after a reboot I usually want 
to catch up on mail, rss feeds and fire up VirtualBox. So nepomuk is just 
wasting my time at this point.

How does nepomuk know when to do its thing, how can I tweak what it does and 
how can I discover why it feels it necessary to reindex my entire maildir when 
surely it has a perfectly valid index already from just before I shut down?

Strigi is also enabled if that's relevant to the question.


-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread Paul Hartman
On Fri, Nov 19, 2010 at 9:17 AM, Alan McKinnon alan.mckin...@gmail.com wrote:
 Hi all,

 Haven't had much luck finding this info:

 If I reboot this machine and start KDE, Nepomuk starts a rather long-lived
 index of my home directory. It takes up about 30-40% cpu and lasts as much as
 15 minutes sometimes. This is annoying because after a reboot I usually want
 to catch up on mail, rss feeds and fire up VirtualBox. So nepomuk is just
 wasting my time at this point.

My /guess/ is that it scans every time you restart to be sure nothing
changed while it was shut down. It doesn't know if you've dual-booted,
logged into xfce, mounted the disk in another machine, had fsck remove
files, etc.

I think Tracker behaves the same way in gnome-land.

 How does nepomuk know when to do its thing, how can I tweak what it does and
 how can I discover why it feels it necessary to reindex my entire maildir when
 surely it has a perfectly valid index already from just before I shut down?

I am pretty sure it is tied to your KDE user session, and not running
as a system daemon in the background. Perhaps you can suspend it via
some autostarting script, and then resume it after whatever amount of
time you're comfortable with.
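
A rough sketch of such an autostart script, in Python. The qdbus service, object path and method names below are assumptions that nobody in this thread has confirmed; check what your session actually exposes (for example by running qdbus with no arguments) before relying on them. The function name delay_indexing and its parameters are made up for illustration.

```python
import subprocess
import time

# ASSUMPTION: service name, object path and method names are unverified
# guesses for a KDE 4.x session. Adjust them to what qdbus reports.
SUSPEND_CMD = ["qdbus", "org.kde.nepomuk.services.nepomukstrigiservice",
               "/nepomukstrigiservice", "suspend"]
RESUME_CMD = ["qdbus", "org.kde.nepomuk.services.nepomukstrigiservice",
              "/nepomukstrigiservice", "resume"]


def delay_indexing(delay_seconds, suspend_cmd=SUSPEND_CMD,
                   resume_cmd=RESUME_CMD, run=subprocess.call,
                   sleep=time.sleep):
    """Suspend indexing, wait delay_seconds, then resume it.

    Returns the exit codes of the suspend and resume commands.
    `run` and `sleep` are injectable so the logic can be tried out
    without touching a real session.
    """
    suspended = run(suspend_cmd)
    try:
        sleep(delay_seconds)
    finally:
        # Always try to resume, even if the wait is interrupted.
        resumed = run(resume_cmd)
    return suspended, resumed


# Example call for an autostart script: hold indexing off for 15 minutes.
# delay_indexing(15 * 60)
```

The resume call sits in a finally block so that an interrupted script does not leave the indexer suspended for the rest of the session.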

Looking in here:
http://api.kde.org/4.5-api/kdebase-runtime-apidocs/nepomuk/html/classNepomuk_1_1IndexScheduler.html

In the indexing speed settings, it says:

enum Nepomuk::IndexScheduler::IndexingSpeed

Enumerator:
FullSpeed     Index at full speed, i.e. do not use any artificial delays.
              This is the mode used if the user is away.

ReducedSpeed  Reduce the indexing speed mildly.
              This is the normal mode used while the user works. The indexer
              uses small delay between indexing two files in order to keep
              the load on CPU and IO down.

SnailPace     Like ReducedSpeed delays are used but they are much
              longer to get even less CPU and IO load.
              This mode is used for the first 2 minutes after startup to give
              the KDE session manager time to start up the KDE session rapidly.


So based on that, for the first 2 minutes after KDE starts it should
be using the least aggressive indexing speed (but indexing
nevertheless).
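
The way those three modes get picked can be sketched roughly like this. It is only an illustration in Python, not Nepomuk's actual C++ code: the function name choose_speed, its parameters, and the rule that the startup window wins over "user is away" are all assumptions. Only the three mode names and the 2 minute figure come from the API docs quoted above.

```python
from enum import Enum


class IndexingSpeed(Enum):
    # Mode names taken from the quoted Nepomuk::IndexScheduler docs.
    FULL_SPEED = "FullSpeed"
    REDUCED_SPEED = "ReducedSpeed"
    SNAIL_PACE = "SnailPace"


# The docs say SnailPace is used for the first 2 minutes after startup.
STARTUP_GRACE_SECONDS = 2 * 60


def choose_speed(seconds_since_startup, user_is_away):
    """Pick an indexing speed from session age and user activity.

    ASSUMPTION: the startup window takes priority over user_is_away.
    The quoted docs do not say which rule wins when both apply.
    """
    if seconds_since_startup < STARTUP_GRACE_SECONDS:
        return IndexingSpeed.SNAIL_PACE
    if user_is_away:
        return IndexingSpeed.FULL_SPEED
    return IndexingSpeed.REDUCED_SPEED
```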

(Personally I've always had all that indexing/social-semantic-desktop
stuff disabled completely.)



Re: [gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread BRM
----- Original Message ----

 From: Paul Hartman paul.hartman+gen...@gmail.com
 To: gentoo-user@lists.gentoo.org
 Sent: Fri, November 19, 2010 11:31:39 AM
 Subject: Re: [gentoo-user] Nepomuk indexing, what triggers it?
 
 On Fri, Nov 19, 2010 at 9:17 AM, Alan McKinnon alan.mckin...@gmail.com wrote:
  Hi all,
 
  Haven't had much luck finding this info:
 
  If I reboot this machine and start KDE, Nepomuk starts a rather long-lived
  index of my home directory. It takes up about 30-40% cpu and lasts as much as
  15 minutes sometimes. This is annoying because after a reboot I usually want
  to catch up on mail, rss feeds and fire up VirtualBox. So nepomuk is just
  wasting my time at this point.
 
 My /guess/ is that it scans every time you restart to be sure nothing
 changed while it was shut down. It doesn't know if you've dual-booted,
 logged into xfce, mounted the disk in another machine, had fsck remove
 files, etc.
 
 I think Tracker behaves the same way in gnome-land.

To add to it - Nepomuk has two parts (according to 
http://nepomuk.kde.org/node/2) that seem to be active in here:
1. Strigi - 
http://techbase.kde.org/Development/Tutorials/Metadata/Nepomuk/StrigiService
2. FileWatchService - 
http://techbase.kde.org/Development/Tutorials/Metadata/Nepomuk/FileWatchService

From the FileWatchService info:

However: due to the restrictions of all file watching systems available
(systems such as inotify are restricted to 8000 something watches, fam does
not support file moving monitoring, etc.) the service mostly relies on
KDirNotify. Thus, all operations performed by KDE applications through KIO are
monitored while all other operations (such as console commands) are missed.

So it really does need to check up on things during restart to get back in
sync, but also to find what it didn't know about from info not going through an
interface it is aware of.
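
To illustrate what "getting back in sync" involves, here is a minimal Python sketch that compares a stored snapshot of a directory tree with its current state. It is not how Strigi or the FileWatchService actually do it; the function names snapshot and diff_snapshots are invented for the example, and using (mtime, size) as the change marker is an assumption.

```python
import os


def snapshot(root):
    """Map every file path under root to its (mtime, size)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            state[path] = (st.st_mtime, st.st_size)
    return state


def diff_snapshots(old, new):
    """Return sorted lists of added, removed and changed paths."""
    old_paths, new_paths = set(old), set(new)
    added = sorted(new_paths - old_paths)
    removed = sorted(old_paths - new_paths)
    changed = sorted(p for p in old_paths & new_paths if old[p] != new[p])
    return added, removed, changed
```

Anything done from a console while the watcher was not looking (or not running) shows up in one of the three lists, and only those files need re-indexing. The walk itself still has to touch every file, which fits the long scan after login.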

Ben




Re: [gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread Alan McKinnon
Apparently, though unproven, at 18:31 on Friday 19 November 2010, Paul Hartman 
did opine thusly:

 On Fri, Nov 19, 2010 at 9:17 AM, Alan McKinnon alan.mckin...@gmail.com 
wrote:
  Hi all,
  
  Haven't had much luck finding this info:
  
  If I reboot this machine and start KDE, Nepomuk starts a rather
  long-lived index of my home directory. It takes up about 30-40% cpu and
  lasts as much as 15 minutes sometimes. This is annoying because after a
  reboot I usually want to catch up on mail, rss feeds and fire up
  VirtualBox. So nepomuk is just wasting my time at this point.
 
 My /guess/ is that it scans every time you restart to be sure nothing
 changed while it was shut down. It doesn't know if you've dual-booted,
 logged into xfce, mounted the disk in another machine, had fsck remove
 files, etc.
 
 I think Tracker behaves the same way in gnome-land.

I think that's a bit silly, to do a full scan just in case stuff changed.

If so, a very simple optimization would be to calculate a hash of some aspect 
of a directory, store the hash persistently, and only do a full scan if the 
hash is different.

I haven't read the code, so I'm in no real position to know how it's done or 
how to optimize it.
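
For what it's worth, the hash idea could look something like this in Python. It is purely a sketch of the suggestion, not anything Nepomuk does: the function names are made up, the choice of "some aspect" (entry names, sizes and mtimes) is an assumption, and the persistent store is stood in for by a plain dict.

```python
import hashlib
import os


def directory_fingerprint(path):
    """Hash the names, sizes and mtimes of the entries directly in path."""
    digest = hashlib.sha256()
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            st = entry.stat(follow_symlinks=False)
            record = f"{entry.name}\0{st.st_size}\0{st.st_mtime_ns}\n"
            digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


def needs_rescan(path, stored_fingerprints):
    """Return True if path changed since its fingerprint was last stored.

    stored_fingerprints is a dict here; a real tool would have to keep
    it on disk between sessions (ASSUMPTION: persistence is left out).
    """
    current = directory_fingerprint(path)
    if stored_fingerprints.get(path) == current:
        return False
    stored_fingerprints[path] = current
    return True
```

Note that computing the fingerprint still means a stat on every entry, so this would save the expensive content indexing but not the directory walk itself.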

  How does nepomuk know when to do its thing, how can I tweak what it does
  and how can I discover why it feels it necessary to reindex my entire
  maildir when surely it has a perfectly valid index already from just
  before I shut down?
 
 I am pretty sure it is tied to your KDE user session, and not running
 as a system daemon in the background. Perhaps you can suspend it via
 some autostarting script, and then resume it after whatever amount of
 time you're comfortable with.
 
 Looking in here:
 http://api.kde.org/4.5-api/kdebase-runtime-apidocs/nepomuk/html/classNepomuk_1_1IndexScheduler.html
 
 In the indexing speed settings, it says:
 
 enum Nepomuk::IndexScheduler::IndexingSpeed
 
 Enumerator:
 FullSpeed     Index at full speed, i.e. do not use any artificial delays.
               This is the mode used if the user is away.
 
 ReducedSpeed  Reduce the indexing speed mildly.
               This is the normal mode used while the user works. The indexer
               uses small delay between indexing two files in order to keep
               the load on CPU and IO down.
 
 SnailPace     Like ReducedSpeed delays are used but they are much
               longer to get even less CPU and IO load.
               This mode is used for the first 2 minutes after startup to give
               the KDE session manager time to start up the KDE session rapidly.
 
 
 So based on that, for the first 2 minutes after KDE starts it should
 be using the least aggressive indexing speed (but indexing
 nevertheless).

Good find. Personally, I'd like it to wait for 10-20 minutes after session 
start, then just run at SnailPace, period. This machine is seldom booted or 
even logged out of KDE (I suspend), so I can tolerate the wait as it's rare.

 (Personally I've always had all that indexing/social-semantic-desktop
 stuff disabled completely.)

Maybe I should too. But I *did* want to use this nepomuk thing for a while 
and see for myself what the semantic-desktop can do. It looks like it 
could be awesomely useful (like Google turned out to be awesomely useful) but 
it takes real usage to know.




-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread Alan McKinnon
Apparently, though unproven, at 20:13 on Friday 19 November 2010, BRM did 
opine thusly:

  My guess is that it scans every time you restart to be sure nothing
  changed while it was shut down. It doesn't know if you've dual-booted,
  logged into xfce, mounted the disk in another machine, had fsck remove
  files, etc.
 
  
 
  I think Tracker behaves the same way in gnome-land.
 
 To add to it - Nepomuk has two parts (according to 
 http://nepomuk.kde.org/node/2) that seem to be active in here:
 1. Strigi - 
 http://techbase.kde.org/Development/Tutorials/Metadata/Nepomuk/StrigiService
 2. FileWatchService -
 http://techbase.kde.org/Development/Tutorials/Metadata/Nepomuk/FileWatchService
 
 From the FileWatchService info:
 
 However: due to the restrictions of all file watching systems available
 (systems such as inotify are restricted to 8000 something watches, fam
 does not support file moving monitoring, etc.) the service mostly relies
 on KDirNotify. Thus, all operations performed by KDE applications through
 KIO are monitored while all other operations (such as console commands)
 are missed.
 
 So it really does need to check up on things during restart to get back in
 sync, but also to find what it didn't know about from info not going
 through an interface it is aware of.

Well at least that explains the reason for the current state of affairs. 
Thanks for the find.


-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] Nepomuk indexing, what triggers it?

2010-11-19 Thread Alex Schuster
Alan McKinnon writes:

 If I reboot this machine and start KDE, Nepomuk starts a rather
 long-lived index of my home directory. It takes up about 30-40% cpu
 and lasts as much as 15 minutes sometimes. This is annoying because
 after a reboot I usually want to catch up on mail, rss feeds and fire
 up VirtualBox. So nepomuk is just wasting my time at this point.
 
 How does nepomuk know when to do its thing, how can I tweak what it
 does and how can I discover why it feels it necessary to reindex my
 entire maildir when surely it has a perfectly valid index already from
 just before I shut down?

I think it starts scanning everything over again at every login. I've also 
been annoyed by that, so I deactivated it, and activate it from time to 
time when I am away, so it won't bother me.
Or you can have it active, and during login you can suspend Strigi's 
indexing by right-clicking on the Nepomuk/Strigi icon in the panel.

You might be interested in this article that came up on the Planet KDE RSS 
feed yesterday:
http://www.afiestas.org/nepomuk-is-not-fast-is-instant/

It suggests setting fs.inotify.max_user_watches to something quite large 
like 524288 via sysctl. I assume this is the number of directories being 
monitored with inotify, and if this is larger than the total number of 
directories, changes in a directory will be noticed at once. So maybe this 
will avoid the periodic scanning altogether? I did not try this yet. But it 
won't stop the first scan after login.
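
A quick way to check whether the limit would cover the indexed folders, as a Python sketch. It takes the assumption above at face value (one inotify watch per directory); the function names are invented for the example, and /proc/sys/fs/inotify/max_user_watches is where Linux exposes the sysctl value.

```python
import os

# Linux exposes the fs.inotify.max_user_watches sysctl at this path.
MAX_WATCHES_PATH = "/proc/sys/fs/inotify/max_user_watches"


def read_max_user_watches(path=MAX_WATCHES_PATH):
    """Read the current per-user inotify watch limit."""
    with open(path) as f:
        return int(f.read().strip())


def count_directories(root):
    """Count root itself plus every directory below it.

    os.walk yields one tuple per directory and does not follow
    symlinked directories by default.
    """
    return sum(1 for _ in os.walk(root))


def watches_suffice(roots, limit):
    """Return (directories needing a watch, whether the limit covers them).

    ASSUMPTION: one watch per directory, and no other program of the
    same user is consuming watches.
    """
    needed = sum(count_directories(root) for root in roots)
    return needed, needed <= limit


# Example (on Linux):
# print(watches_suffice([os.path.expanduser("~")], read_max_user_watches()))
```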

I think I will have to trim the list of directories to index. Currently, I 
have selected my own and another user's $HOME, and some data directories. 
This gives 666,000 files, which is probably a lot. So I guess I'll skip my 
MP3s, as they are indexed already by Amarok, and also those many 
directories with source code.
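
To see which directories are worth dropping from the list, something like the following Python sketch would do. The function names are made up for the example, and it only counts files; it knows nothing about which folders Nepomuk is actually configured to index.

```python
import os


def count_files(root):
    """Count all regular directory entries listed as files under root."""
    return sum(len(filenames) for _dp, _dn, filenames in os.walk(root))


def files_per_subdirectory(root):
    """Return (subdirectory name, file count) pairs, largest first."""
    counts = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts.append((entry.name, count_files(entry.path)))
    # Sort by descending count, then by name for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Example: list the ten biggest subdirectories of $HOME by file count.
# for name, n in files_per_subdirectory(os.path.expanduser("~"))[:10]:
#     print(f"{n:8d}  {name}")
```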

Wonko