Ok, so I'm not yet an expert on Nepomuk or Strigi, but I am investing time in
coming up to speed with them.
Vishesh Handa wrote:
> I don't think this entire port should take me more than a week.
I'll bet you a beer this is still being discussed a year from now :-)
> This month I'm focusing on the file indexing part of Nepomuk, and right now
> it takes forever for Strigi to index all
> my files.
Well, I feel and share your pain, but I wonder... the file indexer has been
banging away on my machine for at least 14
hours now (I'm on Kubuntu 4.9, so no patch for the reindexing thing... anyway).
I have been mostly away from my machine
or doing light browsing/email for that time, so other than me writing this
mail, Firefox and the usual system/session
stuff, there are no other demands on the CPU.
Most of the 70% CPU utilization is Virtuoso, with blips of 3% or so every few
seconds from nepomindex process instances.
There is practically no disk I/O at all (500ms every 50-70s) - all my indexable
folders are on a physically distinct
drive so it's easy to notice.
So my complaint is: why isn't the indexer using more resources?
(i.e. it appears not to use resources when it could, and to use too many when
it shouldn't, which is the reverse of what
you want).
> I'm not the only one with this problem. We already have another project
> called the nepomuk-metadata-extractor [1]
> which implements the following indexers -
> * PDF ( Poppler Based )
Yeah, Poppler's pdfinfo already extracts more data than the current PDF
indexer; I had been thinking about this
myself. Go Jörg!
> I would like to move these indexers into nepomuk-core [...] It would then
> call the appropriate indexing class (if one
> exists) which would populate the SimpleResourceGraph or it would just add the
> appropriate rdf types.
I think you have it "inside out"; it needs to be *more pluggable*: make it
easier to write a replacement
indexer for a given MIME type, and perhaps find a clever way to factor Nepomuk
domain-specific knowledge out of
file-type expertise.
For example, off the top of my head, I can think of at least ten different
types of file I would want indexed; I'm sure
that everyone here could name ten different types. It is an endless and
thankless task.
As evidence - Jörg wrote:
> This will help a lot to make indexing better and easier to contribute.
> Strigi seems to be a very powerful solution. But writing the
> streamanalyzers or fixing in them isn't very intuitive.
So, four suggestions (not sure how much of this is already done now):
(1) The indexer framework is data agnostic and only finds files/resources for
indexing; it has two jobs only:
 - {a} wrangling which process to launch for a given MIME type, resource
allocation, and preemptive termination of that process.
 - {b} handling the triples supplied by the process: simple validation, and
transaction support in case of a crash or other
preemptive termination.
Why? Language-agnostic indexer code: C++, bash, assembler, Python, Erlang or
JavaScript, whatever works for the
resource type in question. The indexer only has to know about being a regular
process.
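To sketch what I mean (in Python, purely illustrative: INDEXERS, run_indexer
and the binary names are all made up, not existing Nepomuk code), jobs {a} and
{b} could look roughly like this:

```python
import subprocess

# MIME type -> command line of the external indexer (any language at all;
# the framework only sees a regular process). Binary names are hypothetical.
INDEXERS = {
    "application/pdf": ["pdfindexer"],
    "image/jpeg": ["exifindexer"],
}

def run_indexer(mime_type, paths, timeout=60):
    """Job {a}: launch the registered indexer with a deadline.
    Job {b}: collect the triple lines it writes to stdout."""
    cmd = INDEXERS.get(mime_type)
    if cmd is None:
        return []  # no indexer registered for this MIME type
    try:
        out = subprocess.run(cmd + list(paths), capture_output=True,
                             text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return []  # preemptive termination (or missing binary): drop the batch
    return [line for line in out.stdout.splitlines() if line.strip()]
```

The point is that the framework knows nothing about file formats; it just
spawns a process, enforces a deadline, and validates whatever triples come
back.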
(2) Support multiple resources (of the same type) per process (for launch
efficiency).
The framework can keep a table of discovered resources of a given MIME type
and, when it has enough (10? 20?), launch the
right process. Maybe in the future we grade each indexer as lightweight or
piggy and decide to launch several sets
of processes for several MIME types in parallel.
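A rough sketch of that table (again illustrative Python; BATCH_SIZE, launch()
and discovered() are names I just invented):

```python
from collections import defaultdict

BATCH_SIZE = 20
pending = defaultdict(list)   # MIME type -> queued file paths
launched = []                 # record of (mime, batch) launches, for the sketch

def launch(mime_type, paths):
    # stand-in for actually spawning the indexer process for this batch
    launched.append((mime_type, list(paths)))

def discovered(path, mime_type):
    """Called by the file watcher/crawler for each new indexable file."""
    batch = pending[mime_type]
    batch.append(path)
    if len(batch) >= BATCH_SIZE:   # batch is full: one process, many files
        launch(mime_type, batch)
        batch.clear()

def flush():
    """Launch whatever is left over, e.g. when the crawler finishes."""
    for mime_type, batch in pending.items():
        if batch:
            launch(mime_type, batch)
            batch.clear()
```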
(3) Support chains of processing per resource.
Why? So as not to have to re-implement features of a previous indexer. Say I
write an MPEG-4 parser to extract
closed-caption text; I should not have to reimplement Trueg's TV Show stuff.
Order of operations might be important: post-processing seems like something
that several people have asked about, and
I'm certainly interested in "hooking" onto the indexer to capture each freshly
completed file.
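The chain idea could be as simple as this sketch (hypothetical Python; the
stage functions are placeholders for real extractors):

```python
def run_chain(path, stages):
    """Run an ordered chain of extractors over one resource.
    Each stage sees the triples accumulated so far, so later stages
    can post-process earlier results instead of re-extracting them."""
    triples = []
    for stage in stages:
        triples.extend(stage(path, triples))
    return triples
```

So my hypothetical closed-caption extractor runs first, and Trueg's TV Show
logic (here a placeholder) could run later in the chain over its output.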
(4) Perhaps hand each process a handle (socket? D-Bus?) to write to.
Yeah, I've been reading about 'systemd' :-)
Imagine the simplest indexer, one that adds only resource/tag/value triples:
it becomes just two nested loops:
- iterate over resources
-- iterate over meta data items.
--- Test if resource contains item 1 (eg: jpeg/exif exposure), output triple
for item 1
--- Test if resource contains item 2 (eg: jpeg/exif iso), output triple for
item 2
- exit.
What I'm trying to get at here is that if there is some document type that I
am expert in, or for which good library
support already exists (JPEG, PDF and MP3 are good examples), then all I need
to do is take a list of files and spit out
triples, rather than understand how to plug into the framework.
The only Nepomuk domain-specific knowledge I need is the correct property URIs
and the appropriate format for the values
of such properties.
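To make that concrete, here is what such a minimal indexer might look like as
a sketch (Python again; the property URIs are example.org placeholders, not
real Nepomuk ontology URIs, and read_metadata stands in for a real EXIF
library):

```python
# property URI -> raw metadata key; placeholders for the real ontology
PROPERTIES = [
    ("http://example.org/ont#exposureTime", "ExposureTime"),
    ("http://example.org/ont#isoSpeed", "ISOSpeedRatings"),
]

def index(files, read_metadata):
    """The two nested loops: read_metadata(path) -> dict of raw metadata,
    and we emit one N-Triples-style line per property the file has."""
    lines = []
    for path in files:                        # iterate over resources
        meta = read_metadata(path)
        for prop_uri, key in PROPERTIES:      # iterate over metadata items
            if key in meta:                   # emit a triple only on a hit
                lines.append(f'<file://{path}> <{prop_uri}> "{meta[key]}" .')
    return lines
```

All the format expertise lives in read_metadata; the only Nepomuk knowledge is
the PROPERTIES table.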
Anyway, enough already :-)
dean
_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk