Ok, so I'm not yet an expert on Nepomuk or Strigi, but I am investing time in
coming up to speed with them.
Vishesh Handa wrote:
> I don't think this entire port should take me more than a week.
I'll bet you a beer this is still being discussed a year from now :-)
> This month I'm focusing on the file indexing part of Nepomuk, and right now
> it takes forever for Strigi to index all
> my files.
Well, I feel and share your pain, but I wonder... the file indexer has been
banging away on my machine for at least 14
hours now (I'm on Kubuntu 4.9, so no patch for the reindexing thing... anyway).
I have been mostly away from my machine
or doing light browsing/email for that time, so other than me writing this
mail, Firefox and the usual system/session
stuff, there are no other demands on the CPU.
Most of the 70% CPU utilization is Virtuoso, with blips of 3% or so every few
seconds from nepomindex process instances.
There is practically no disk I/O at all (500ms every 50-70s) - all my indexable
folders are on a physically distinct
drive so it's easy to notice.
So my complaint is: why isn't the indexer using more resources?
(i.e. it appears not to use resources when it could, and to use too many when
it shouldn't, which is the reverse of what
you want).
> I'm not the only one with this problem. We already have another project
> called the nepomuk-metadata-extractor [1]
> which implements the following indexers -
> * PDF ( Poppler Based )
Yeah, Poppler's pdfinfo already extracts more data than the current PDF
indexer; I had been thinking about this
myself. Go Jörg!
> I would like to move these indexers into nepomuk-core [...] It would then
> call the appropriate indexing class (if one
> exists) which would populate the SimpleResourceGraph or it would just add the
> appropriate rdf types.
I think you have it "inside out"; it needs to be *more pluggable*: make it
easier to write a replacement
indexer for a given MIME type, and perhaps find a clever way to factor Nepomuk
domain-specific knowledge out of
file-type expertise.
For example, off the top of my head, I can think of at least ten different
types of file I would want indexed; I'm sure
that everyone here could name ten different types. It is an endless and
thankless task.
As evidence - Jörg wrote:
> This will help a lot to make indexing better and easier to contribute.
> Strigi seems to be a very powerful solution. But writing the
> streamanalyzers or fixing in them isn't very intuitive.
So, four suggestions (not sure how much of this is already done now):
(1) The indexer framework is data agnostic and only finds files/resources for
indexing; it has two jobs only:
 - {a} wrangling which process to launch for a given MIME type, resource
allocation, and preemptive termination of that process.
 - {b} handling the triples supplied by the process: simple validation, and
transaction support in case of a crash or other
preemptive termination.
Why? Language-agnostic indexer code: C++, bash, assembler, Python, Erlang or
JavaScript, whatever works for the
resource type in question. The indexer only has to know about being a regular
process.
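To sketch what I mean (in Python, purely illustrative: INDEXERS, run_indexer
and the binary names are all made up, not existing Nepomuk code), jobs {a} and
{b} could look roughly like this:

```python
import subprocess

# MIME type -> command line of the external indexer (any language at all;
# the framework only sees a regular process). Binary names are hypothetical.
INDEXERS = {
    "application/pdf": ["pdfindexer"],
    "image/jpeg": ["exifindexer"],
}

def run_indexer(mime_type, paths, timeout=60):
    """Job {a}: launch the registered indexer with a deadline.
    Job {b}: collect the triple lines it writes to stdout."""
    cmd = INDEXERS.get(mime_type)
    if cmd is None:
        return []  # no indexer registered for this MIME type
    try:
        out = subprocess.run(cmd + list(paths), capture_output=True,
                             text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return []  # preemptive termination (or missing binary): drop the batch
    return [line for line in out.stdout.splitlines() if line.strip()]
```

The point is that the framework knows nothing about file formats; it just
spawns a process, enforces a deadline, and validates whatever triples come
back.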
(2) Support multiple resources (of the same type) per process (for launch
efficiency).
The framework can keep a table of discovered resources of a given MIME type
and, when it has enough (10? 20?), launch the
right process. Maybe in the future we grade each indexer as lightweight or
piggy and decide to launch several sets
of processes for several MIME types in parallel.
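A rough sketch of that table (again illustrative Python; BATCH_SIZE, launch()
and discovered() are names I just invented):

```python
from collections import defaultdict

BATCH_SIZE = 20
pending = defaultdict(list)   # MIME type -> queued file paths
launched = []                 # record of (mime, batch) launches, for the sketch

def launch(mime_type, paths):
    # stand-in for actually spawning the indexer process for this batch
    launched.append((mime_type, list(paths)))

def discovered(path, mime_type):
    """Called by the file watcher/crawler for each new indexable file."""
    batch = pending[mime_type]
    batch.append(path)
    if len(batch) >= BATCH_SIZE:   # batch is full: one process, many files
        launch(mime_type, batch)
        batch.clear()

def flush():
    """Launch whatever is left over, e.g. when the crawler finishes."""
    for mime_type, batch in pending.items():
        if batch:
            launch(mime_type, batch)
            batch.clear()
```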
(3) Support chains of processing per resource.
Why? So as not to have to re-implement features of a previous indexer. Say I
write an MPEG-4 parser to extract
closed-caption text; I should not have to reimplement Trueg's TV Show stuff.
Order of operations might be important: post-processing seems like something
that several people have asked about, and
I'm certainly interested in "hooking" onto the indexer to capture each freshly
completed file.
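The chain idea could be as simple as this sketch (hypothetical Python; the
stage functions are placeholders for real extractors):

```python
def run_chain(path, stages):
    """Run an ordered chain of extractors over one resource.
    Each stage sees the triples accumulated so far, so later stages
    can post-process earlier results instead of re-extracting them."""
    triples = []
    for stage in stages:
        triples.extend(stage(path, triples))
    return triples
```

So my hypothetical closed-caption extractor runs first, and Trueg's TV Show
logic (here a placeholder) could run later in the chain over its output.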
(4) Perhaps hand each process a handle (socket? D-Bus?) to write to.
Yeah, I've been reading about 'systemd' :-)
Imagine the simplest indexer, one that adds only resource/tag/value triples:
it becomes just two nested loops:
- iterate over resources
-- iterate over meta data items.
--- Test if resource contains item 1 (eg: jpeg/exif exposure), output triple
for item 1
--- Test if resource contains item 2 (eg: jpeg/exif iso), output triple for
item 2
- exit.
What I'm trying to get at here is that if there is some document type that I
am expert in, or for which good library
support already exists (JPEG, PDF and MP3 are good examples), then all I need
to do is take a list of files and spit out
triples, rather than understand how to plug into the framework.
The only Nepomuk domain-specific knowledge I need is the correct property URIs
and the appropriate format for the values
of such properties.
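To make that concrete, here is what such a minimal indexer might look like as
a sketch (Python again; the property URIs are example.org placeholders, not
real Nepomuk ontology URIs, and read_metadata stands in for a real EXIF
library):

```python
# property URI -> raw metadata key; placeholders for the real ontology
PROPERTIES = [
    ("http://example.org/ont#exposureTime", "ExposureTime"),
    ("http://example.org/ont#isoSpeed", "ISOSpeedRatings"),
]

def index(files, read_metadata):
    """The two nested loops: read_metadata(path) -> dict of raw metadata,
    and we emit one N-Triples-style line per property the file has."""
    lines = []
    for path in files:                        # iterate over resources
        meta = read_metadata(path)
        for prop_uri, key in PROPERTIES:      # iterate over metadata items
            if key in meta:                   # emit a triple only on a hit
                lines.append(f'<file://{path}> <{prop_uri}> "{meta[key]}" .')
    return lines
```

All the format expertise lives in read_metadata; the only Nepomuk knowledge is
the PROPERTIES table.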
Anyway, enough already :-)
dean
_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk