Hi Vishesh,

>> 2. symlink handling seems to be gone - if a file is symlinked to two
>> places it now gets indexed twice. (Maybe you knew about this?)
>
> I remember Sebastian fixing symlink handling, but I've written a lot of
> code from scratch so I'll have to check it out again. I'll add it to my
> list of things to do.

Cool.

> Either will do. I haven't really thought about it. I just don't want to
> put in too much effort. If some library (like qt) can extract the data for
> us, I'd rather use it, instead of writing the parsing code on our own.
>
> Do you want to start writing some plugins?

Are you ready for me to? For html I would probably use webkit or khtml (if
qt can't handle it), and for latex I would probably copy-paste detex
(http://code.google.com/p/opendetex/), since it doesn't really have a
library implementation. I can wait until the interfaces are more mature if
you like though.

>> 4. A number of times while indexing, I got the error message:
>> 'nepomukindexer(13152)/nepomuk (strigi service): SimpleIndexerError:
>> "http://www.w3.org/1999/02/22-rdf-syntax-ns#type has a rdfs:range of
>> http://www.w3.org/2000/01/rdf-schema#Class" ' (not sure what it
>> means).
>
> It means that I, in my hurry, have not written good plugins. They are
> pushing incorrect data, and Nepomuk won't let them. In this case it seems
> I have added a property (rdf:type, something) where the something should
> be a class, but it is not. Probably a typo somewhere.

Ok. Quite understandable :)

Simeon

>> On 20 September 2012 18:25, Vishesh Handa <[email protected]> wrote:
>> > Another update
>> >
>> > I've pushed my changes into the feature/newIndexer branch. If someone
>> > could review it, it would be nice.
>> >
>> > The current architecture consists of 2 queues - BasicIndexingQueue and
>> > FileIndexingQueue.
>> >
>> > The BasicIndexingQueue just extracts the mimetype, stat results and url.
>> > On my system, with the latest Soprano, I manage around 10 files per
>> > second. This queue is NOT throttled in any way, and makes virtuoso peak
>> > around 70% of one cpu. I'm still working on reducing this. I would
>> > ideally like this part to not be noticeable, even if it is working at
>> > full speed.
>> >
>> > The FileIndexingQueue calls the 'nepomukindexer' process which extracts
>> > the actual metadata from the file. It only works when the system is
>> > IDLE. This is monitored using KIdleTime, which is not that great, since
>> > I could have left a compiling job running and during that time I don't
>> > want the file indexing to start. Ditto when watching an HD movie.
>> >
>> > Here is what is left -
>> >
>> > 1. The Nepomuk Controller widget needs to be updated properly. I'm not
>> > sure if I should inform the controller about the basic indexing. Any
>> > opinions?
>> >
>> > 2. Event Monitoring - Pausing on battery and all. For now the old
>> > approach is being used, where nothing gets indexed when on battery, but
>> > I'm not sure if that is a good idea. I think I'm going to change it to
>> > only pause the file indexing queue when on battery.
>> >
>> > 3. Separate Process - It is not required at all. I would however like to
>> > keep it for debugging purposes. If no one has any problems, I'll stop
>> > the new process approach, but still keep the nepomukindexer executable.
>> >
>> > 4. Plugin Interface - They are currently called Extractors which is a
>> > lousy name, but I couldn't come up with anything better. We need a
>> > better name and a proper interface. I've just hacked together a plugin
>> > system without thinking about the future design too much. This can be a
>> > good thing and a bad thing.
>> >
>> > We will have to release a public interface for 4.10, especially if we
>> > want other people to write plugins.
>> >
>> > 5. Plugins - There are only 5 plugins so far, and I have no plans of
>> > writing any more. They are extremely simple to write, and my time is
>> > better spent doing other things. I think this is an amazing place to get
>> > people interested. So, we need to finalize (4) so that I can blog about
>> > it and start talking about it.
>> >
>> > 6. Packagers - I talked to Will (openSUSE) about the new approach, and
>> > they would like the plugins to be in a separate tarball / repo. It's a
>> > lot easier for them to ship it that way. I have no problem with that.
>> > Does anyone have any opinions?
>> >
>> > 7. Needs a proper review - Someone (not just Sebastian) needs to review
>> > the code. The Nepomuk related part isn't that much, and it's not scary.
>> > So please review it. I'd like a proper "Ship it" before I merge it into
>> > master, and I would like to get it into master this month.
>> >
>> > That's about it :)
>> >
>> > On Wed, Sep 12, 2012 at 9:18 PM, Vishesh Handa <[email protected]> wrote:
>> >>
>> >> Hey everyone
>> >>
>> >> Quick update. We have analyzers for -
>> >>
>> >> * taglib
>> >> * exiv2
>> >> * ffmpeg
>> >> * pdf
>> >> * plain text files
>> >>
>> >> Documents are still a problem. I've contacted the Calligra team. I'll
>> >> let you know what they say.
>> >>
>> >> The analyzers work pretty well. I might just code an epub based
>> >> analyzer today.
>> >>
>> >> Tomorrow, I'll start working on a plugin based architecture, and adding
>> >> two queues in the index scheduler. One will immediately call the
>> >> SimpleIndexer to just save the basic metadata, and the other one will
>> >> only work when idle. It'll do the proper indexing for the file.
>> >>
>> >> The obvious problem with this approach is that we need a way of saying
>> >> that this file has passed the first indexing level, and needs to go
>> >> through the second level. Maybe a new property for that?
>> >>
>> >>
>> >> On Tue, Sep 11, 2012 at 8:18 PM, Sebastian Trüg <[email protected]>
>> >> wrote:
>> >>>
>> >>> I like this.
>> >>> But I would vote for a plugin system nonetheless. A simple one though.
>> >>> A plugin can register for one or more mimetypes and then it simply gets
>> >>> the file path and returns a SimpleResourceGraph.
>> >>> You merge all and are done.
>> >>> Plugins should never deal with file size, mimetype, or any of those
>> >>> basic things the framework can handle.
>> >>>
>> >>> This means that the first sweep is done without plugins, the second one
>> >>> would call the plugins and the third one, well, that could be yet
>> >>> another plugin system which uses RDF types instead of mimetypes. For
>> >>> example: the TV show plugin handles nfo:Video. The framework thus calls
>> >>> the plugin, provides the path and a handle to the existing metadata.
>> >>> The plugin can then simply run its filename analysis and continue from
>> >>> there.
>> >>>
>> >>> OK, one issue we have here is the following: the tv show extractor for
>> >>> example works better when run on sets of video files, preferably a
>> >>> whole season. Then it only needs to get feedback from the user once or
>> >>> can even do its job automatically. This, however, means that
>> >>> third-sweep plugins would need an option
>> >>> "can-handle-more-than-one-file-at-a-time".
>> >>>
>> >>> My 2cents.
>> >>>
>> >>>
>> >>> On 09/11/2012 04:06 PM, Alex Fiestas wrote:
>> >>>>
>> >>>> I think we've discussed this somewhere but I don't remember the
>> >>>> outcome of the discussion xD
>> >>>>
>> >>>> I think it would be really interesting to have an indexer that uses a
>> >>>> 2-pass strategy.
>> >>>>
>> >>>> First pass will index only basic data such as name, dates, mimetype.
>> >>>>
>> >>>> Second pass will index specific stuff, previews, texts, tags...
>> >>>>
>> >>>> Doing this, we can even add third party "information fetchers" as a
>> >>>> 3rd pass, for example to get information about tv shows and such.
>> >>>>
>> >>>> Let's put an example:
>> >>>>
>> >>>> -New file in my Download folder detected
>> >>>> -Quick super fast indexer indexes data, name, mimetype
>> >>>>  From this point, this file is already usable in Nepomuk
>> >>>> -Second pass, indexing tags, previews
>> >>>> -Third pass (this can be onDemand via GUI) information from the
>> >>>>  internetz is fetched.
>> >>>>
>> >>>> I got this idea from spotlight (osx indexer metadata thing), the most
>> >>>> obvious way of seeing this in osx is when a new external storage is
>> >>>> plugged in: files will get indexed super fast but all you will get if
>> >>>> you perform a search is going to be filenames, not even mimetypes !
>> >>>>
>> >>>> Cheerz.
>> >>>> _______________________________________________
>> >>>> Nepomuk mailing list
>> >>>> [email protected]
>> >>>> https://mail.kde.org/mailman/listinfo/nepomuk
>> >>>>
>> >>
>> >>
>> >> --
>> >> Vishesh Handa
>> >>
>> >
>> >
>> > --
>> > Vishesh Handa
>> >
>
>
> --
> Vishesh Handa
>
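
For reference, the two-queue design and the mimetype-based plugins discussed
in this thread can be sketched roughly as below. This is only a minimal
Python sketch, not the actual Nepomuk code: the names EXTRACTORS, register,
TwoPassIndexer and the indexingLevel key are all hypothetical, a plain dict
stands in for a SimpleResourceGraph, and the is_idle callback stands in for
whatever idle detection (such as KIdleTime) the real scheduler uses.

```python
import mimetypes
import os
from collections import deque

# Hypothetical registry: mimetype -> list of extractor callables.
EXTRACTORS = {}

def register(*mimetypes_handled):
    """Decorator that registers an extractor for one or more mimetypes."""
    def wrap(func):
        for mt in mimetypes_handled:
            EXTRACTORS.setdefault(mt, []).append(func)
        return func
    return wrap

@register("text/plain")
def plain_text_extractor(path):
    # A plugin only sees the file path and returns its own metadata;
    # size, mimetype and other basics are left to the framework.
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return {"wordCount": len(text.split())}

class TwoPassIndexer:
    def __init__(self):
        self.store = {}              # path -> merged metadata
        self.basic_queue = deque()   # first pass, runs immediately
        self.file_queue = deque()    # second pass, runs only when idle

    def add(self, path):
        self.basic_queue.append(path)

    def run_basic(self):
        """First pass: url, mimetype and stat results only, no plugins."""
        while self.basic_queue:
            path = self.basic_queue.popleft()
            st = os.stat(path)
            mimetype, _ = mimetypes.guess_type(path)
            self.store[path] = {
                "url": path,
                "mimetype": mimetype,
                "size": st.st_size,
                "mtime": st.st_mtime,
                "indexingLevel": 1,  # assumed marker: second pass pending
            }
            self.file_queue.append(path)

    def run_file(self, is_idle):
        """Second pass: call matching plugins, but only while idle."""
        while self.file_queue and is_idle():
            path = self.file_queue.popleft()
            record = self.store[path]
            for extractor in EXTRACTORS.get(record["mimetype"], []):
                record.update(extractor(path))  # merge plugin results
            record["indexingLevel"] = 2
```

The indexingLevel key is one possible answer to the "has passed the first
indexing level" question raised above: a file is searchable by name and
mimetype after run_basic(), and run_file() upgrades it later. Keeping the
idle check as a callback means a smarter check (CPU load, battery state)
could be swapped in without touching the queues.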
