Hi all,

now that the WebMiner is working and in extragear, I would like to talk about how it could be integrated better into the current indexer. The current solution works as an additional service that listens for all newly added resources and calls the webminer in a QProcess.

Vishesh had the idea to combine this with the current indexer chain, which would help to control the process better (suspend/resume based on battery status and so on).

I've checked the source and saw that there are currently the basicindexer, which fetches the mimetype stuff, and the fileindexer, which takes all resources with the property "kext:indexingLevel < 2" and extracts additional information (the former strigi indexer).

At this point I would like to introduce the WebMiner with a proper queue/job like the fileindexer, working on all resources with "kext:indexingLevel == 2 or < 3". The WebMinerIndexerJob would call my current webminer, which would go into nepomuk-core too (as a subfolder, like the fileindexer).

The parts I would like to put into nepomuk-core are my plugin-based web extraction plus some basic Python plugins, so everything I have for the WebMiner at the moment except the ui parts.

This would not change the build dependencies but would add a few more runtime dependencies. In order to successfully fetch the data from the web we would need the Python modules

* re
* json
* urllib
* httplib2
* tvdb
* musicbrainzngs

as well as the krosspython plugin.

This would allow us to fetch:

* music data + cover from musicbrainz
* movie data + poster from themoviedb (imdb is not working anymore and is way too unstable and slow)
* tvshow data + banner from thetvdb
* document data from Microsoft Academic/SpringerLink

Any additional plugins, currently the broken imdb one (hopefully this will be fixed in the future) and the extended tvdbmal script, which also needs PyKDE/PyQt and probably more, should go into some kind of extragear repository or even kde-apps, for those who like to fetch data from other sources. nepomuk-core could then at least fetch most data out of the box.

The indexing could then be controlled via the overall indexing status and shown in the nepomuk-controller that sits in the systemtray.
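
To illustrate how the optional runtime dependencies listed above could be handled, here is a minimal sketch of a check that probes for the Python modules and skips any plugin whose modules are not installed. The plugin names and the mapping of plugins to modules are assumptions made up for this example, not the actual WebMiner layout; only the module names come from the list above (musicbrainzngs is the name the MusicBrainz bindings are published under).

```python
import importlib.util

# Hypothetical mapping of plugins to the modules they need.
# The grouping is an assumption for illustration only.
PLUGIN_REQUIREMENTS = {
    "musicbrainz": ["json", "musicbrainzngs"],
    "themoviedb": ["json", "urllib", "httplib2"],
    "thetvdb": ["tvdb"],
    "documents": ["re", "urllib", "httplib2"],
}


def missing_modules(modules):
    """Return the top-level modules from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def usable_plugins(requirements):
    """Split plugins into runnable ones and skipped ones (with the reason)."""
    usable, skipped = [], {}
    for plugin, modules in requirements.items():
        missing = missing_modules(modules)
        if missing:
            skipped[plugin] = missing
        else:
            usable.append(plugin)
    return usable, skipped


if __name__ == "__main__":
    usable, skipped = usable_plugins(PLUGIN_REQUIREMENTS)
    print("usable plugins:", usable)
    for plugin, missing in skipped.items():
        # A skipped plugin only means less metadata, nothing breaks.
        print("skipping %s, missing modules: %s" % (plugin, ", ".join(missing)))
```

With a check like this, a missing module only disables the plugins that need it, and the rest of the indexer chain keeps running.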

The current ui, which can be used to manually find and save the metadata, would go somewhere else (kde-runtime/workspace or wherever it might fit).

The biggest problem might be the generation of the SimpleResource classes, which currently takes a very long time. Hopefully this can be fixed too, as this problem has to be solved for any program that will use them in the future anyway.

Any other ideas, suggestions or comments? Would the mentioned runtime Python dependencies work, or will they still be a problem? The good thing here: even if those runtime dependencies are missing, the user won't get a broken desktop. Instead, the additional data will just not be fetched from the web.

Regards,
Jörg
_______________________________________________
Nepomuk mailing list
https://mail.kde.org/mailman/listinfo/nepomuk
