On Wednesday, 20 March 2013 19:57:45 Vishesh Handa wrote:
> On Wed, Mar 20, 2013 at 7:39 PM, <[email protected]> wrote:
> > On Tuesday, 19 March 2013 23:35:42 Vishesh Handa wrote:
> > > As you guys might remember, we moved away from Strigi for the 4.10
> > > release. Our solution, however, still does not support any document
> > > formats apart from PDF. We need to change that and support other
> > > formats. There are 2 possible ways to go about this -
> > >
> > > 1. We use Okular which supports a number of popular formats
> > > 2. We write our own indexers by using the relevant library.
> >
> > I know I risk starting a flamewar, or more likely, there's no risk,
> > and instead a 100% guarantee, but:
>
> Not really. It was mostly just a decision taken by me.
>
> > 3. Use libStreamAnalyzer.
> >
> > Take a look back at how many tiny issues and corner cases had to be
> > fixed so far, how many lib quirks had to be accounted for? This was
> > also the most significant source of troubles for libstreamanalyzer.
>
> The main reason I'm against this is Strigi does not have a maintainer.
> Bugs keep cropping up - It doesn't handle all kinds of odf files, docs
> files, etc. I do not want to have to fix them.

But now the Nepomuk file indexer needs a maintainer.

> Also, we're fundamentally duplicating work. Libraries already exist to
> parse those file formats, and they are actively being used all across
> kde. We can just reuse those libraries instead of having our own
> parsers, and maintaining them.

Which was never a problem for lsa, e.g. the ffmpeg plugin. No one
volunteered to write an Okular plugin or massage the TagLib people into
making public their stream-based api, which is used internally and
wrapped by the file-based public api. In fact, the plugin architecture
was intended to allow kde apps and libs to ship analyzers based on their
format-specific libs.

Oh, and of course libs have bugs too. You either report them and
patiently wait for a fix, or fix them yourself. E.g. ffmpeg may crash on
some malformed or exotic file, and it isn't a big problem for the
majority of its user base (redownload the file, delete it, open it with
another tool). A crashing analyzer is very bad for Nepomuk.

> > What has this duplication of effort accomplished so far? And what
> > happens if or hopefully when Nepomuk outgrows this file-based
> > sandbox?
>
> The duplication of effort has been quite small.
>
> Currently all of the indexing code in Nepomuk which is doing 80% of
> Strigi's job is about 1400 lines of code. In comparison the code
> required to just interface with Strigi in Nepomuk was a good 700
> lines. Also, now with our 2 tier approach, Strigi would be giving us
> data which has already been pushed. One could remove that data and
> all, but it's just not something I want to do.

LSA indexers can be selectively enabled, so a 2 or X tier approach has
been supported for ages but apparently not used. As to interface code,
the rdfindexer util from strigi is definitely smaller than 700 lines of
code.

> I'm not sure when we will outgrow this file-based sandbox, but based on
> our current requirements, we do not need anything more than file
> handling.
> The other additional stuff that Strigi used to provide was just
> discarded.

I can definitely see at least 1 use case: akonadi and providing metadata
for attachments. Yes, you can always download and store that 30 MB
attachment to a temp location, do the file analysis, but imap4 was
specifically intended to avoid this.

It's a rather bad idea to design frameworks based on immediate
requirements. It's an ok approach for a quick and dirty hack or a tool,
but a strategic mistake for a framework.

--
Evgeny
_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
