Re: [Nepomuk] Strigi Feeder

Vishesh Handa Wed, 14 Jul 2010 01:36:53 -0700

Small bug fix. The previous version would add the metadata even if the
resource was found. Oops!


Will test more.

- Vishesh Handa

On Wed, Jul 14, 2010 at 1:51 AM, Vishesh Handa <[email protected]> wrote:

> I've cleaned up the code, and added some comments. It works perfectly. It
> would be really nice if somebody (hint Trueg) could review the code.
>
> I'm posting a short summary of what the code does:
>
> *PROBLEM:*
> The Strigi analyzer, on analyzing a file, creates additional metadata which
> is linked to the file's metadata. For example, when indexing an audio file,
> say "Coldplay - Yellow" from the album 'X&Y', it will create 2 additional
> resources, one of type nco:Contact and the other of type nmm:MusicAlbum. It
> will do that for every indexed song that has the artist 'Coldplay' and album
> 'X&Y'. Nepomuk simply adds all the data to the database without checking if
> similar contacts or albums exist. This leads to multiple contacts, albums, etc.
> with the same names, and makes queries harder to write (and slower to run).
>
> Additionally, some files may not contain totally accurate metadata. For
> example, I have a song whose metadata says that it has 2 artists, both of
> whom are called "Coldplay" (exact same spelling). The Strigi analyzer creates
> 2 different resources for these identical contacts. They should be
> merged.
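
To make the "find it or create it" idea concrete, here is a minimal sketch in plain C++. It does not use Soprano or Nepomuk at all: ResourceStore, findOrCreate and the urn:example: URIs are made-up names for illustration, and a std::map stands in for the real database lookup.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for the Nepomuk store (an assumption,
// not the real API): maps a (type, name) pair to a resource URI.
class ResourceStore {
public:
    // Returns the URI of an existing resource with the same type and name,
    // or registers a new one if nothing matches.
    std::string findOrCreate(const std::string& type, const std::string& name) {
        const std::pair<std::string, std::string> key(type, name);
        std::map<std::pair<std::string, std::string>, std::string>::iterator it = m_byKey.find(key);
        if (it != m_byKey.end())
            return it->second;              // reuse, do not duplicate

        // Made-up URI scheme, for illustration only.
        const std::string uri = "urn:example:res" + std::to_string(m_byKey.size() + 1);
        m_byKey.insert(std::make_pair(key, uri));
        return uri;
    }

    std::size_t size() const { return m_byKey.size(); }

private:
    std::map<std::pair<std::string, std::string>, std::string> m_byKey;
};
```

Calling findOrCreate("nco:Contact", "Coldplay") twice returns the same URI both times, so every song by the same artist would end up linked to one contact.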
>
> Additionally, all the metadata created (even the contacts, albums, etc.)
> was contained in the same discardable graph. So when the file was deleted,
> the additional metadata was deleted as well.
>
> *SOLUTION:*
> The Nepomuk Indexer ( kdebase/runtime/nepomuk/strigibackend/indexerwriter.*
> ) now contains an additional thread, which takes all the statements from the
> IndexWriter, resolves duplicates and merges them. It has been done in a
> separate thread so that the indexing speed does not suffer.
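
The hand-off between the indexer and the extra thread is a classic blocking work queue. A rough sketch of that pattern, using std::mutex and std::condition_variable in place of QMutex and QWaitCondition, could look like this. RequestQueue and its methods are hypothetical names, and a std::string stands in for a whole request.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for the feeder's queue; not the code from the patch.
class RequestQueue {
public:
    // Called by the producer (the index writer) when a file is finished.
    void push(std::string request) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(request));
        }
        m_ready.notify_one();
    }

    // Asks the consumer to finish once the queue is drained.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stopped = true;
        }
        m_ready.notify_all();
    }

    // Blocks until a request is available or stop() was called.
    // Returns false once the queue is stopped and empty.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(m_mutex);
        // The stop flag is checked under the lock, so a stop() issued while
        // the consumer was busy cannot be missed.
        m_ready.wait(lock, [this] { return m_stopped || !m_queue.empty(); });
        if (m_queue.empty())
            return false;
        out = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<std::string> m_queue;
    bool m_stopped = false;
};
```

The worker thread would then simply loop on `while (queue.pop(request)) { /* resolve and store */ }`, while the indexing thread only ever pays for a push().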
>
> The current patch checks for blank nodes in the object / subject of the
> file's metadata, and tries to find them or creates them if not present. The
> patch reverts to the original behavior if any of the additionally
> generated metadata (contacts, albums) contain any blank nodes. In order to
> fix this, a full-blown dependency resolution algorithm would be required. I
> don't think that it is currently required.
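
As an illustration of the resolution step, here is a simplified sketch in plain C++ (no Soprano). Triple, Resolver and the urn:example: URIs are invented for the example, and a std::map replaces the SPARQL lookup. Like the patch, it only rewrites blank nodes that appear as objects of the main resource and does not follow blank nodes nested inside other blank nodes.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Triple { std::string subject, predicate, object; };

// Blank node identifiers start with ':' (the convention createNode() checks for).
inline bool isBlank(const std::string& id) { return !id.empty() && id[0] == ':'; }

typedef std::set<std::pair<std::string, std::string> > PropertySet;

// Hypothetical resolver: remembers which property set maps to which URI.
class Resolver {
public:
    std::vector<Triple> resolve(const std::vector<Triple>& input) {
        // 1. Collect the properties of every blank subject.
        std::map<std::string, PropertySet> blanks;
        for (const Triple& t : input)
            if (isBlank(t.subject))
                blanks[t.subject].insert(std::make_pair(t.predicate, t.object));

        // 2. Find an identical known resource, or create a new URI.
        std::map<std::string, std::string> mapping;
        for (const auto& entry : blanks) {
            auto found = m_known.find(entry.second);
            if (found == m_known.end()) {
                const std::string uri = "urn:example:" + std::to_string(m_known.size() + 1);
                found = m_known.insert(std::make_pair(entry.second, uri)).first;
            }
            mapping[entry.first] = found->second;
        }

        // 3. Keep the main resource's statements, replacing blank objects.
        std::vector<Triple> output;
        for (const Triple& t : input) {
            if (isBlank(t.subject))
                continue;                   // already stored in step 2
            Triple copy = t;
            auto m = mapping.find(t.object);
            if (m != mapping.end())
                copy.object = m->second;
            output.push_back(copy);
        }
        return output;
    }

    std::size_t knownResources() const { return m_known.size(); }

private:
    std::map<PropertySet, std::string> m_known;
};
```

Two blank nodes with exactly the same properties collapse into one URI, both within a single file and across files resolved by the same Resolver.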
>
> The patch also creates a separate (discardable) graph for each
> individual resource.
>
> *Problem not fixed :*
> This will only work on newly indexed files and does not affect the files
> which have already been indexed. We'll need some kind of merger to do that.
> It's a lot simpler to just re-index the files, but I don't think the end
> users would like that.
>
> *A New Problem :*
> Since the additional metadata now has its own graph, it will not be
> deleted if the file is deleted. We need some kind of cleaner which removes
> resources which are no longer in use.
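
Such a cleaner does not exist yet. One possible shape for it, sketched in plain C++ under the assumption that we can list the shared resources and all statements that link to them, is a simple "who is still referenced" pass. Link and findOrphans are hypothetical names.

```cpp
#include <set>
#include <string>
#include <vector>

// A statement reduced to the two ends we care about (hypothetical type).
struct Link { std::string subject, object; };

// Returns the shared resources (contacts, albums, ...) that no remaining
// statement references as its object. Those are candidates for removal.
std::set<std::string> findOrphans(const std::set<std::string>& shared,
                                  const std::vector<Link>& links) {
    std::set<std::string> orphans = shared;
    for (const Link& l : links)
        orphans.erase(l.object);    // still in use, so not an orphan
    return orphans;
}
```

If the only file pointing at an album is deleted, the album no longer appears as an object in any link and is reported as an orphan, while a contact still used by another file is kept.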
>
> And, that's about it.
>
> - Vishesh Handa
>
>
> On Tue, Jul 13, 2010 at 7:42 PM, Vishesh Handa <[email protected]> wrote:
>
>> Yes, I finally implemented it. :-D
>>
>> Please note that this is just the initial design. If you don't like the
>> API design, or anything in particular, please tell me!
>>
>> I've debugged it, and it seems to be running okay, but I'll test it more
>> thoroughly, and benchmark it later. For what it's worth, it seems to be
>> somewhat faster.
>>
>> There is one obvious bug in the implementation which I've highlighted.
>> There are ways to fix it, but that would make the code messier than it
>> already is, and AFAIK it currently isn't a problem, but it could be in the
>> future.
>>
>> - Vishesh Handa
>>
>>
>

Index: strigifeeder.h
===================================================================
--- strigifeeder.h	(revision 0)
+++ strigifeeder.h	(revision 0)
@@ -0,0 +1,94 @@
+/*
+  Copyright (C) 2010 Vishesh Handa <[email protected]>
+
+  This library is free software; you can redistribute it and/or
+  modify it under the terms of the GNU General Public License as
+  published by the Free Software Foundation; either version 2 of
+  the License, or (at your option) any later version.
+
+  This library is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+  Library General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this library; see the file COPYING.  If not, write to
+  the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
+  Boston, MA 02110-1301, USA.
+*/
+
+
+#ifndef STRIGIFEEDER_H
+#define STRIGIFEEDER_H
+
+#include <QtCore/QThread>
+#include <QtCore/QMutex>
+#include <QtCore/QWaitCondition>
+#include <QtCore/QUrl>
+#include <QtCore/QQueue>
+#include <QtCore/QStack>
+#include <QtCore/QSet>
+
+namespace Soprano {
+    class Model;
+    class Statement;
+    class Node;
+}
+
+namespace Nepomuk {
+    class StrigiFeeder : public QThread
+    {
+        Q_OBJECT
+    public:
+        StrigiFeeder( Soprano::Model* model, QObject* parent = 0);
+        virtual ~StrigiFeeder();
+
+        void stop();
+        void run();
+
+    public Q_SLOTS:
+        void begin( const QUrl & uri );
+
+        /**
+         * Adds \p st to the list of statements to be added.
+         * \p st may contain blank nodes. The context is ignored.
+         * Should be called between begin() and end().
+         *
+         * \sa begin end
+         */
+        void addStatement( const Soprano::Statement & st );
+
+        /**
+         * Adds the subject, predicate, object to the list of statements
+         * to be added. The subject or object may contain blank nodes.
+         * Should be called between begin() and end().
+         *
+         * \sa begin end
+         */
+        void addStatement( const Soprano::Node & subject,
+                           const Soprano::Node & predicate,
+                           const Soprano::Node & object );
+
+        void end();
+
+    private:
+        struct Request {
+            QUrl uri;
+            QSet<Soprano::Statement> statements;
+        };
+        QQueue<Request> m_queue;
+        QStack<Request> m_stack;
+
+        Soprano::Model* m_model;
+
+        QMutex m_queueMutex;
+        QWaitCondition m_queueWaiter;
+        bool m_stopped;
+
+        /// Generates a discardable graph for \p resourceUri
+        QUrl generateGraph( const QUrl& resourceUri );
+    };
+
+}
+
+#endif // STRIGIFEEDER_H
Index: nepomukindexwriter.cpp
===================================================================
--- nepomukindexwriter.cpp	(revision 1149492)
+++ nepomukindexwriter.cpp	(working copy)
@@ -1,5 +1,6 @@
 /*
   Copyright (C) 2007-2010 Sebastian Trueg <[email protected]>
+  Copyright (C) 2010 Vishesh Handa <[email protected]>
 
   This library is free software; you can redistribute it and/or
   modify it under the terms of the GNU General Public License as
@@ -22,6 +23,7 @@
 #include "nfo.h"
 #include "nie.h"
 #include "nrl.h"
+#include "strigifeeder.h"
 
 #include <Soprano/Soprano>
 #include <Soprano/Vocabulary/RDF>
@@ -118,6 +120,25 @@ namespace {
         return uri;
     }
 
+    class RegisteredFieldData
+    {
+    public:
+        RegisteredFieldData( const QUrl& prop, QVariant::Type t )
+        : property( prop ),
+        dataType( t ),
+        isRdfType( prop == Vocabulary::RDF::type() ) {
+        }
+
+        /// The actual property URI
+        QUrl property;
+
+        /// the literal range of the property (if applicable)
+        QVariant::Type dataType;
+
+        /// caching QUrl comparison
+        bool isRdfType;
+    };
+
     /**
      * Data objects that are used to store information relative to one
      * indexing run.
@@ -128,7 +149,7 @@ namespace {
         FileMetaData( const Strigi::AnalysisResult* idx );
 
         /// stores basic data including the nie:url and the nrl:GraphMetadata in \p model
-        void storeBasicData( Soprano::Model* model );
+        void storeBasicData( Nepomuk::StrigiFeeder* feeder );
 
         /// map a blank node to a resource
         QUrl mapNode( const std::string& s );
@@ -142,37 +163,12 @@ namespace {
         /// The file info - saved to prevent multiple stats
         QFileInfo fileInfo;
 
-        /// The URI of the graph that contains all indexed statements
-        QUrl context;
-
         /// a buffer for all plain-text content generated by strigi
         std::string content;
 
     private:
         /// The Strigi result
         const Strigi::AnalysisResult* m_analysisResult;
-
-        /// mapping from blank nodes used in addTriplet to our urns
-        QMap<std::string, QUrl> m_blankNodeMap;
-    };
-
-    class RegisteredFieldData
-    {
-    public:
-        RegisteredFieldData( const QUrl& prop, QVariant::Type t )
-            : property( prop ),
-              dataType( t ),
-              isRdfType( prop == Vocabulary::RDF::type() ) {
-        }
-
-        /// The actual property URI
-        QUrl property;
-
-        /// the literal range of the property (if applicable)
-        QVariant::Type dataType;
-
-        /// caching QUrl comparison
-        bool isRdfType;
     };
 
     FileMetaData::FileMetaData( const Strigi::AnalysisResult* idx )
@@ -185,92 +181,64 @@ namespace {
         // this will automatically find previous uses of the file in question
         // with backwards compatibility
         resourceUri = Nepomuk::Resource( fileUrl ).resourceUri();
-
-        // use a new random context URI
-        context = Nepomuk::ResourceManager::instance()->generateUniqueUri( "ctx" );
-    }
-
-    QUrl FileMetaData::mapNode( const std::string& s )
-    {
-        if ( s[0] == ':' ) {
-            if( m_blankNodeMap.contains( s ) ) {
-                return m_blankNodeMap[s];
-            }
-            else {
-                QUrl urn = Nepomuk::ResourceManager::instance()->generateUniqueUri( QString() );
-                m_blankNodeMap.insert( s, urn );
-                return urn;
-            }
-        }
-        // special case to properly handle nie:isPartOf relations created for containers
-        else if ( s == m_analysisResult->path() ) {
-            return resourceUri;
-        }
-        else {
-            return QUrl::fromEncoded( s.c_str() );
-        }
     }
 
-    void FileMetaData::storeBasicData( Soprano::Model* model )
+    void FileMetaData::storeBasicData( Nepomuk::StrigiFeeder * feeder )
     {
-        model->addStatement( resourceUri, Nepomuk::Vocabulary::NIE::url(), fileUrl, context );
+        feeder->addStatement( resourceUri, Nepomuk::Vocabulary::NIE::url(), fileUrl );
 
         // Strigi only indexes files and extractors mostly (if at all) store the nie:DataObject type (i.e. the contents)
         // Thus, here we go the easy way and mark each indexed file as a nfo:FileDataObject.
-        model->addStatement( resourceUri,
+        feeder->addStatement( resourceUri,
                              Vocabulary::RDF::type(),
-                             Nepomuk::Vocabulary::NFO::FileDataObject(),
-                             context );
+                              Nepomuk::Vocabulary::NFO::FileDataObject() );
         if ( fileInfo.isDir() ) {
-            model->addStatement( resourceUri,
+            feeder->addStatement( resourceUri,
                                  Vocabulary::RDF::type(),
-                                 Nepomuk::Vocabulary::NFO::Folder(),
-                                 context );
+                                  Nepomuk::Vocabulary::NFO::Folder() );
         }
-
-
-        // create the provedance data for the data graph
-        // TODO: add more data at some point when it becomes of interest
-        QUrl metaDataContext = Nepomuk::ResourceManager::instance()->generateUniqueUri( "ctx" );
-        model->addStatement( context,
-                             Vocabulary::RDF::type(),
-                             Nepomuk::Vocabulary::NRL::DiscardableInstanceBase(),
-                             metaDataContext );
-        model->addStatement( context,
-                             Vocabulary::NAO::created(),
-                             LiteralValue( QDateTime::currentDateTime() ),
-                             metaDataContext );
-        model->addStatement( context,
-                             Strigi::Ontology::indexGraphFor(),
-                             resourceUri,
-                             metaDataContext );
-        model->addStatement( metaDataContext,
-                             Vocabulary::RDF::type(),
-                             Nepomuk::Vocabulary::NRL::GraphMetadata(),
-                             metaDataContext );
-        model->addStatement( metaDataContext,
-                             Nepomuk::Vocabulary::NRL::coreGraphMetadataFor(),
-                             context,
-                             metaDataContext );
     }
 
     FileMetaData* fileDataForResult( const Strigi::AnalysisResult* idx )
     {
         return static_cast<FileMetaData*>( idx->writerData() );
     }
+
+    Soprano::Node createNode( const std::string & str ) {
+        QString identifier = QString::fromUtf8( str.c_str() );
+
+        if( !identifier.isEmpty() && identifier[0] == ':' ) {
+            identifier.remove( 0, 1 );
+            return Soprano::Node::createBlankNode( identifier );
+        }
+
+        //Not a blank node
+        return Soprano::Node( QUrl(identifier) );
+    }
 }
 
 
 class Strigi::NepomukIndexWriter::Private
 {
 public:
-    Private()
+    Private( Soprano::Model * model )
+        : repository( model )
     {
         literalTypes[FieldRegister::stringType] = QVariant::String;
         literalTypes[FieldRegister::floatType] = QVariant::Double;
         literalTypes[FieldRegister::integerType] = QVariant::Int;
         literalTypes[FieldRegister::binaryType] = QVariant::ByteArray;
         literalTypes[FieldRegister::datetimeType] = QVariant::DateTime; // Strigi encodes datetime as unsigned integer, i.e. addValue( ..., uint )
+
+        feeder = new Nepomuk::StrigiFeeder( model );
+        feeder->start();
+    }
+
+    ~Private()
+    {
+        feeder->stop();
+        feeder->wait();
+        delete feeder;
     }
 
     QVariant::Type literalType( const Strigi::FieldProperties& strigiType ) {
@@ -310,6 +278,8 @@ public:
 
     QStack<const Strigi::AnalysisResult*> currentResultStack;
 
+    Nepomuk::StrigiFeeder* feeder;
+
 private:
     QHash<std::string, QVariant::Type> literalTypes;
 };
@@ -318,8 +288,7 @@ private:
 Strigi::NepomukIndexWriter::NepomukIndexWriter( Soprano::Model* model )
     : Strigi::IndexWriter()
 {
-    d = new Private;
-    d->repository = model;
+    d = new Private( model );
     Util::storeStrigiMiniOntology( d->repository );
 }
 
@@ -387,8 +356,12 @@ void Strigi::NepomukIndexWriter::startAn
     if ( data->resourceUri.isEmpty() )
         data->resourceUri = Nepomuk::ResourceManager::instance()->generateUniqueUri( QString() );
 
+    // Start the feeder
+    kDebug() << "Starting the feeder";
+    d->feeder->begin( data->resourceUri );
+
     // store initial data to make sure newly created URIs are reused directly by libnepomuk
-    data->storeBasicData( d->repository );
+    data->storeBasicData( d->feeder );
 
     // remember the file data
     idx->setWriterData( data );
@@ -419,7 +392,7 @@ void Strigi::NepomukIndexWriter::addValu
         RegisteredFieldData* rfd = reinterpret_cast<RegisteredFieldData*>( field->writerData() );
 
         // the statement we will create, we will determine the object below
-        Soprano::Statement statement( md->resourceUri, rfd->property, Soprano::Node(), md->context );
+        Soprano::Statement statement( md->resourceUri, rfd->property, Soprano::Node() );
 
         //
         // Strigi uses rdf:type improperly since it stores the value as a string. We have to
@@ -461,12 +434,12 @@ void Strigi::NepomukIndexWriter::addValu
             if ( value[0] == ':' ) {
                 Nepomuk::Types::Property property( rfd->property );
                 if ( property.range().isValid() ) {
-                    statement.setObject( md->mapNode( value ) );
+                    statement.setObject( createNode( value ) );
                 }
             }
         }
 
-        d->repository->addStatement( statement );
+        d->feeder->addStatement( statement );
     }
 }
 
@@ -504,10 +477,7 @@ void Strigi::NepomukIndexWriter::addValu
         val = QDateTime::fromTime_t( value );
     }
 
-    d->repository->addStatement( Statement( md->resourceUri,
-                                            rfd->property,
-                                            val,
-                                            md->context) );
+    d->feeder->addStatement( md->resourceUri, rfd->property, val);
 }
 
 
@@ -522,10 +492,7 @@ void Strigi::NepomukIndexWriter::addValu
     FileMetaData* md = fileDataForResult( idx );
     RegisteredFieldData* rfd = reinterpret_cast<RegisteredFieldData*>( field->writerData() );
 
-    d->repository->addStatement( Statement( md->resourceUri,
-                                            rfd->property,
-                                            LiteralValue( value ),
-                                            md->context) );
+    d->feeder->addStatement( md->resourceUri, rfd->property, LiteralValue( value ) );
 }
 
 
@@ -540,10 +507,7 @@ void Strigi::NepomukIndexWriter::addValu
     FileMetaData* md = fileDataForResult( idx );
     RegisteredFieldData* rfd = reinterpret_cast<RegisteredFieldData*>( field->writerData() );
 
-    d->repository->addStatement( Statement( md->resourceUri,
-                                            rfd->property,
-                                            LiteralValue( value ),
-                                            md->context) );
+    d->feeder->addStatement( md->resourceUri, rfd->property, LiteralValue( value ) );
 }
 
 
@@ -555,17 +519,17 @@ void Strigi::NepomukIndexWriter::addTrip
         return;
     }
 
-    FileMetaData* md = fileDataForResult( d->currentResultStack.top() );
+    //FileMetaData* md = fileDataForResult( d->currentResultStack.top() );
 
-    QUrl subject = md->mapNode( s );
-    Nepomuk::Types::Property property( md->mapNode( p ) );
+    Soprano::Node subject( createNode( s ) );
+    Nepomuk::Types::Property property( QUrl( QString::fromUtf8(p.c_str()) ) ); // Was mapped earlier
     Soprano::Node object;
     if ( property.range().isValid() )
-        object = md->mapNode( o );
+        object = Soprano::Node( createNode( o ) );
     else
         object = Soprano::LiteralValue::fromString( QString::fromUtf8( o.c_str() ), property.literalRangeType().dataTypeUri() );
 
-    d->repository->addStatement( subject, property.uri(), object, md->context );
+    d->feeder->addStatement( subject, property.uri(), object );
 }
 
 
@@ -582,17 +546,17 @@ void Strigi::NepomukIndexWriter::finishA
 
     // store the full text of the file
     if ( md->content.length() > 0 ) {
-        d->repository->addStatement( Statement( md->resourceUri,
+        d->feeder->addStatement( md->resourceUri,
                                                 Nepomuk::Vocabulary::NIE::plainTextContent(),
-                                                LiteralValue( QString::fromUtf8( md->content.c_str() ) ),
-                                                md->context ) );
-        if ( d->repository->lastError() )
-            kDebug() << "Failed to add" << md->resourceUri << "as text" << QString::fromUtf8( md->content.c_str() );
+                                 LiteralValue( QString::fromUtf8( md->content.c_str() ) ) );
     }
 
     // cleanup
     delete md;
     idx->setWriterData( 0 );
+
+    // Handle the feeder
+    d->feeder->end();
 }
 
 
Index: CMakeLists.txt
===================================================================
--- CMakeLists.txt	(revision 1149492)
+++ CMakeLists.txt	(working copy)
@@ -11,6 +11,7 @@ set( strigi_nepomuk_indexer_SRCS
   nepomukindexmanager.cpp
   nepomukindexreader.cpp
   nepomukindexwriter.cpp
+  strigifeeder.cpp
   util.cpp
 )
 
Index: strigifeeder.cpp
===================================================================
--- strigifeeder.cpp	(revision 0)
+++ strigifeeder.cpp	(revision 0)
@@ -0,0 +1,290 @@
+/*
+  Copyright (C) 2010 Vishesh Handa <[email protected]>
+
+  This library is free software; you can redistribute it and/or
+  modify it under the terms of the GNU General Public License as
+  published by the Free Software Foundation; either version 2 of
+  the License, or (at your option) any later version.
+
+  This library is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+  Library General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this library; see the file COPYING.  If not, write to
+  the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
+  Boston, MA 02110-1301, USA.
+*/
+
+
+#include "strigifeeder.h"
+#include "nrl.h"
+#include "util.h"
+
+#include <QtCore/QDateTime>
+
+#include <Soprano/Model>
+#include <Soprano/Statement>
+#include <Soprano/QueryResultIterator>
+#include <Soprano/Vocabulary/RDF>
+#include <Soprano/Vocabulary/NAO>
+
+#include <Nepomuk/ResourceManager>
+#include <Nepomuk/Resource>
+
+#include <KDebug>
+
+
+Nepomuk::StrigiFeeder::StrigiFeeder(Soprano::Model* model, QObject* parent)
+    : QThread( parent ),
+      m_model( model )
+{
+    m_stopped = false;
+}
+
+
+Nepomuk::StrigiFeeder::~StrigiFeeder()
+{
+}
+
+
+void Nepomuk::StrigiFeeder::begin( const QUrl & url )
+{
+    //kDebug() << "BEGINING";
+    Request req;
+    req.uri = url;
+
+    m_stack.push( req );
+}
+
+
+void Nepomuk::StrigiFeeder::addStatement(const Soprano::Statement& st)
+{
+    Q_ASSERT( !m_stack.isEmpty() );
+    Request & req = m_stack.top();
+
+    // Since we are adding them to a set, duplicates are automatically resolved
+    req.statements.insert( st );
+}
+
+
+void Nepomuk::StrigiFeeder::addStatement(const Soprano::Node& subject, const Soprano::Node& predicate, const Soprano::Node& object)
+{
+    addStatement( Soprano::Statement( subject, predicate, object, Soprano::Node() ) );
+}
+
+
+void Nepomuk::StrigiFeeder::end()
+{
+    if( m_stack.isEmpty() )
+        return;
+    //kDebug() << "ENDING";
+
+    Request req = m_stack.pop();
+
+    m_queueMutex.lock();
+    m_queue.enqueue( req );
+
+    m_queueMutex.unlock();
+    m_queueWaiter.wakeAll();
+}
+
+
+void Nepomuk::StrigiFeeder::stop()
+{
+    QMutexLocker lock( &m_queueMutex );
+    m_stopped = true;
+    m_queueWaiter.wakeAll();
+}
+
+namespace {
+
+    struct ResourceStruct {
+        QUrl uri;
+        QMultiHash<QUrl, Soprano::Node> propHash;
+    };
+
+    // Maps the uri (or the blank node identifier) to the ResourceStruct
+    typedef QHash<QUrl, ResourceStruct> ResourceHash;
+
+    ResourceHash convertToResourceHash(const QSet<Soprano::Statement> & set ) {
+        ResourceHash hash;
+
+        foreach( const Soprano::Statement & st, set ) {
+            //kDebug() << st;
+            const Soprano::Node & n = st.subject();
+            QUrl uriOrId;
+            if( n.isResource() )
+                uriOrId = n.uri();
+            else if( n.isBlank() )
+                uriOrId = n.identifier();
+
+            if( !hash.contains( uriOrId ) ) {
+                ResourceStruct rs;
+                if( n.isResource() )
+                    rs.uri = n.uri();
+
+                hash.insert( uriOrId, rs );
+            }
+
+            ResourceStruct & rs = hash[ uriOrId ];
+            rs.propHash.insert( st.predicate().uri(), st.object() );
+        }
+        return hash;
+    }
+
+    /**
+     * Creates a sparql query which returns 1 resource which matches all the properties,
+     * and objects present in the propHash of the ResourceStruct
+     */
+    QString toSparql( const ResourceStruct & rs ) {
+        QString query = QString::fromLatin1("select distinct ?r where { ");
+
+        QList<QUrl> keys = rs.propHash.uniqueKeys();
+        foreach( const QUrl & prop, keys ) {
+            const QList<Soprano::Node>& values = rs.propHash.values( prop );
+
+            foreach( const Soprano::Node & n, values ) {
+                query += " ?r " + Soprano::Node::resourceToN3( prop ) + " " + n.toN3() + " . ";
+            }
+        }
+        query += " } LIMIT 1";
+        return query;
+    }
+
+    /**
+     * Adds all the statements present in the ResourceStruct to the \p model.
+     * The context is \p context
+     */
+    void add( Soprano::Model * model, const ResourceStruct &rs, const QUrl & context ) {
+        QHashIterator<QUrl, Soprano::Node> iter( rs.propHash );
+        while( iter.hasNext() ) {
+            iter.next();
+
+            Soprano::Statement st( rs.uri, iter.key(), iter.value(), context );
+            //kDebug() << "ADDING : " << st;
+            model->addStatement( st );
+        }
+    }
+}
+
+//BUG: When indexing a file, there is one main uri ( in Request ) and other additional uris.
+//     If there is a statement connecting the main uri with one of the additional ones, it will
+//     be resolved correctly, but not if one of the additional ones links to another additional one.
+void Nepomuk::StrigiFeeder::run()
+{
+    // m_stopped is set to false in the constructor; resetting it here could lose an early stop()
+    while( !m_stopped ) {
+
+        // lock for initial iteration
+        m_queueMutex.lock();
+
+        // work the queue
+        while( !m_queue.isEmpty() ) {
+            Request request = m_queue.dequeue();
+
+            // unlock after queue utilization
+            m_queueMutex.unlock();
+
+            //kDebug() << " Converting to ResourceHash ..";
+            // Convert to Resource Hash
+            ResourceHash hash = convertToResourceHash( request.statements );
+
+            // Search for the resources or create them
+            //kDebug() << " Searching for duplicates or creating them ... ";
+            QMutableHashIterator<QUrl, ResourceStruct> it( hash );
+            while( it.hasNext() ) {
+                it.next();
+
+                // If it already exists
+                ResourceStruct & rs = it.value();
+                if( !rs.uri.isEmpty() )
+                    continue;
+
+                QString query = toSparql( rs );
+                //kDebug() << query;
+                Soprano::QueryResultIterator result = m_model->executeQuery( query, Soprano::Query::QueryLanguageSparql );
+
+                if( result.next() ) {
+                    //kDebug() << "Found exact match " << rs.uri << " " << result[0].uri();
+                    rs.uri = result[0].uri();
+                }
+                else {
+                    //kDebug() << "Creating ..";
+                    rs.uri = ResourceManager::instance()->generateUniqueUri( QString() );
+
+                    // Add to the repository
+                    QUrl context = generateGraph( rs.uri );
+                    add( m_model, rs, context );
+                }
+            }
+
+            // Fix links for main
+            ResourceStruct & rs = hash[ request.uri ];
+            QMutableHashIterator<QUrl, Soprano::Node> iter( rs.propHash );
+            while( iter.hasNext() ) {
+                iter.next();
+                Soprano::Node & n = iter.value();
+
+                if( n.isEmpty() )
+                    continue;
+
+                if( n.isBlank() ) {
+                    const QString & id = n.identifier();
+                    if( !hash.contains( id ) )
+                        continue;
+                    QUrl newUri = hash.value( id ).uri;
+                    //kDebug() << id << " ---> " << newUri;
+                    iter.value() = Soprano::Node( newUri );
+                }
+            }
+
+            // Add main file to the repository
+            QUrl context = generateGraph( rs.uri );
+            add( m_model, rs, context );
+
+            // lock for next iteration
+            m_queueMutex.lock();
+        }
+
+        // wait for more input (skipped if stop() was called while the queue was being worked)
+        kDebug() << "Waiting...";
+        if( !m_stopped ) m_queueWaiter.wait( &m_queueMutex );
+        m_queueMutex.unlock();
+        kDebug() << "Woke up.";
+
+    }
+}
+
+
+QUrl Nepomuk::StrigiFeeder::generateGraph( const QUrl & resourceUri )
+{
+    QUrl context = Nepomuk::ResourceManager::instance()->generateUniqueUri( "ctx" );
+
+    // create the provenance data for the data graph
+    // TODO: add more data at some point when it becomes of interest
+    QUrl metaDataContext = Nepomuk::ResourceManager::instance()->generateUniqueUri( "ctx" );
+    m_model->addStatement( context,
+                           Soprano::Vocabulary::RDF::type(),
+                           Nepomuk::Vocabulary::NRL::DiscardableInstanceBase(),
+                           metaDataContext );
+    m_model->addStatement( context,
+                           Soprano::Vocabulary::NAO::created(),
+                           Soprano::LiteralValue( QDateTime::currentDateTime() ),
+                           metaDataContext );
+    m_model->addStatement( context,
+                           Strigi::Ontology::indexGraphFor(),
+                           resourceUri,
+                           metaDataContext );
+    m_model->addStatement( metaDataContext,
+                           Soprano::Vocabulary::RDF::type(),
+                           Nepomuk::Vocabulary::NRL::GraphMetadata(),
+                           metaDataContext );
+    m_model->addStatement( metaDataContext,
+                           Nepomuk::Vocabulary::NRL::coreGraphMetadataFor(),
+                           context,
+                           metaDataContext );
+
+    return context;
+}

_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
