Sure, I'll take a look at that as well. Interesting!
Regarding SARQ, I just tried it once. The errors were related to the cleanup
of Solr indexes during the tests. Here are the details:
INFO [33039485@qtp-1012673-2] (SolrCore.java:1324) - [sarq] webapp=/solr path=/update params={wt=javabin&version=1} status=500 QTime=5
ERROR [33039485@qtp-1012673-2] (SolrException.java:139) - java.io.IOException: Cannot delete .\solr\sarq\data\index\lucene-d12b45df2c6d6ae2efebf4cb75b8da25-write.lock
    at org.apache.lucene.store.NativeFSLockFactory.clearLock(NativeFSLockFactory.java:143)
    at org.apache.lucene.store.Directory.clearLock(Directory.java:141)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1541)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1402)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
    at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
    at org.apache.solr.update.DirectUpdateHandler2.deleteAll(DirectUpdateHandler2.java:167)
    at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery(DirectUpdateHandler2.java:323)
    at org.apache.solr.update.processor.RunUpdateProcessor.processDelete(RunUpdateProcessorFactory.java:71)
    at org.apache.solr.handler.XMLLoader.processDelete(XMLLoader.java:234)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:180)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:440)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:943)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
ERROR [Finalizer] (SolrIndexWriter.java:242) - SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
INFO [33039485@qtp-1012673-2] (DirectUpdateHandler2.java:165) - [sarq] REMOVING ALL DOCUMENTS FROM INDEX
INFO [33039485@qtp-1012673-2] (LogUpdateProcessorFactory.java:171) - {} 0 5
ERROR [33039485@qtp-1012673-2] (SolrException.java:139) - java.io.IOException: Cannot delete .\solr\sarq\data\index\lucene-d12b45df2c6d6ae2efebf4cb75b8da25-write.lock
    [stack trace identical to the first error above]
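
From the finalizer warning it looks like an IndexWriter is being left open,
which is why Windows cannot delete the write.lock file. The proper fix is to
close the writer in the test teardown, but as a workaround the stale lock
could be force-released first. A minimal, untested sketch, assuming the
Lucene 2.9-era static helpers (the index path is the one from the log):

    import java.io.File;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ForceUnlock {
        public static void main(String[] args) throws Exception {
            // Open the index directory reported in the log
            Directory dir = FSDirectory.open(new File("solr/sarq/data/index"));
            try {
                // If a previous writer died without close(), release its lock
                if (IndexWriter.isLocked(dir)) {
                    IndexWriter.unlock(dir);
                }
            } finally {
                dir.close();
            }
        }
    }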
I can take a look at it, but first I need to understand the integration
point.
Thanks,
Anuj
On Thu, Mar 17, 2011 at 1:34 PM, Paolo Castagna <[email protected]> wrote:
> About the tests failing: strange... I don't see any failures:
> Tests run: 41, Failures: 0, Errors: 0, Skipped: 0
> Share the details of your failures; I might have a look (but not today).
>
> If you are keen, you can look at EARQ as well, which is not just about
> ElasticSearch. It was done to experiment with a refactoring that made it
> easier to plug in different indexes... and indeed EARQ has Lucene, Solr
> and ElasticSearch in it:
> https://github.com/castagna/EARQ
>
> Paolo
>
> Anuj Kumar wrote:
>
>> Sure, I will let you know in case I have any queries. The tests were
>> failing when I built SARQ on my machine, but I will look into that later.
>> As you mentioned, it is really good to understand the integration with
>> LARQ as a reference, so I am doing that.
>>
>> Thanks for the info.
>>
>> - Anuj
>>
>> On Thu, Mar 17, 2011 at 1:14 PM, Paolo Castagna
>> <[email protected]> wrote:
>>
>>
>>
>> Anuj Kumar wrote:
>>
>> Thanks Paolo. I am looking into LARQ and also SARQ.
>>
>>
>> Be warned: SARQ is just an experiment (and currently unsupported).
>> However, if you prefer to use Solr, share with us your use case and
>> your reasons, and let me know if you have problems with it.
>>
>> SARQ might be a little behind with respect to removals from the index,
>> but you can look at what LARQ does and port the same approach into SARQ.
>>
>> Paolo
>>
>>
>>
>> On Thu, Mar 17, 2011 at 12:18 AM, Paolo Castagna
>> <[email protected]> wrote:
>>
>>
>> Anuj Kumar wrote:
>>
>> Hi Andy,
>>
>> I have loaded a few N-Triples files into TDB in offline mode using
>> tdbloader. Loading as well as querying is fast, but if I try to use a
>> regex it becomes very slow, taking a few minutes. On my 32-bit machine
>> it takes more than 10 minutes (expected, due to limited memory, ~1.5GB)
>> and on my 64-bit machine (8GB) it takes around 5 minutes.
>>
>> The query is pretty exhaustive; correct me if the slowness is due to
>> the filter:
>>
>> SELECT ?abstract
>> WHERE {
>>   ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
>>   FILTER regex(?l, "Futurama", "i") .
>>   ?resource <http://dbpedia.org/ontology/abstract> ?abstract
>> }
>>
>> I have loaded a few abstracts from the DBpedia dump and I am trying to
>> get the abstracts from the labels. This is very slow. If I remove the
>> FILTER and give the exact label, it is fast (presumably because of TDB
>> indexing).
>>
>> What is the right way to do such a regex or text search over the graph?
>> I have seen suggestions to use Lucene, and I also saw the LARQ
>> initiative. Is that the right way to go?
>>
>> Yes, using LARQ (which is included in ARQ) will greatly speed up your
>> query. The LARQ documentation is here:
>> http://jena.sourceforge.net/ARQ/lucene-arq.html
>> You will need to build the Lucene index first, though.
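>>
>> For example, a minimal sketch based on the LARQ documentation (untested
>> here, assuming the com.hp.hpl.jena.query.larq API; "model" is your
>> TDB-backed model):
>>
>>   import com.hp.hpl.jena.query.larq.IndexBuilderString;
>>   import com.hp.hpl.jena.query.larq.IndexLARQ;
>>   import com.hp.hpl.jena.query.larq.LARQ;
>>
>>   // Index all string literals in the model
>>   IndexBuilderString larqBuilder = new IndexBuilderString();
>>   larqBuilder.indexStatements(model.listStatements());
>>   larqBuilder.closeWriter();
>>   IndexLARQ index = larqBuilder.getIndex();
>>   // Make the index available to query execution
>>   LARQ.setDefaultIndex(index);
>>
>> and then replace the FILTER regex with pf:textMatch:
>>
>>   PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>   SELECT ?abstract WHERE {
>>     ?l pf:textMatch 'Futurama' .
>>     ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
>>     ?resource <http://dbpedia.org/ontology/abstract> ?abstract
>>   }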
>>
>> Paolo
>>
>>
>>
>> Thanks,
>> Anuj
>>
>> On Tue, Mar 15, 2011 at 5:09 PM, Andy Seaborne
>> <[email protected]> wrote:
>>
>> Just so you know: The TDB bulkloader can load all the data offline -
>> it's faster than using Fuseki for data loading online.
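>>
>> For example (the database location and file name are illustrative):
>>
>>   tdbloader --loc=/path/to/tdb-db dbpedia-dump.nt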
>>
>> Andy
>>
>>
>> On 15/03/11 11:22, Anuj Kumar wrote:
>>
>> Hi Andy,
>>
>> Thanks for the info. I have loaded a few GBs using the Fuseki server,
>> but I didn't try RiotReader or the Java APIs for TDB. I will try that.
>> Thanks for the response.
>>
>> Regards,
>> Anuj
>>
>> On Tue, Mar 15, 2011 at 4:12 PM, Andy Seaborne
>> <[email protected]> wrote:
>>
>>
>> 1/ Have you considered reading the DBpedia data into TDB? This would
>> keep the triples on-disk (and have cached in-memory versions of a
>> subset).
>>
>> 2/ A file can be read sequentially by using the parser directly (see
>> RiotReader and pass in a Sink<Triple> that processes the stream of
>> triples).
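>>
>> A rough sketch of 2/ (untested, and do check the exact RiotReader
>> entry point in your version of RIOT; the file name is illustrative):
>>
>>   import org.openjena.atlas.lib.Sink;
>>   import org.openjena.riot.RiotReader;
>>   import com.hp.hpl.jena.graph.Triple;
>>
>>   Sink<Triple> sink = new Sink<Triple>() {
>>       public void send(Triple t) {
>>           // process one triple at a time; nothing is held in memory
>>       }
>>       public void flush() {}
>>       public void close() {}
>>   };
>>   RiotReader.parseTriples("dbpedia-dump.nt", sink);
>>   sink.close();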
>>
>> Andy
>>
>>
>> On 14/03/11 18:42, Anuj Kumar wrote:
>>
>> Hi All,
>>
>> I am new to Jena and trying to explore it to work with a large number
>> of N-Triples. The requirement is to read a large number of N-Triples;
>> for example, an .nt file from the DBpedia dump that may run into GBs.
>> I have to read these triples, pick specific ones, and link them to the
>> resources of another set of triples. The goal is to link some of the
>> entities based on the Linked Data concept. Once the mapping is done, I
>> have to query the model from that point onwards. I don't want to work
>> by loading both the source and target datasets in memory.
>>
>> To achieve this, I have first created a file model maker and then a
>> named model for the specific dataset being mapped. Now I need to read
>> the triples and add the mapping to this new model. What would be the
>> right approach?
>>
>> One way is to load the model using FileManager (sketched below),
>> iterate through the statements, map them accordingly to the named
>> model (i.e. our mapped model), and close it at the end. This will
>> work, but it will load all of the triples in memory. Is this the right
>> way to proceed, or is there a way to read the model sequentially at
>> the time of mapping?
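>>
>> To make that concrete, here is a sketch of the FileManager approach I
>> mean (the file name is illustrative, and mappedModel is the named
>> model created above):
>>
>>   import com.hp.hpl.jena.rdf.model.Model;
>>   import com.hp.hpl.jena.rdf.model.StmtIterator;
>>   import com.hp.hpl.jena.util.FileManager;
>>
>>   // Loads the whole file into memory -- this is what I want to avoid
>>   Model source = FileManager.get().loadModel("dbpedia-dump.nt");
>>   StmtIterator it = source.listStatements();
>>   while (it.hasNext()) {
>>       // pick the statements of interest and copy them across
>>       mappedModel.add(it.nextStatement());
>>   }
>>   source.close();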
>>
>> I am just trying to understand the efficient way to map a large set of
>> N-Triples. I need your suggestions.
>>
>> Thanks,
>> Anuj
>>
>>
>>
>>
>>
>>
>>