About the tests failing: strange... I don't see any failures:
Tests run: 41, Failures: 0, Errors: 0, Skipped: 0
Share details on your failures, I might have a look (but not today).
If you are keen, you can look at EARQ as well, which is not just about
ElasticSearch. It was done to experiment with a refactoring which made it
easier to plug in different indexes (and indeed EARQ has Lucene, Solr and
ElasticSearch in it):
https://github.com/castagna/EARQ
Paolo
Anuj Kumar wrote:
Sure, I will let you know if I have any queries. The tests were failing
when I built SARQ on my machine, but I will look into it later.
As you mentioned, it is really good to understand the integration with
LARQ as a reference, so I am doing that.
Thanks for the info.
- Anuj
On Thu, Mar 17, 2011 at 1:14 PM, Paolo Castagna <[email protected]> wrote:
Anuj Kumar wrote:
Thanks Paolo. I am looking into LARQ and also SARQ.
Be warned: SARQ is just an experiment (and currently unsupported).
However, if you prefer to use Solr, share with us your use case and your
reasons, and let me know if you have problems with it.
SARQ might be a little bit behind in relation to removals from the index,
but you can look at what LARQ does and port the same approach into SARQ.
Paolo
On Thu, Mar 17, 2011 at 12:18 AM, Paolo Castagna <[email protected]> wrote:
Anuj Kumar wrote:
Hi Andy,
I have loaded a few N-Triples files into TDB in offline mode using
tdbloader. Loading as well as querying is fast, but if I try to use a
regex it becomes very slow, taking a few minutes. On my 32-bit machine it
takes more than 10 mins (expected, due to limited memory ~1.5GB) and on my
64-bit machine (8GB) it takes around 5 mins.
The query is pretty exhaustive; correct me if it is happening due to the
filter:
SELECT ?abstract
WHERE {
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
  FILTER regex(?l, "Futurama", "i") .
  ?resource <http://dbpedia.org/ontology/abstract> ?abstract
}
I have loaded a few abstracts from the DBpedia dump and I am trying to
get the abstracts from the label. This is very slow. If I remove the
FILTER and give the exact label, it is fast (presumably because of TDB
indexing).
What is the right way to do such a regex search or text search over the
graph? I have seen suggestions to use Lucene and I also saw the LARQ
initiative. Is that the right way to go?
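For reference, the fast exact-label variant I mean looks like the query
below (the @en language tag is an assumption; DBpedia labels are usually
language-tagged, so a plain literal may not match):

SELECT ?abstract
WHERE {
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> "Futurama"@en .
  ?resource <http://dbpedia.org/ontology/abstract> ?abstract
}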
Yes, using LARQ (which is included in ARQ) will greatly speed up your
query. LARQ documentation is here:
http://jena.sourceforge.net/ARQ/lucene-arq.html
You will need to build the Lucene index first, though.
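A minimal sketch of the two steps, following the LARQ documentation above
(the package and class names are from the ARQ of that era; treat the
exact details as an assumption and check the page above):

import com.hp.hpl.jena.query.larq.IndexBuilderString;
import com.hp.hpl.jena.query.larq.IndexLARQ;
import com.hp.hpl.jena.query.larq.LARQ;

// model is your existing (e.g. TDB-backed) model
IndexBuilderString larqBuilder = new IndexBuilderString();
larqBuilder.indexStatements(model.listStatements()); // index string literals
larqBuilder.closeWriter();
IndexLARQ index = larqBuilder.getIndex();
LARQ.setDefaultIndex(index); // make the index visible to query execution

The slow FILTER regex query can then be rewritten to use the pf:textMatch
property function, which consults the Lucene index instead of scanning
every label:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT ?abstract
WHERE {
  ?l pf:textMatch "Futurama" .
  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
  ?resource <http://dbpedia.org/ontology/abstract> ?abstract
}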
Paolo
Thanks,
Anuj
On Tue, Mar 15, 2011 at 5:09 PM, Andy Seaborne <[email protected]> wrote:
Just so you know: the TDB bulkloader can load all the data offline - it's
faster than using Fuseki for data loading online.
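For example (the paths are illustrative; check the TDB documentation for
the exact options):

tdbloader --loc=/path/to/DB dbpedia.nt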
Andy
On 15/03/11 11:22, Anuj Kumar wrote:
Hi Andy,
Thanks for the info. I have loaded a few GBs using Fuseki Server but I
didn't try RiotReader or the Java APIs for TDB. Will try that.
Thanks for the response.
Regards,
Anuj
On Tue, Mar 15, 2011 at 4:12 PM, Andy Seaborne <[email protected]> wrote:
1/ Have you considered reading the DBpedia data into TDB? This would keep
the triples on-disk (and have cached in-memory versions of a subset).
2/ A file can be read sequentially by using the parser directly (see
RiotReader and pass in a Sink<Triple> that processes the stream of
triples); there is a sketch combining the two below.
Andy
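A rough sketch combining the two suggestions: stream an N-Triples file
with RiotReader and copy just the triples you care about into a
TDB-backed model. The file and directory names, the filter, and the exact
RiotReader.parseTriples signature are assumptions based on the RIOT/TDB
APIs of that era:

import org.openjena.atlas.lib.Sink;
import org.openjena.riot.RiotReader;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;

public class StreamIntoTDB {
    public static void main(String[] args) {
        // TDB keeps the triples on disk; "DB" is an illustrative directory
        final Model model = TDBFactory.createModel("DB");

        Sink<Triple> sink = new Sink<Triple>() {
            public void send(Triple triple) {
                // inspect each triple as it streams past and keep only the
                // ones to map; this label-only filter is just an example
                if (triple.getPredicate().isURI()
                        && triple.getPredicate().getURI().equals(
                               "http://www.w3.org/2000/01/rdf-schema#label")) {
                    model.getGraph().add(triple);
                }
            }
            public void flush() {}
            public void close() {}
        };

        // parse sequentially; the whole file is never held in memory
        RiotReader.parseTriples("dbpedia.nt", sink);
        sink.close();
        model.close();
    }
}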
On 14/03/11 18:42, Anuj Kumar wrote:
Hi All,
I am new to Jena and trying to explore it to work with a large number of
N-Triples. The requirement is to read a large number of N-Triples, for
example an .nt file from the DBpedia dump that may run into GBs. I have
to read these triples, pick specific ones and further link them to the
resources of another set of triples. The goal is to link some of the
entities based on the Linked Data concept. Once the mapping is done, I
have to query the model from that point onwards. I don't want to work by
loading both the source and target datasets in-memory.
To achieve this, I have first created a file model maker and then a named
model for the specific dataset being mapped. Now, I need to read the
triples and add the mapping to this new model. What should be the right
approach?
One way is to load the model using FileManager, iterate through the
statements and map them accordingly to the named model (i.e. our mapped
model), and at the end close it. This will work, but it will load all of
the triples in memory. Is this the right way to proceed, or is there a
way to read the model sequentially at the time of mapping?
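For concreteness, the in-memory approach I describe would look roughly
like this (the file name and the mappedModel variable are illustrative;
as noted, loadModel pulls the whole file into memory):

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.util.FileManager;

// loads the entire file into an in-memory model
Model source = FileManager.get().loadModel("dbpedia.nt");
StmtIterator it = source.listStatements();
while (it.hasNext()) {
    Statement stmt = it.nextStatement();
    // decide whether to keep this statement, then map it into the
    // named model created with the model maker
    mappedModel.add(stmt);
}
it.close();
source.close();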
Just trying to understand the efficient way to map a large set of
N-Triples. Need your suggestions.
Thanks,
Anuj