Anuj Kumar wrote:
Thanks Paolo. I am looking into LARQ and also SARQ.
Be warned: SARQ is just an experiment (and currently unsupported).
However, if you prefer to use Solr, share your use case and your reasons with us, and let me know if you have problems with it.
SARQ might be a little behind when it comes to removals from the index, but you can look at what LARQ does and port the same approach into SARQ.
Paolo
On Thu, Mar 17, 2011 at 12:18 AM, Paolo Castagna <[email protected]> wrote:
Anuj Kumar wrote:
Hi Andy,
I have loaded a few N-Triples into TDB in offline mode using tdbloader. Loading and querying are fast, but if I try to use a regex it becomes very slow, taking a few minutes. On my 32-bit machine it takes more than 10 minutes (expected, given the limited memory of ~1.5 GB) and on my 64-bit machine (8 GB) it takes around 5 minutes.
The query is pretty exhaustive; correct me if this is happening because of the filter:
SELECT ?abstract
WHERE {
?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
FILTER regex(?l, "Futurama", "i") .
?resource <http://dbpedia.org/ontology/abstract> ?abstract
}
I have loaded a few abstracts from the DBpedia dump and I am trying to get the abstracts from the label. This is very slow. If I remove the FILTER and give the exact label, it is fast (presumably because of TDB indexing).
What is the right way to do such a regex or text search over the graph? I have seen suggestions to use Lucene and I also saw the LARQ initiative. Is that the right way to go?
Yes, using LARQ (which is included in ARQ) will greatly speed up your query. The LARQ documentation is here:
http://jena.sourceforge.net/ARQ/lucene-arq.html
You will need to build the Lucene index first, though.
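For concreteness, a rough sketch of that workflow: build a Lucene index over the literals in the model, register it as the default LARQ index, and query with the pf:textMatch property function instead of FILTER regex. The class and package names below are the LARQ ones bundled with ARQ at the time of writing, and the TDB directory is a placeholder, so adjust as needed.

import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.query.larq.IndexBuilderString;
import com.hp.hpl.jena.query.larq.IndexLARQ;
import com.hp.hpl.jena.query.larq.LARQ;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;

public class LarqTextSearch {
    public static void main(String[] args) {
        // The TDB-backed model holding the DBpedia triples (directory is a placeholder)
        Model model = TDBFactory.createDataset("/data/tdb-dbpedia").getDefaultModel();

        // Build an in-memory Lucene index over the literals already in the model
        IndexBuilderString larqBuilder = new IndexBuilderString();
        larqBuilder.indexStatements(model.listStatements());
        larqBuilder.closeWriter();
        IndexLARQ index = larqBuilder.getIndex();
        LARQ.setDefaultIndex(index);   // make it the index pf:textMatch will use

        // Free-text lookup via pf:textMatch instead of FILTER regex
        String query =
            "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#> " +
            "SELECT ?abstract WHERE { " +
            "  ?l pf:textMatch 'Futurama' . " +
            "  ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l . " +
            "  ?resource <http://dbpedia.org/ontology/abstract> ?abstract }";
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
        }
    }
}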
Paolo
Thanks,
Anuj
On Tue, Mar 15, 2011 at 5:09 PM, Andy Seaborne <[email protected]> wrote:
Just so you know: the TDB bulk loader can load all the data offline; it's faster than loading data online through Fuseki.
Andy
On 15/03/11 11:22, Anuj Kumar wrote:
Hi Andy,
Thanks for the info. I have loaded a few GBs using the Fuseki server, but I didn't try RiotReader or the Java APIs for TDB. I will try that.
Thanks for the response.
Regards,
Anuj
On Tue, Mar 15, 2011 at 4:12 PM, Andy Seaborne <[email protected]> wrote:
1/ Have you considered reading the DBpedia data into TDB? This would keep the triples on disk (and have cached in-memory versions of a subset).
2/ A file can be read sequentially by using the parser directly (see RiotReader and pass in a Sink<Triple> that processes the stream of triples).
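For illustration, a minimal sketch combining the two suggestions: stream-parse the N-Triples file with RiotReader and push each triple through a Sink into an on-disk TDB dataset. The package and method names (org.openjena.riot.RiotReader, org.openjena.atlas.lib.Sink) are from the RIOT/TDB releases of this era and may differ elsewhere; the directory and file names are placeholders.

import org.openjena.atlas.lib.Sink;
import org.openjena.riot.RiotReader;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;

public class StreamIntoTDB {
    public static void main(String[] args) {
        // 1/ an on-disk TDB dataset: the triples live on disk, not in memory
        Dataset dataset = TDBFactory.createDataset("/data/tdb-dbpedia");
        final Model target = dataset.getDefaultModel();

        // 2/ parse the N-Triples file sequentially; each triple is pushed into the sink
        Sink<Triple> sink = new Sink<Triple>() {
            public void send(Triple t) {
                // filter / remap the triples you care about here before adding them
                target.getGraph().add(t);
            }
            public void flush() {}
            public void close() {}
        };
        RiotReader.parseTriples("dbpedia-abstracts.nt", sink);
        sink.close();
        dataset.close();
    }
}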
Andy
On 14/03/11 18:42, Anuj Kumar wrote:
Hi All,
I am new to Jena and am exploring it to work with a large number of N-Triples. The requirement is to read a large number of N-Triples, for example an .nt file from the DBpedia dump that may run into GBs. I have to read these triples, pick specific ones and link them to the resources of another set of triples. The goal is to link some of the entities based on the Linked Data concept. Once the mapping is done, I have to query the model from that point onwards. I don't want to work by loading both the source and target datasets in memory.
To achieve this, I have first created a file model maker and then a named model for the specific dataset being mapped. Now I need to read the triples and add the mapping to this new model. What is the right approach?
One way is to load the model using FileManager, iterate through the statements, map them accordingly to the named model (i.e. our mapped model) and close it at the end. This will work, but it will load all of the triples into memory. Is this the right way to proceed, or is there a way to read the model sequentially at the time of mapping?
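For concreteness, the in-memory route I mean is roughly the following; the model maker directory, file name and filtering step are placeholders.

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;

public class MapWithFileManager {
    public static void main(String[] args) {
        ModelMaker maker = ModelFactory.createFileModelMaker("/data/models");
        Model mapped = maker.createModel("mapped");

        // loads the whole N-Triples file into memory
        Model source = FileManager.get().loadModel("dbpedia-dump.nt");

        StmtIterator it = source.listStatements();
        while (it.hasNext()) {
            Statement s = it.nextStatement();
            // pick the statements to map, then add them (or remapped versions) to the named model
            mapped.add(s);
        }
        mapped.close();
        source.close();
    }
}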
Just trying to understand an efficient way to map a large set of N-Triples. I need your suggestions.
Thanks,
Anuj