Sure, I will let you know if I have any queries. The tests were failing when I built SARQ on my machine, but I will look into that later. As you suggested, it is really useful to understand the LARQ integration as a reference, so that is what I am doing.
Thanks for the info.

- Anuj

On Thu, Mar 17, 2011 at 1:14 PM, Paolo Castagna <[email protected]> wrote:

> Anuj Kumar wrote:
>> Thanks Paolo. I am looking into LARQ and also SARQ.
>
> Be warned: SARQ is just an experiment (and currently unsupported).
> However, if you prefer to use Solr, share with us your use case and your
> reasons, and let me know if you have problems with it.
>
> SARQ might be a little bit behind in relation to removals from the index,
> but you can look at what LARQ does and port the same approach into SARQ.
>
> Paolo
>
>> On Thu, Mar 17, 2011 at 12:18 AM, Paolo Castagna <[email protected]> wrote:
>>
>>> Anuj Kumar wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> I have loaded a few N-Triples files into TDB in offline mode using
>>>> tdbloader. Loading as well as querying is fast, but if I use a regex
>>>> it becomes very slow, taking several minutes. On my 32-bit machine it
>>>> takes more than 10 minutes (expected, due to limited memory, ~1.5 GB),
>>>> and on my 64-bit machine (8 GB) it takes around 5 minutes.
>>>>
>>>> The query is pretty exhaustive; correct me if this is happening
>>>> because of the filter:
>>>>
>>>> SELECT ?abstract
>>>> WHERE {
>>>>   ?resource <http://www.w3.org/2000/01/rdf-schema#label> ?l .
>>>>   FILTER regex(?l, "Futurama", "i") .
>>>>   ?resource <http://dbpedia.org/ontology/abstract> ?abstract
>>>> }
>>>>
>>>> I have loaded a few abstracts from the DBpedia dump and I am trying to
>>>> get the abstracts from the label. This is very slow. If I remove the
>>>> FILTER and give the exact label, it is fast (presumably because of
>>>> TDB's indexing).
>>>>
>>>> What is the right way to do such a regex or text search over the
>>>> graph? I have seen suggestions to use Lucene, and I also saw the LARQ
>>>> initiative. Is that the right way to go?
>>>>
>>> Yes, using LARQ (which is included in ARQ) will greatly speed up your
>>> query. LARQ documentation is here:
>>> http://jena.sourceforge.net/ARQ/lucene-arq.html
>>> You will need to build the Lucene index first, though.
>>>
>>> Paolo
>>>
>>>> Thanks,
>>>> Anuj
>>>>
>>>> On Tue, Mar 15, 2011 at 5:09 PM, Andy Seaborne <[email protected]> wrote:
>>>>
>>>>> Just so you know: the TDB bulk loader can load all the data offline -
>>>>> it's faster than using Fuseki for data loading online.
>>>>>
>>>>> Andy
>>>>>
>>>>> On 15/03/11 11:22, Anuj Kumar wrote:
>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> Thanks for the info. I have loaded a few GBs using the Fuseki server,
>>>>>> but I didn't try RiotReader or the Java APIs for TDB. Will try that.
>>>>>> Thanks for the response.
>>>>>>
>>>>>> Regards,
>>>>>> Anuj
>>>>>>
>>>>>> On Tue, Mar 15, 2011 at 4:12 PM, Andy Seaborne <[email protected]> wrote:
>>>>>>
>>>>>>> 1/ Have you considered reading the DBpedia data into TDB? This
>>>>>>> would keep the triples on disk (with cached in-memory versions of
>>>>>>> a subset).
>>>>>>>
>>>>>>> 2/ A file can be read sequentially by using the parser directly
>>>>>>> (see RiotReader and pass in a Sink<Triple> that processes the
>>>>>>> stream of triples).
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 14/03/11 18:42, Anuj Kumar wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I am new to Jena and am exploring it to work with large numbers of
>>>>>>>> N-Triples, for example an .nt file from the DBpedia dump that may
>>>>>>>> run into GBs. I have to read these triples, pick specific ones,
>>>>>>>> and link them to resources in another set of triples. The goal is
>>>>>>>> to link some of the entities following the Linked Data approach.
>>>>>>>> Once the mapping is done, I have to query the model from that
>>>>>>>> point onwards. I don't want to load both the source and target
>>>>>>>> datasets in memory.
>>>>>>>>
>>>>>>>> To achieve this, I have first created a file model maker and then
>>>>>>>> a named model for the specific dataset being mapped. Now I need to
>>>>>>>> read the triples and add the mapping to this new model. What
>>>>>>>> should the right approach be?
>>>>>>>>
>>>>>>>> One way is to load the model using FileManager, iterate through
>>>>>>>> the statements, map them accordingly to the named model (i.e. our
>>>>>>>> mapped model), and close it at the end. This will work, but it
>>>>>>>> will load all of the triples in memory. Is this the right way to
>>>>>>>> proceed, or is there a way to read the model sequentially at the
>>>>>>>> time of mapping?
>>>>>>>>
>>>>>>>> Just trying to understand the efficient way to map a large set of
>>>>>>>> N-Triples. Need your suggestions.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anuj
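For the archives, the LARQ setup Paolo points to looks roughly as follows. This is a minimal sketch against the ARQ 2.8.x API (com.hp.hpl.jena.query.larq, as described on the LARQ page linked above); the class name, the data file name, and the in-memory model are illustrative, not taken from the thread:

    import com.hp.hpl.jena.query.* ;
    import com.hp.hpl.jena.query.larq.IndexBuilderString ;
    import com.hp.hpl.jena.query.larq.IndexLARQ ;
    import com.hp.hpl.jena.query.larq.LARQ ;
    import com.hp.hpl.jena.rdf.model.Model ;
    import com.hp.hpl.jena.rdf.model.ModelFactory ;
    import com.hp.hpl.jena.util.FileManager ;

    public class LarqExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel() ;

            // Index every string literal while the data is being read in.
            IndexBuilderString larqBuilder = new IndexBuilderString() ;
            model.register(larqBuilder) ;
            FileManager.get().readModel(model, "dbpedia-abstracts.nt") ;
            larqBuilder.closeWriter() ;
            model.unregister(larqBuilder) ;

            // Make the index available to ARQ query execution.
            IndexLARQ index = larqBuilder.getIndex() ;
            LARQ.setDefaultIndex(index) ;

            // pf:textMatch asks Lucene for the matching literals instead
            // of scanning every label with a regex.
            String q =
                "PREFIX pf:   <http://jena.hpl.hp.com/ARQ/property#>\n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "SELECT ?abstract WHERE {\n" +
                "  ?l pf:textMatch 'Futurama' .\n" +
                "  ?resource rdfs:label ?l .\n" +
                "  ?resource <http://dbpedia.org/ontology/abstract> ?abstract\n" +
                "}" ;
            QueryExecution qexec = QueryExecutionFactory.create(q, model) ;
            try {
                ResultSetFormatter.out(qexec.execSelect()) ;
            } finally {
                qexec.close() ;
            }
            index.close() ;
        }
    }

The speed difference discussed in the thread comes from exactly this rewrite: FILTER regex must test every ?l binding, while pf:textMatch binds ?l from the Lucene index directly. For TDB-backed data the same pattern should apply, with the index built over the TDB-backed model rather than an in-memory one.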
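The offline bulk load Andy mentions is a single command from the TDB distribution; the database location and file name below are made up for the example:

    tdbloader --loc /data/tdb dbpedia-abstracts.nt

Fuseki can then be started against the same database directory, so the loading happens offline and only querying goes through the server.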

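On the streaming side, Andy's suggestion comes down to pushing each parsed triple through a Sink<Triple> as the file is read. A sketch, assuming the RIOT API of that era (org.openjena.riot.RiotReader and org.openjena.atlas.lib.Sink; the predicate filter, class name, and file name are only illustrative):

    import org.openjena.atlas.lib.Sink ;
    import org.openjena.riot.RiotReader ;

    import com.hp.hpl.jena.graph.Triple ;
    import com.hp.hpl.jena.rdf.model.Model ;
    import com.hp.hpl.jena.rdf.model.ModelFactory ;

    public class StreamNTriples {
        public static void main(String[] args) {
            // Target model for the mapped triples; with a file model maker
            // this would be the named model from the original question.
            final Model mapped = ModelFactory.createDefaultModel() ;

            // The parser calls send() for every triple as it is read, so
            // the whole dump is never held in memory at once.
            Sink<Triple> sink = new Sink<Triple>() {
                public void send(Triple t) {
                    // Keep only the triples of interest, e.g. rdfs:label.
                    String p = t.getPredicate().getURI() ;
                    if ( "http://www.w3.org/2000/01/rdf-schema#label".equals(p) )
                        mapped.getGraph().add(t) ;
                }
                public void flush() {}
                public void close() {}
            } ;

            RiotReader.parseTriples("dbpedia-dump.nt", sink) ;
            sink.close() ;
            System.out.println("Kept " + mapped.size() + " triples") ;
        }
    }

This answers the question at the bottom of the thread: instead of FileManager loading everything and iterating over statements, the sink sees each triple once, in sequence, and only the selected triples are added to the mapped model.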