On Thursday, January 06, 2011 07:09:50 am Devshree Sane wrote:

> I am trying to use DBpedia for one of my projects. All I want is to iterate
> over the nodes in this
> set <http://downloads.dbpedia.org/3.5.1/en/article_categories_en.nt.bz2>.
> It has 10925705 triples. The Model.read(..) methods read all triples at once
> in memory.

No, it reads into whatever store backs the Model. If the Model is an SDB
model, the triples go into the database. If it's a TDB model, they
get stored in TDB files.

The *default* Models are memory-based.

> However I have only 2GB RAM available, and hence I get "heap
> space errors" or "GC limit exceeded errors".
> Is there a BufferedIterator available for this purpose (which will not load
> the entire graph in memory)?

No, but ...

> If not, is there any other way this can be achieved?

... yes.

Subclass GraphSink and override performAdd to do whatever you want
with each triple. Then make a Model from that Graph and model.read
your RDF through it.

> (Persistent storage via TDB seems an overkill for this)

Why? You're all set up for doing it again then, and you can run
ad-hoc local queries against the data if you want to.

> I am wondering why such a feature is not already in Jena?

No real call for it. A single pass through the data doesn't let
you exploit RDF very much -- you're just seeing triples in
pseudo-random order. (You could always sort the N-Triples
data to get clustering, but you're still not able to use cross-links.)
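
To illustrate the sorting remark, here is a minimal sketch in plain Java
(no Jena involved) of a single pass over N-Triples lines that have
already been sorted, so that all lines for one subject are adjacent. The
class name SortedTripleGroups and the methods subjectOf and
forEachSubjectGroup are hypothetical names made up for this example, and
it assumes each line is one well-formed triple whose subject is the text
before the first whitespace. It only groups lines by subject; it does
not parse the predicate or object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

public class SortedTripleGroups {

    /** Subject token of an N-Triples line: the text before the first space or tab. */
    public static String subjectOf(String line) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ' || c == '\t') {
                return line.substring(0, i);
            }
        }
        return line;
    }

    /**
     * Walks subject-sorted lines once and hands each run of lines that
     * share a subject to the handler. Only one group is held in memory
     * at a time.
     */
    public static void forEachSubjectGroup(Iterator<String> lines,
                                           BiConsumer<String, List<String>> handler) {
        String current = null;
        List<String> group = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String subject = subjectOf(line);
            if (!subject.equals(current)) {
                if (current != null) {
                    handler.accept(current, group); // previous subject is complete
                }
                current = subject;
                group = new ArrayList<>();
            }
            group.add(line);
        }
        if (current != null) {
            handler.accept(current, group); // flush the last group
        }
    }

    public static void main(String[] args) {
        // Made-up sample data, already sorted by subject.
        List<String> sample = List.of(
            "<http://example.org/a> <http://example.org/p> <http://example.org/x> .",
            "<http://example.org/a> <http://example.org/p> <http://example.org/y> .",
            "<http://example.org/b> <http://example.org/p> <http://example.org/x> .");
        forEachSubjectGroup(sample.iterator(),
            (subject, group) -> System.out.println(subject + " has " + group.size() + " triple(s)"));
    }
}
```

For a real file, BufferedReader.lines().iterator() can supply the lines
one at a time. As noted above, this gives clustering by subject only;
following a link from one subject to another still needs a store you
can query.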

Chris

-- 
"Feel the world turning upside-down"              - The Reasoning, /Dark Angel/

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)
