Hey Andy,
Saw the Twitter message; a 29% load-speed increase is pretty nice. Glad I
could give you the excuse to upgrade :) Though it worries me that you
don't see the same exception I do. I consistently have loading issues
using the file posted at
http://www.kosmyna.com/mappingbased_properties_en.nt.bz2. I can get the
test program to complete by making the following changes, but it's slow
(30 minutes):
SystemTDB.setFileMode(FileMode.direct) ;
if ( true ) {
    String dir = "/home/jp/scratch/ssdtest/DB-X" ;
    FileOps.clearDirectory(dir) ;
    datasetGraph = TDBFactory.createDatasetGraph(dir) ;
}
Running the program with the section of code below fails every time:
//SystemTDB.setFileMode(FileMode.direct) ;
if ( true ) {
    String dir = "/home/jp/scratch/ssdtest/DB-X" ;
    FileOps.clearDirectory(dir) ;
    datasetGraph = TDBFactory.createDatasetGraph(dir) ;
}
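
For reference, here is the whole failing run collapsed into one
self-contained sketch (my reconstruction, not a verbatim copy of
ReportLoadOnSSD.java; the import paths are my best guess at the current
TDB package layout, and the class name and file paths are mine):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.openjena.atlas.lib.FileOps;

import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.tdb.base.block.FileMode;
import com.hp.hpl.jena.tdb.store.DatasetGraphTDB;
import com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader;
import com.hp.hpl.jena.tdb.sys.SystemTDB;

public class LoadRepro {
    public static void main(String[] args) throws IOException {
        // Uncommented, the next line (direct file mode) lets the load
        // complete, but slowly (~30 minutes here); left commented
        // (mapped mode, the default), the load fails every time.
        //SystemTDB.setFileMode(FileMode.direct) ;

        String dir = "/home/jp/scratch/ssdtest/DB-X" ;
        FileOps.clearDirectory(dir) ;   // always start from an empty directory
        DatasetGraphTDB datasetGraph = TDBFactory.createDatasetGraph(dir) ;

        InputStream in = new FileInputStream("mappingbased_properties_en.nt") ;
        new BulkLoader().loadDataset(datasetGraph, in, true) ;
    }
}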
The exception:
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:235)
    at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:94)
    at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:95)
    at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:41)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeRecords.getSplitKey(BPTreeRecords.java:141)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.split(BPTreeNode.java:435)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:387)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
    at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
    at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
    at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
    at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
    at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
    at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
http://dbpedia.org/resource/Spirea_X http://dbpedia.org/ontology/associatedBand http://dbpedia.org/resource/Adventures_in_Stereo
If I continue to let it run, I start seeing this error as well:
com.hp.hpl.jena.tdb.TDBException: No known block type for 4
    at com.hp.hpl.jena.tdb.base.block.BlockType.extract(BlockType.java:64)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.getType(BPTreeNodeMgr.java:166)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$200(BPTreeNodeMgr.java:22)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:136)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
    at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
    at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
    at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
    at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
    at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
    at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
    at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
    at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
Aside from shipping you my laptop, is there anything I can provide to
help track down the issue? I'm comfortable building TDB from source and
setting conditional breakpoints while debugging, if that would be of
any benefit.
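
If useful, I could also wrap the input stream in something like this to
report how far through the file the parser gets before it dies (a
sketch of my own, plain java.io, nothing TDB-specific):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Counts newlines so we know roughly which input line the loader died on. */
class LineCountingInputStream extends FilterInputStream {
    long lines = 0 ;

    LineCountingInputStream(InputStream in) { super(in) ; }

    @Override public int read() throws IOException {
        int b = super.read() ;
        if ( b == '\n' ) lines++ ;
        return b ;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len) ;
        for ( int i = 0 ; i < n ; i++ )
            if ( buf[off + i] == '\n' ) lines++ ;
        return n ;
    }
}

Passing one of these to the BulkLoader and printing 'lines' in a catch
block around loadDataset would pin down the exact triple it fails on.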
Thanks for your help.
-jp
On Tue, Jun 28, 2011 at 7:17 AM, Andy Seaborne
<[email protected]> wrote:
> Hi there,
>
> I now have an SSD (256G from Crucial) :-)
>
> /dev/sdb1 on /mnt/ssd1 type ext4 (rw,noatime)
>
> and I ran the test program on jamendo-rdf and on
> mappingbased_properties_en.nt, then on jamendo-rdf with existing data as in
> the test case.
>
> Everything works for me - the loads complete without an exception.
>
> Andy
>
> On 21/06/11 09:10, Andy Seaborne wrote:
>>
>>
>> On 21/06/11 06:01, jp wrote:
>>>
>>> Hey Andy
>>>
>>> I wasn't able to unzip the file
>>> http://people.apache.org/~andy/jamendo.nt.gz; however, I ran the test
>>> program on my dataset and received an out-of-memory exception. I then
>>> changed line 42 to true and received the original error. You can
>>> download the data file I have been testing with from
>>> http://www.kosmyna.com/mappingbased_properties_en.nt.bz2; unzipped,
>>> it's 2.6 GB. This file has consistently failed to load.
>>
>> downloads.dbpedia.org is back - I downloaded that file and loaded it
>> with the test program - no problems.
>>
>>> While trying other datasets and variations of the simple program, I
>>> had what seemed to be a successful BulkLoad; however, when I opened
>>> the dataset and tried to query it, there were no results. I don't
>>> have the exact details of this run but can try to reproduce it if
>>> you think it would be useful.
>>
>> Yes please. At this point, any detail is a help.
>>
>> Also, a complete log of the failed load of
>> mappingbased_properties_en.nt.bz2 would be useful.
>>
>> Having looked at the stacktraces and aligned them to the source code,
>> it appears the code passes an internal consistency check, then fails
>> on exactly the condition that check tests for.
>>
>> Andy
>>
>>>
>>> -jp
>>>
>>>
>>> On Mon, Jun 20, 2011 at 4:57 PM, Andy Seaborne
>>> <[email protected]> wrote:
>>>>
>>>> Fixed - sorry about that.
>>>>
>>>> Andy
>>>>
>>>> On 20/06/11 21:50, jp wrote:
>>>>>
>>>>> Hey andy,
>>>>>
>>>>> I assume the file you want me to run is
>>>>> http://people.apache.org/~andy/ReportLoadOnSSD.java
>>>>>
>>>>> When I try to download it I get a permissions error. Let me know when
>>>>> I should try again.
>>>>>
>>>>> -jp
>>>>>
>>>>> On Mon, Jun 20, 2011 at 3:30 PM, Andy Seaborne
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I tried to recreate this but couldn't; I don't have an SSD to
>>>>>> hand at the moment (being fixed :-)
>>>>>>
>>>>>> I've put my test program and the data from the jamendo-rdf you sent me
>>>>>> in:
>>>>>>
>>>>>> http://people.apache.org/~andy/
>>>>>>
>>>>>> so we can agree on an exact test case. This code is single-threaded.
>>>>>>
>>>>>> The conversion from .rdf to .nt wasn't pure.
>>>>>>
>>>>>> I tried running using the in-memory store as well.
>>>>>> downloads.dbpedia.org was down at the weekend - I'll try to get
>>>>>> the same dbpedia data.
>>>>>>
>>>>>> Could you run exactly what I was running? The file name needs
>>>>>> changing.
>>>>>>
>>>>>> You can also try uncommenting
>>>>>> SystemTDB.setFileMode(FileMode.direct) ;
>>>>>> and running it using non-mapped files in about 1.2 G of heap.
>>>>>>
>>>>>> Looking through the stacktrace, there is a point where the code
>>>>>> has passed an internal consistency test, then fails with something
>>>>>> that should be caught by that test - and the code is sync'ed or
>>>>>> single-threaded. This is, to put it mildly, worrying.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 18/06/11 16:38, jp wrote:
>>>>>>>
>>>>>>> Hey Andy,
>>>>>>>
>>>>>>> My entire program is run in one JVM, as follows.
>>>>>>>
>>>>>>> public static void main(String[] args) throws IOException {
>>>>>>>     // tdbDir is an empty directory; dbpediaData is the data file.
>>>>>>>     DatasetGraphTDB datasetGraph = TDBFactory.createDatasetGraph(tdbDir);
>>>>>>>
>>>>>>>     /* The BulkLoader has two ways of loading data, depending on
>>>>>>>        whether the dataset already exists. I did two runs, one with
>>>>>>>        the following two lines commented out, to test both ways the
>>>>>>>        BulkLoader runs. Hopefully this had the desired effect. */
>>>>>>>     datasetGraph.getDefaultGraph().add(new Triple(
>>>>>>>         Node.createURI("urn:hello"), RDF.type.asNode(),
>>>>>>>         Node.createURI("urn:house")));
>>>>>>>     datasetGraph.sync();
>>>>>>>
>>>>>>>     InputStream inputStream = new FileInputStream(dbpediaData);
>>>>>>>
>>>>>>>     BulkLoader bulkLoader = new BulkLoader();
>>>>>>>     bulkLoader.loadDataset(datasetGraph, inputStream, true);
>>>>>>> }
>>>>>>>
>>>>>>> The data can be found here:
>>>>>>> http://downloads.dbpedia.org/3.6/en/mappingbased_properties_en.nt.bz2
>>>>>>> I appended the ontology to the end of the file; it can be found here:
>>>>>>> http://downloads.dbpedia.org/3.6/dbpedia_3.6.owl.bz2
>>>>>>>
>>>>>>> The tdbDir is an empty directory.
>>>>>>> On my system the error starts occurring after about 2-3 minutes,
>>>>>>> with 8-12 million triples loaded.
>>>>>>>
>>>>>>> Thanks for looking over this and please let me know if I can be of
>>>>>>> further assistance.
>>>>>>>
>>>>>>> -jp
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On Jun 17, 2011 9:29 am, andy wrote:
>>>>>>>>
>>>>>>>> jp,
>>>>>>>>
>>>>>>>> How does this fit with running:
>>>>>>>>
>>>>>>>> datasetGraph.getDefaultGraph().add(new Triple(
>>>>>>>>     Node.createURI("urn:hello"), RDF.type.asNode(),
>>>>>>>>     Node.createURI("urn:house")));
>>>>>>>> datasetGraph.sync();
>>>>>>>>
>>>>>>>> Is the preload of one triple a separate JVM or the same JVM as the
>>>>>>>> BulkLoader call - could you provide a single complete minimal
>>>>>>>> example?
>>>>>>>>
>>>>>>>> In attempting to reconstruct this, I don't want to hide the
>>>>>>>> problem by
>>>>>>>> guessing how things are wired together.
>>>>>>>>
>>>>>>>> Also - exactly which dbpedia file are you loading (URL?) - though
>>>>>>>> I doubt the exact data is the cause here.
>>>>>>
>>>>
>