tdbloader2 was able to load the file. The log can be found here:
http://www.kosmyna.com/tdbloader2.log

I guess the questions now are: what's the difference between tdbloader2 and
the test application? And why does tdbloader fail?
-jp

On Tue, Jun 28, 2011 at 7:21 PM, jp <[email protected]> wrote:
> Sorry for any confusion; tdbloader2 is working fine, I had a typo in my
> $PATH variable. I'll post results of the load asap.
>
> -jp
>
> On Tue, Jun 28, 2011 at 7:02 PM, jp <[email protected]> wrote:
>> The complete log file is over 13 GB. I have posted the first 5000 lines
>> here: http://www.kosmyna.com/ReportLoadOnSSD.log.5000lines
>>
>> The run of tdbloader failed as well; the first 5000 lines can be found
>> here: http://www.kosmyna.com/tdbloader.log.5000lines
>>
>> I could not run tdbloader2. I get the following error:
>> ./tdbloader2: line 14: make_classpath: command not found
>>
>> I have the TDBROOT environment variable correctly set and am using this
>> version of TDB:
>> http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/tags/TDB-0.8.10/bin
>>
>> -jp
>>
>> On Tue, Jun 28, 2011 at 4:30 PM, Andy Seaborne <[email protected]> wrote:
>>>> Aside from shipping you my laptop, is there anything I can provide you
>>>> with to help track down the issue?
>>>
>>> A complete log, with the exception, would help to identify the point
>>> where it fails. It's a possible clue.
>>>
>>> Could you also try running tdbloader and tdbloader2 to bulk load the
>>> files?
>>>
>>> Andy
>>>
>>> On 28/06/11 21:19, jp wrote:
>>>>
>>>> Hey Andy,
>>>>
>>>> Saw the twitter message; a 29% load speed increase is pretty nice. Glad
>>>> I could give you the excuse to upgrade :) Though it worries me that you
>>>> don't receive the same exception I do. I consistently have loading
>>>> issues using the file posted at
>>>> http://www.kosmyna.com/mappingbased_properties_en.nt.bz2. I can get
>>>> the test program to complete by making the following changes, but it's
>>>> slow (30 minutes).
>>>>
>>>> SystemTDB.setFileMode(FileMode.direct) ;
>>>>
>>>> if ( true ) {
>>>>     String dir = "/home/jp/scratch/ssdtest/DB-X" ;
>>>>     FileOps.clearDirectory(dir) ;
>>>>     datasetGraph = TDBFactory.createDatasetGraph(dir);
>>>> }
>>>>
>>>> Running the program with the sections of code below fails every time.
>>>>
>>>> //SystemTDB.setFileMode(FileMode.direct) ;
>>>>
>>>> if ( true ) {
>>>>     String dir = "/home/jp/scratch/ssdtest/DB-X" ;
>>>>     FileOps.clearDirectory(dir) ;
>>>>     datasetGraph = TDBFactory.createDatasetGraph(dir);
>>>> }
>>>>
>>>> The exception:
>>>> java.lang.IllegalArgumentException
>>>>     at java.nio.Buffer.position(Buffer.java:235)
>>>>     at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:94)
>>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:95)
>>>>     at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:41)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeRecords.getSplitKey(BPTreeRecords.java:141)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.split(BPTreeNode.java:435)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:387)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
>>>>     at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
>>>>     at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
>>>>     at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
>>>>     at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
>>>>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
>>>>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
>>>>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
>>>>     at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
>>>>
>>>> http://dbpedia.org/resource/Spirea_X
>>>> http://dbpedia.org/ontology/associatedBand
>>>> http://dbpedia.org/resource/Adventures_in_Stereo
>>>>
>>>> If I continue to let it run I start seeing this error as well:
>>>> com.hp.hpl.jena.tdb.TDBException: No known block type for 4
>>>>     at com.hp.hpl.jena.tdb.base.block.BlockType.extract(BlockType.java:64)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.getType(BPTreeNodeMgr.java:166)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$200(BPTreeNodeMgr.java:22)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:136)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
>>>>     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
>>>>     at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performAdd(TupleIndexRecord.java:48)
>>>>     at com.hp.hpl.jena.tdb.index.TupleIndexBase.add(TupleIndexBase.java:49)
>>>>     at com.hp.hpl.jena.tdb.index.TupleTable.add(TupleTable.java:54)
>>>>     at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:77)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:268)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$2.send(BulkLoader.java:244)
>>>>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:60)
>>>>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
>>>>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:122)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:159)
>>>>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:117)
>>>>     at com.nimblegraph.data.bin.ReportLoadOnSSD.main(ReportLoadOnSSD.java:68)
>>>>
>>>> Aside from shipping you my laptop, is there anything I can provide you
>>>> with to help track down the issue? I am comfortable building TDB from
>>>> source and setting conditional breakpoints while debugging if that can
>>>> be of any benefit.
>>>>
>>>> Thanks for your help.
>>>> -jp
>>>>
>>>> On Tue, Jun 28, 2011 at 7:17 AM, Andy Seaborne <[email protected]> wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> I now have an SSD (256G from Crucial) :-)
>>>>>
>>>>> /dev/sdb1 on /mnt/ssd1 type ext4 (rw,noatime)
>>>>>
>>>>> and I ran the test program on jamendo-rdf and on
>>>>> mappingbased_properties_en.nt, then on jamendo-rdf with existing data
>>>>> as in the test case.
>>>>>
>>>>> Everything works for me - the loads complete without an exception.
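A note for readers following the first stack trace: java.nio.Buffer.position(int) throws IllegalArgumentException whenever the requested position is negative or greater than the buffer's limit, so the frame above it (RecordFactory.buildFrom) must be asking for an offset outside the block's ByteBuffer. A minimal stdlib-only sketch of that failure mode (the 8-byte size is an illustrative assumption, not TDB's actual block size):

```java
import java.nio.ByteBuffer;

public class BufferPositionDemo {
    public static void main(String[] args) {
        // A small buffer standing in for a TDB record block
        // (the size is an illustrative assumption).
        ByteBuffer bb = ByteBuffer.allocate(8);

        bb.position(4); // fine: 0 <= 4 <= limit (8)

        try {
            // An offset past the limit, as if a record index were
            // computed against the wrong block size.
            bb.position(16);
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException: position 16 > limit "
                    + bb.limit());
        }
    }
}
```

This doesn't identify the root cause, but it narrows the symptom: at the point of failure, either the record offset or the buffer's limit is wrong.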
>>>>>
>>>>> Andy
>>>>>
>>>>> On 21/06/11 09:10, Andy Seaborne wrote:
>>>>>>
>>>>>> On 21/06/11 06:01, jp wrote:
>>>>>>>
>>>>>>> Hey Andy,
>>>>>>>
>>>>>>> I wasn't able to unzip the file
>>>>>>> http://people.apache.org/~andy/jamendo.nt.gz; however, I ran it on my
>>>>>>> dataset and I received an out-of-memory exception. I then changed line
>>>>>>> 42 to true and received the original error. You can download the data
>>>>>>> file I have been testing with from
>>>>>>> http://www.kosmyna.com/mappingbased_properties_en.nt.bz2; unzipped it's
>>>>>>> 2.6 GB. This file has consistently failed to load.
>>>>>>
>>>>>> downloads.dbpedia.org is back - I downloaded that file and loaded it
>>>>>> with the test program - no problems.
>>>>>>
>>>>>>> While trying other datasets and variations of the simple program I had
>>>>>>> what seemed to be a successful BulkLoad; however, when I opened the
>>>>>>> dataset and tried to query it there were no results. I don't have the
>>>>>>> exact details of this run but can try to reproduce it if you think it
>>>>>>> would be useful.
>>>>>>
>>>>>> Yes please. At this point, any detail is a help.
>>>>>>
>>>>>> Also, a complete log of the failed load of
>>>>>> mappingbased_properties_en.nt.bz2 would be useful.
>>>>>>
>>>>>> Having looked at the stack traces, and aligned them to the source code,
>>>>>> it appears the code passes an internal consistency check, then fails on
>>>>>> something that the check itself tests for.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>>> -jp
>>>>>>>
>>>>>>> On Mon, Jun 20, 2011 at 4:57 PM, Andy Seaborne <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Fixed - sorry about that.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On 20/06/11 21:50, jp wrote:
>>>>>>>>>
>>>>>>>>> Hey Andy,
>>>>>>>>>
>>>>>>>>> I assume the file you want me to run is
>>>>>>>>> http://people.apache.org/~andy/ReportLoadOnSSD.java
>>>>>>>>>
>>>>>>>>> When I try to download it I get a permissions error. Let me know
>>>>>>>>> when I should try again.
>>>>>>>>>
>>>>>>>>> -jp
>>>>>>>>>
>>>>>>>>> On Mon, Jun 20, 2011 at 3:30 PM, Andy Seaborne <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> I tried to recreate this but couldn't; I don't have an SSD to hand
>>>>>>>>>> at the moment (being fixed :-)
>>>>>>>>>>
>>>>>>>>>> I've put my test program and the data from the jamendo-rdf you
>>>>>>>>>> sent me in:
>>>>>>>>>>
>>>>>>>>>> http://people.apache.org/~andy/
>>>>>>>>>>
>>>>>>>>>> so we can agree on an exact test case. This code is single
>>>>>>>>>> threaded.
>>>>>>>>>>
>>>>>>>>>> The conversion from .rdf to .nt wasn't pure.
>>>>>>>>>>
>>>>>>>>>> I tried running using the in-memory store as well.
>>>>>>>>>> downloads.dbpedia.org was down at the weekend - I'll try to get
>>>>>>>>>> the same dbpedia data.
>>>>>>>>>>
>>>>>>>>>> Could you run exactly what I was running? The file name needs
>>>>>>>>>> changing.
>>>>>>>>>>
>>>>>>>>>> You can also try uncommenting
>>>>>>>>>>     SystemTDB.setFileMode(FileMode.direct) ;
>>>>>>>>>> and run it using non-mapped files in about 1.2 G of heap.
>>>>>>>>>>
>>>>>>>>>> Looking through the stack trace, there is a point where the code
>>>>>>>>>> has passed an internal consistency test, then fails with something
>>>>>>>>>> that should be caught by that test - and the code is sync'ed or
>>>>>>>>>> single threaded. This is, to put it mildly, worrying.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On 18/06/11 16:38, jp wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hey Andy,
>>>>>>>>>>>
>>>>>>>>>>> My entire program is run in one JVM as follows.
>>>>>>>>>>>
>>>>>>>>>>> public static void main(String[] args) throws IOException {
>>>>>>>>>>>     DatasetGraphTDB datasetGraph = TDBFactory.createDatasetGraph(tdbDir);
>>>>>>>>>>>
>>>>>>>>>>>     /* I saw the BulkLoader had two ways of loading data based on
>>>>>>>>>>>        whether the dataset existed already. I did two runs, one with
>>>>>>>>>>>        the following two lines commented out, to test both ways the
>>>>>>>>>>>        BulkLoader runs. Hopefully this had the desired effect. */
>>>>>>>>>>>     datasetGraph.getDefaultGraph().add(new Triple(Node.createURI("urn:hello"),
>>>>>>>>>>>         RDF.type.asNode(), Node.createURI("urn:house")));
>>>>>>>>>>>     datasetGraph.sync();
>>>>>>>>>>>
>>>>>>>>>>>     InputStream inputStream = new FileInputStream(dbpediaData);
>>>>>>>>>>>
>>>>>>>>>>>     BulkLoader bulkLoader = new BulkLoader();
>>>>>>>>>>>     bulkLoader.loadDataset(datasetGraph, inputStream, true);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> The data can be found here:
>>>>>>>>>>> http://downloads.dbpedia.org/3.6/en/mappingbased_properties_en.nt.bz2
>>>>>>>>>>> I appended the ontology to the end of the file; it can be found here:
>>>>>>>>>>> http://downloads.dbpedia.org/3.6/dbpedia_3.6.owl.bz2
>>>>>>>>>>>
>>>>>>>>>>> The tdbDir is an empty directory.
>>>>>>>>>>> On my system the error starts occurring after about 2-3 minutes
>>>>>>>>>>> and 8-12 million triples loaded.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for looking over this and please let me know if I can be
>>>>>>>>>>> of further assistance.
>>>>>>>>>>>
>>>>>>>>>>> -jp
>>>>>>>>>>> [email protected]
>>>>>>>>>>>
>>>>>>>>>>> On Jun 17, 2011 9:29 am, andy wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> jp,
>>>>>>>>>>>>
>>>>>>>>>>>> How does this fit with running:
>>>>>>>>>>>>
>>>>>>>>>>>>     datasetGraph.getDefaultGraph().add(new Triple(Node.createURI("urn:hello"),
>>>>>>>>>>>>         RDF.type.asNode(), Node.createURI("urn:house")));
>>>>>>>>>>>>     datasetGraph.sync();
>>>>>>>>>>>>
>>>>>>>>>>>> Is the preload of one triple a separate JVM or the same JVM as
>>>>>>>>>>>> the BulkLoader call - could you provide a single complete minimal
>>>>>>>>>>>> example?
>>>>>>>>>>>>
>>>>>>>>>>>> In attempting to reconstruct this, I don't want to hide the
>>>>>>>>>>>> problem by guessing how things are wired together.
>>>>>>>>>>>>
>>>>>>>>>>>> Also - exactly which dbpedia file are you loading (URL?),
>>>>>>>>>>>> although I doubt the exact data is the cause here.
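On the "make_classpath: command not found" report earlier in the thread: line 14 of the tdbloader2 script evidently invokes a helper called make_classpath that it expects to resolve via $PATH, so a $PATH typo produces exactly that shell error rather than a Java one. A quick sketch of the diagnosis (the empty directory is a placeholder assumption):

```shell
# With a PATH that lacks the TDB bin directory, the shell cannot resolve
# the helper script -- the same symptom tdbloader2 reported at its line 14.
mkdir -p /tmp/empty-bin
if ! PATH=/tmp/empty-bin command -v make_classpath >/dev/null 2>&1; then
    echo 'make_classpath not found: check that $TDBROOT/bin is on $PATH'
fi
```

As noted in the thread, fixing the $PATH entry made tdbloader2 run; rerunning `command -v make_classpath` after editing the variable confirms the fix without starting a load.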
