Hi,

Processing time mainly depends on the speed of the hard disk. On my
MacBook Pro with an SSD I usually see processing times of around
1ms/item. Doing the same on a laptop hard disk you can see processing
times of 100ms/item. RAM and CPU power do not have a big influence on
the process.
Other tips:

* Usually the import of the RDF data to Jena TDB takes as long as the
actual indexing process. RDF import speed can be greatly improved by
assigning more RAM (-Xmx) to the JVM (see the first sketch below). As
long as Jena can hold the RDF data in memory (memory mapped files) the
import will run at around 100k triples/sec. If it needs to access an
SSD you should see speeds of about 10k triples/sec. On normal hard
disks import speed will drop even further.

* It is good advice to kill the Indexing Tool after it has imported
all triples to Jena TDB (and restart it for the actual indexing). This
ensures that all the RAM still occupied by the Jena TDB memory mapped
files is freed and can be better used to speed up the entity lookup
during the indexing process.

* If you do not plan to use all properties included in the imported
RDF files you should configure the mapping.txt file accordingly (see
the second sketch below). This can dramatically reduce the size of the
resulting Solr index, may speed up indexing, and can also improve
performance at usage time.

* interlanguage_links_en.nt.bz2 contains a lot of RDF statements. Do
you really need those? I usually only include this file if I want to
interconnect information from different dbpedia language versions
(e.g. to copy over the rdf:type statements of the English dbpedia to
the German language dbpedia dump). In that case I import the English
and German types and use the information of the interlanguage links to
write an LDPath statement (see the third sketch below).
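A minimal sketch of such an invocation. The jar name/version and the
heap size are just examples; adapt them to your build and machine
(with 16GB of RAM, something like -Xmx8g leaves the other half to the
OS for caching the TDB files):

    # give the JVM 8g of heap for the TDB import and indexing run
    java -Xmx8g -jar org.apache.stanbol.entityhub.indexing.dbpedia-0.10.0-jar-with-dependencies.jar index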
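To illustrate the mapping.txt point, a purely illustrative reduced
mapping that only keeps a handful of fields; the property selection
below is an assumption, so list whatever your use case actually needs:

    # mapping.txt: only the listed properties get indexed;
    # everything else is dropped from the Solr index
    rdfs:label
    rdfs:comment
    rdf:type
    foaf:homepage
    geo:lat
    geo:long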
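And an (untested) sketch of the kind of LDPath statement I mean,
assuming the interlanguage links are imported as owl:sameAs triples:

    rdf:type = rdf:type | (owl:sameAs / rdf:type) :: xsd:anyURI ;

This collects the rdf:type values asserted on the German resource
itself as well as those of the English resource it is linked to.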
hope this helps
best
Rupert

On Fri, May 24, 2013 at 4:32 PM, Manish Aggarwal <[email protected]> wrote:
> Hi,
>
> I am trying to index 16 dbpedia files (names given below) using the
> dbpedia indexing tool provided with entityhub.
>
> The process has been running for the last 3 days but is not
> finishing. I am using a machine with 16GB RAM and dual cores. Is
> there some way to know how much time the whole process should take on
> a particular machine?
> Also, are there any tricks to speed up the process (like running some
> things in parallel)?
>
> The dbpedia files on which I am running the indexing process are:
>
> article_categories_en.nt.bz2
> category_labels_en.nt.bz2
> dbpedia_3.8.owl.bz2
> geo_coordinates_en.nt.bz2
> homepages_en.nt.bz2
> infobox_properties_en.nt.bz2
> infobox_property_definitions_en.nt.bz2
> instance_types_en.nt.bz2
> interlanguage_links_en.nt.bz2
> long_abstracts_en.nt.bz2
> mappingbased_properties_en.nt.bz2
> persondata_en.nt.bz2
> pnd_en.nt.bz2
> redirects_en.nt.bz2
> skos_categories_en.nt.bz2
> specific_mappingbased_properties_en.nt.bz2
>
> Regards,
> Manish

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen