Hi,

Processing time mainly depends on the speed of the hard disk. On my
MacBook Pro with an SSD I usually see processing times of around
1ms/item. Doing the same on a laptop hard disk you can see processing
times of 100ms/item. RAM and CPU power do not have a big influence on
the process.

Other tips:

* Usually the import of the RDF data to Jena TDB takes as long as the
actual indexing process. RDF import speed can be greatly improved by
assigning more RAM (-Xmx) to the JVM (see the example below). As long
as Jena can hold the RDF data in memory (memory mapped files) the
import will run at around 100k triples/sec. If it needs to access an
SSD you should see speeds of about 10k triples/sec. On normal hard
disks import speed will drop even further.
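
For example, starting the tool with an 8 GB heap (a sketch only: the
jar name is a placeholder, the heap size should match your machine,
and I assume the usual 'index' command of the Entityhub indexing
tools):

    java -Xmx8g -jar <dbpedia-indexing-tool-jar> index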

* It is a good idea to kill the Indexing Tool after it has imported
all triples to Jena TDB and then to restart it. This ensures that all
the RAM still occupied by the Jena TDB memory mapped files is freed
and can be better used to speed up the Entity lookup during the
indexing process.
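
A minimal sketch of that restart cycle (it assumes the tool picks up
the already-populated Jena TDB store on the second run, so the import
phase is not repeated):

    java -Xmx8g -jar <dbpedia-indexing-tool-jar> index
    # stop it (Ctrl+C) once the import to Jena TDB has completed,
    # then start it again with a fresh heap:
    java -Xmx8g -jar <dbpedia-indexing-tool-jar> index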

* If you do not plan to use all properties included in the imported
RDF files you should configure the mappings.txt file accordingly (see
the snippet below). This can dramatically reduce the size of the
resulting Solr index, may speed up indexing and also improve the
performance at usage time.
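
An illustrative mappings.txt snippet (my example, using the Entityhub
field mapping syntax; the prefixes, the property selection and the
'#' comment lines are assumptions that depend on your setup):

    # only properties listed here end up in the Solr index
    rdfs:label
    rdf:type
    # copy DBpedia abstracts over to rdfs:comment
    dbp-ont:abstract > rdfs:comment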

* interlanguage_links_en.nt.bz2 contains a lot of RDF statements. Do
you really need those? I usually only include this file if I want to
interconnect information from different DBpedia language versions
(e.g. copy over the rdf:type statements of the English DBpedia to the
German language DBpedia dump). In this case I import the English and
German types and use the information of the inter-language links to
write an LDPath statement (see the sketch below).
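
A rough sketch of such an LDPath statement (my example; it assumes
the inter-language links are imported as owl:sameAs triples):

    @prefix owl : <http://www.w3.org/2002/07/owl#> ;
    /* local rdf:type plus the types of the linked resource */
    type = rdf:type | (owl:sameAs / rdf:type) ;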

hope this helps
best
Rupert


On Fri, May 24, 2013 at 4:32 PM, Manish Aggarwal <[email protected]> wrote:
> Hi,
>
> I am trying to index 16 DBpedia files (names given below) using the
> DBpedia indexing tool provided with the Entityhub.
>
> The process has been running for the last 3 days but is not
> finishing. I am using a machine with 16GB RAM and dual cores. Is
> there some way to know how much time the whole process should take
> on a particular machine?
> Also, are there any tricks to speed up the process (like running
> some things in parallel)?
>
>
> Dbpedia files on which I am running the indexing process are:
>
> article_categories_en.nt.bz2
> category_labels_en.nt.bz2
> dbpedia_3.8.owl.bz2
> geo_coordinates_en.nt.bz2
> homepages_en.nt.bz2
> infobox_properties_en.nt.bz2
> infobox_property_definitions_en.nt.bz2
> instance_types_en.nt.bz2
> interlanguage_links_en.nt.bz2
> long_abstracts_en.nt.bz2
> mappingbased_properties_en.nt.bz2
> persondata_en.nt.bz2
> pnd_en.nt.bz2
> redirects_en.nt.bz2
> skos_categories_en.nt.bz2
> specific_mappingbased_properties_en.nt.bz2
>
> Regards,
> Manish



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
