Hi Rupert,
After the last mail thread, I performed another fixit run on the dump and then restarted the indexing. It ran for almost 3 days and generated an index. However, the Freebase index file is extremely small and seems inaccurate. Please find my observations at the end of this mail.

A couple of questions come to mind:

a. Does the process generate any particular log/error file besides printing to stdout/stderr?
b. Must the Stanbol full launcher be running the whole time while indexing is going on?
c. Could it cause issues if the machine is disconnected from the internet for a couple of minutes?

I would really appreciate it if you could shed some light on what could be wrong, or on a potential approach to nail down this issue. If needed, I am happy to share any additional logs/properties.

With best regards,
Rajan

*1. Configuration changes*

a. set ns-prefix-state=false *[within /indexing/config/iditerator.properties]*
b. add empty space mapping to http://rdf.freebase.com/ns/ *[within namespaceprefix.mappings]*
c. enable a bunch of properties within mappings.txt, such as the following:

fb:music.artist.genre
fb:music.artist.label
fb:music.artist.album

*2. Contents of indexing/dist directory*

-rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
-rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar

*3. Contents of /tmp/freebase/indexing/resources/imported directory*

-rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz

*4. Contents of /tmp/freebase/indexing/resources directory*

-rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt

*5. The indexer log*

*04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
*04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 2429800000 triples (80.97554853864854%)*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples index phase*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples index phase*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load*
*04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Completed: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving imported file freebase.nt.gz to imported/freebase.nt.gz
04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in 157675 seconds
04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File {} because of unknown extension
04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 seconds
04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files imported in 157675 seconds
04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ...
04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files imported in 0 seconds
04:32:56,971 [main] INFO impl.IndexerImpl - ... delete existing IndexedEntityId file /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed
04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation completed
04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ...
04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ...
04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace ' http://prefix.cc/nsogi:' invalid -> mapping ignored!
04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace ' http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace ' http://bio2rdf.org/chebi:' invalid -> mapping ignored!
04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace ' http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace ' http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace ' http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace ' http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace ' http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace ' http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace ' http://webofcode.org/wfn/call:' invalid -> mapping ignored!
04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace ' http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace ' http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace ' http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace ' http://www.kinjal.com/condition:' invalid -> mapping ignored!
05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > current sequence : 0
05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > new sequence: 1
05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 1
05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Indexing: Entity Processor Deamon completed (sequence=1) ...
05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > current sequence : 1
05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > new sequence: 2
05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 2
05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Indexing: Entity Perstisting Deamon completed (sequence=2) ...
05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > current sequence : 2
05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > new sequence: 3
05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 3
*05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): processing: -1.000ms/item | queue: -1.000ms*
05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - source : -1.000ms/item
05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - processing: -1.000ms/item
05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - store : -1.000ms/item
05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed (sequence=3) ...
05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > current sequence : 3
05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > new sequence: 4
05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - > current sequence : 4
05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed
05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ...
05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ...
05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....

On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:

> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
> > Hi Rupert and Antonio,
> >
> > Thanks a lot for the reply.
> >
> > I started to follow Rupert's suggestion, however it failed again at
> >
> > 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal escape sequence value: $ (0x24)
> >
> > Is there any way it can be resolved for the entire file?
>
> The indexing tool uses Apache Jena, and those are Jena parsing errors.
> So the Jena mailing lists would be the better place to look for
> answers.
> This specific issue looks like an invalid URI that is not fixed by the
> fixit script.
>
> > I requested access to the latest BaseKB bucket, as it doesn't seem to be
> > open.
> >
> > s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
> > ERROR: Access to bucket 'basekb-now' was denied
> >
> > *A couple of additional questions:*
> >
> > *1. indexing enhancements:*
> > What settings/properties can one tweak to get the most out of the indexing?
>
> In general you only want the information needed for your application
> case in the index.
> For EntityLinking only labels and type are required.
> Additional properties will only be used for dereferencing Entities. So
> this will depend on your application needs (your dereferencing
> configuration).
>
> In general I try to exclude as much information as possible from the
> index to keep the size of the Solr index as small as possible.
>
> > a. for ex. domain specific such as Pharmaceutical, Law etc... within
> > freebase
> > b. potential optimizations to speed up the overall indexing
>
> Most of the time will be needed to load the Freebase dump into Jena
> TDB. Even with an SSD-equipped server this will take several days.
> Assigning more RAM will speed up this process as Jena TDB can cache
> more things in RAM.
>
> Usually it is a good idea to cancel the indexing process after the
> importing of the RDF data has finished (and the indexing of the
> Entities has started). This is because after importing, all the RAM will
> be used by Jena TDB for caching stuff that is no longer needed in the
> read-only operations during indexing. So a fresh start can speed up
> the indexing part of the process.
>
> Also have a look at the Freebase Indexing Tool Readme
>
> > *2. demo:*
> > I see that, in recent github commit(s) the eHealth and other demos have
> > been commented out. How can I get the demo source code and other components
> > for these demos? I prefer to build it myself to see the power of Stanbol.
>
> The eHealth demo is still in the 0.12 branch [1]. This is fully
> compatible with the trunk version.
>
> > *3. custom vocabulary:*
> > Suppose I have a custom vocabulary in CSV format. Is there a preferred way
> > to upload it to Stanbol and have it recognize my entities?
>
> Google Refine [2] with the RDF extension [3]. You can also try to use
> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> but AFAIK this combination is not so stable and might not work at all.
>
> * Google Refine allows you to import your CSV file.
> * Clean it up (if necessary)
> * The RDF extension allows you to map your CSV data to RDF
> * Based on this mapping you can save your data as RDF
> * After that you can import the RDF data to Apache Stanbol
>
> hope this helps
> best
> Rupert
>
> > Thanks in advance,
> > Rajan
>
> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> [2] https://code.google.com/p/google-refine/
> [3] http://refine.deri.ie/
> [4] http://openrefine.org/
>
> > On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >
> >> Hi Rajan,
> >>
> >> I think this is because you named your file
> >> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >> is not provided by the file extension. Renaming the file to
> >> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >>
> >> The suggestion of Antonio to use BaseKB is also a valid option.
> >>
> >> best
> >> Rupert
> >>
> >> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales <ape...@zaizi.com> wrote:
> >> > Hi Rajan
> >> >
> >> > The Freebase dump contains some things that do not fit very well with the
> >> > indexer.
> >> > I advise you to use the dump provided by BaseKB (http://basekb.com), which
> >> > is a curated Freebase dump.
> >> > I did not have any problem indexing it using that dump.
> >> >
> >> > Regards
> >> >
> >> > On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I am working on indexing Freebase data within EntityHub and observed
> >> >> the following issue:
> >> >>
> >> >> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
> >> >>
> >> >> I would appreciate any help pertaining to this issue.
> >> >>
> >> >> Thanks,
> >> >> Rajan
> >> >>
> >> >> *Steps followed:*
> >> >>
> >> >> *1. Initialization:*
> >> >> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
> >> >>
> >> >> *2. Download the data:*
> >> >> Download the data from https://developers.google.com/freebase/data
> >> >>
> >> >> *3. Performed execution of fbrankings-uri.sh*
> >> >> It generated incoming_links.txt under the resources directory as follows
> >> >>
> >> >> 10888430 m.0kpv11
> >> >> 3741261 m.019h
> >> >> 2667858 m.0775xx5
> >> >> 2667804 m.0775xvm
> >> >> 1875352 m.01xryvm
> >> >> 1739262 m.05zppz
> >> >> 1369590 m.01xrzlb
> >> >>
> >> >> *4. Performed execution of fixit script*
> >> >>
> >> >> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >> >>
> >> >> *5. Rename the fixed file to freebase.rdf.gz and copy it*
> >> >> to indexing/resources/rdfdata
> >> >>
> >> >> *6. config/iditer.properties file has the following setting*
> >> >> #id-namespace=http://freebase.com/
> >> >> ns-prefix-state=false
> >> >>
> >> >> *7. Performed run of the following command:*
> >> >> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
> >> >>
> >> >> The error dump on stdout is as follows:
> >> >>
> >> >> 01:37:32,884 [Thread-0] INFO solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >> >> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
> >> >> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples data phase
> >> >> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Load empty triples table
> >> >> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
> >> >> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase
> >> >> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load
> >> >> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
> >> >>
> >> >> Additional Reference Point:
> >> >>
> >> >> *Original Freebase dump size:* 31025015397 May 14 18:10 freebase-rdf-latest.gz
> >> >> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
> >> >> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
> >>
> >> --
> >> | Rupert Westenthaler rupert.westentha...@gmail.com
> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> | A-5500 Bischofshofen
> >> | REDLINK.CO ..........................................................................
> >> | http://redlink.co/
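
Regarding the question at the top of this thread about how to nail down the issue: one possible first step is to summarize the indexer output automatically instead of reading it by eye. The Python sketch below is not part of Stanbol. The function names (`summarize_indexer_log`, `looks_empty`) are hypothetical, and the regular expressions are inferred only from the log lines quoted in this mail, so other Stanbol versions may print different text. It counts WARN/ERROR lines and compares the number of triples loaded with the number of entities indexed.

```python
import re
import sys

# Patterns inferred from the log excerpt in this thread (an assumption,
# not taken from the Stanbol source code).
LOADED_RE = re.compile(r"Completed:\s+([\d,]+)\s+triples loaded")
INDEXED_RE = re.compile(r"Indexed\s+([\d,]+)\s+items")
LEVEL_RE = re.compile(r"\]\s+(WARN|ERROR)\s+")


def summarize_indexer_log(lines):
    """Collect triple/entity counts and WARN/ERROR tallies from log lines."""
    summary = {
        "triples_loaded": None,
        "entities_indexed": None,
        "WARN": 0,
        "ERROR": 0,
    }
    for line in lines:
        match = LOADED_RE.search(line)
        if match:
            summary["triples_loaded"] = int(match.group(1).replace(",", ""))
        match = INDEXED_RE.search(line)
        if match:
            summary["entities_indexed"] = int(match.group(1).replace(",", ""))
        match = LEVEL_RE.search(line)
        if match:
            summary[match.group(1)] += 1
    return summary


def looks_empty(summary):
    """True when triples were imported but no entity reached the index."""
    return bool(summary["triples_loaded"]) and summary["entities_indexed"] == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path of a saved copy of the indexer's stdout/stderr.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        result = summarize_indexer_log(handle)
    print(result)
    if looks_empty(result):
        print("Triples were loaded, but 0 entities were indexed.")
```

Run against a saved copy of the indexer output (one log entry per line), a result where `triples_loaded` is large and `entities_indexed` is 0, as in the log above, suggests the problem lies in the entity indexing phase rather than in the RDF import.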