Hi Rupert,

I finally got the Freebase index after a two-day run. For English only, the index is roughly 28 GB.

Surprisingly, after I installed it via the OSGi console, it created the Referenced Site and the SolrYard. However, the site is not visible among the Entityhub sites.

I configured the following parameters within the SolrYard:

a. "Allow Initialization" - checked
b. Index configuration: freebase.solrindex.zip

I also restarted a couple of times, but no luck.

Does it require any additional special configuration, e.g. a higher -Xmx setting or something else?

With best regards,
Rajan

On Tue, May 26, 2015 at 9:06 AM, <raja...@gmail.com> wrote:

> Hi,
>
> Accidentally, I wiped out the logs for a clean start. At the same time, I am
> planning to run on a higher-end AWS instance as well, so I will keep you
> posted.
>
> Thanks again for your continuous help.
>
> With best regards,
> Rajan
>
> Sent from my iPhone
>
> > On May 26, 2015, at 8:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >
> > Hi
> >
> >> On Tue, May 26, 2015 at 2:13 PM, <raja...@gmail.com> wrote:
> >> Hi Rupert,
> >>
> >> After the last failure, I am only using language=en and it still fails.
> >
> > Can you provide some lines of logging before the OOM? I would like
> > to be sure that it really happens during the Solr optimization phase.
> >
> >> Thanks for the timely answer. Just to double confirm: if I restarted
> >> the index command this morning with a higher -Xmx option, is it too late
> >> to run finalise?
> >
> > If the OOM exception really happened during the Solr optimization, calling
> >
> > java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
> >
> > will use the data of the previous indexing call and just repeat the
> > finalization steps.
> >
> > best
> > Rupert
> >
> >> With best regards,
> >> Rajan
> >>
> >> Sent from my iPhone
> >>
> >>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >>>
> >>> Hi Rajan
> >>>
> >>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com> wrote:
> >>>> Hi Rupert,
> >>>>
> >>>> Thanks for the reply.
> >>>>
> >>>> As per your suggestion, I made the necessary changes; however, it failed with
> >>>> "OutOfMemory" errors. At present, I am running with -Xmx48g, but at this
> >>>> point it's a trial-and-error approach with several days of effort being
> >>>> wasted.
> >>>
> >>> I guess you are getting the OutOfMemory while optimizing the Solr
> >>> index (right?). The README [1] explicitly notes that a high amount of
> >>> memory is needed by exactly this step of the indexing process.
> >>>
> >>> If the indexing fails at this step you can call the indexing tool with
> >>> the `finalise` command (instead of `index`) (see STANBOL-1047 [2]
> >>> for details). This will prevent the indexing from being repeated and only
> >>> execute the finalization steps (optimizing the Solr index and creating
> >>> the freebase.solrindex.zip file).
> >>>
> >>>> I am just throwing out an idea, but wanted to see
> >>>>
> >>>> a. Is it possible to publish a set of constraints and required parameters,
> >>>> i.e. with a minimal set of entities within mappings.txt, one needs to set
> >>>> these parameters?
> >>>
> >>> I do not understand this question. Do you want to filter entities
> >>> based on their information? If so you might want to have a look at the
> >>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
> >>> The generic RDF indexing tool has an example on how to use this
> >>> processor to filter entities based on their rdf:type values.
> >>>
> >>> See also the "Entity Filters" section of [3].
> >>>
> >>>> b. Is it possible to split the file based on subject, generate a smaller
> >>>> index for each subject, and merge afterwards?
> >>>
> >>> Yes. You can split up the dump (by subject) and import those parts in
> >>> different Indexing Tool instances (meaning different Jena TDB
> >>> instances). Importing 4 * 500 million triples to Jena TDB is supposed to
> >>> be much faster than 1 * 2 billion.
> >>>
> >>> If you still want to have all data in a single Entityhub Site you need
> >>> to script the indexing process:
> >>>
> >>> * call indexing for the first part
> >>> * after this finishes, link the {part1}/indexing/destination/indexes
> >>>   folder to {part2..n}/indexing/destination/indexes
> >>> * call indexing for the 2..n parts.
> >>>
> >>> As the indexing tool only adds additional information to the Solr
> >>> index you will get the union over all parts at the end of the process.
> >>> All parts need to use the full incoming_links.txt file because
> >>> otherwise the rankings would not be correct.
> >>>
> >>> The "Indexing Datasets separately" section of [3] describes a similar
> >>> trick of creating a union index over multiple datasets.
> >>>
> >>> best
> >>> Rupert
> >>>
> >>>> c. Work with BaseKB guys to also make it available at nominal charge?
> >>>>
> >>>> d. Maybe apply some Map/Reduce - extension of idea b
> >>>>
> >>>> With best regards,
> >>>> Rajan
> >>>
> >>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
> >>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
> >>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
> >>>
> >>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >>>>
> >>>>> Hi Rajan,
> >>>>>
> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >>>>>
> >>>>> You have not indexed a single entity, so something in your indexing
> >>>>> configuration is wrong. Most likely you are not correctly building the
> >>>>> URIs of the entities from the incoming_links.txt file. Can you provide
> >>>>> me an example line of the 'incoming_links.txt' file and the contents
> >>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
> >>>>> built.
> >>>>>
> >>>>> Short answers to the other questions:
> >>>>>
> >>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
> >>>>>> it ran for almost 3 days and generated index.
> >>>>>
> >>>>> That's good. It means you now have the Freebase dump in your Jena
> >>>>> TDB triple store. You will not need to repeat this (until you want to
> >>>>> use a newer dump). On the next call to the indexing tool it will
> >>>>> immediately start with the indexing step.
> >>>>>
> >>>>>> Couple questions come to mind:
> >>>>>>
> >>>>>> a. Is there any particular log/error file the process generates besides
> >>>>>> printing out on stdout/stderr?
> >>>>>
> >>>>> The indexer writes a zip archive with the IDs of all the indexed
> >>>>> entities. It's in the indexing/destination folder.
> >>>>>
> >>>>>> b. Is it a must-have to have stanbol full launcher running all the time
> >>>>>> while indexing is going on?
> >>>>>
> >>>>> No Stanbol instance is needed by the indexing process.
> >>>>>
> >>>>>> c. Is it possible that, if the machine is not connected to internet for
> >>>>>> couple minutes could cause some issues?
> >>>>>
> >>>>> No Internet connectivity is needed during indexing. Only if you want
> >>>>> to use the namespace prefix mappings of prefix.cc do you need to have
> >>>>> internet connectivity when starting the indexing tool.
> >>>>>
> >>>>> best
> >>>>> Rupert
> >>>>>
> >>>>>> I would really appreciate, if you can shed some light on "what could be
> >>>>>> wrong" or "potential approach to nail down this issue"? If you need, I am
> >>>>>> happy to share any additional logs/properties.
> >>>>>>
> >>>>>> With best regards,
> >>>>>> Rajan
> >>>>>>
> >>>>>> *1. Configuration changes*
> >>>>>>
> >>>>>> a. set ns-prefix-state=false *[within /indexing/config/iditerator.properties]*
> >>>>>> b. add empty space mapping to http://rdf.freebase.com/ns/ *[within namespaceprefix.mappings]*
> >>>>>> c. enable bunch of properties within mappings.txt such as following
> >>>>>>
> >>>>>> fb:music.artist.genre
> >>>>>> fb:music.artist.label
> >>>>>> fb:music.artist.album
> >>>>>>
> >>>>>> *2. Contents of indexing/dist directory*
> >>>>>>
> >>>>>> -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
> >>>>>> -rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
> >>>>>>
> >>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >>>>>>
> >>>>>> -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz
> >>>>>>
> >>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
> >>>>>>
> >>>>>> -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt
> >>>>>>
> >>>>>> *5. The indexer log*
> >>>>>>
> >>>>>> *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >>>>>> *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 2429800000 triples (80.97554853864854%)*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples index phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples index phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load*
> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Completed: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >>>>>> 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving imported file freebase.nt.gz to imported/freebase.nt.gz
> >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in 157675 seconds
> >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >>>>>> 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File {} because of unknown extension
> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 seconds
> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files imported in 157675 seconds
> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ...
> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files imported in 0 seconds
> >>>>>> 04:32:56,971 [main] INFO impl.IndexerImpl - ... delete existing IndexedEntityId file /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed
> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation completed
> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ...
> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ...
> >>>>>>
> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >>>>>> 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > current sequence : 0
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > new sequence: 1
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 1
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Indexing: Entity Processor Deamon completed (sequence=1) ...
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > current sequence : 1
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > new sequence: 2
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 2
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > current sequence : 2
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > new sequence: 3
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 3
> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): processing: -1.000ms/item | queue: -1.000ms*
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - source : -1.000ms/item
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - processing: -1.000ms/item
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - store : -1.000ms/item
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed (sequence=3) ...
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > current sequence : 3
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > new sequence: 4
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - > current sequence : 4
> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed
> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ...
> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ...
> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
> >>>>>> 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....
> >>>>>>
> >>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >>>>>>
> >>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
> >>>>>>>> Hi Rupert and Antonio,
> >>>>>>>>
> >>>>>>>> Thanks a lot for the reply.
> >>>>>>>>
> >>>>>>>> I started to follow Rupert's suggestion; however, it failed again at
> >>>>>>>>
> >>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal escape sequence value: $ (0x24)
> >>>>>>>>
> >>>>>>>> Is there any way it can be resolved for the entire file?
> >>>>>>>
> >>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors.
> >>>>>>> So the Jena mailing lists would be the better place to look for
> >>>>>>> answers.
> >>>>>>> This specific issue looks like an invalid URI that is not fixed by the
> >>>>>>> fixit script.
> >>>>>>>
> >>>>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to
> >>>>>>>> be open.
> >>>>>>>>
> >>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
> >>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
> >>>>>>>>
> >>>>>>>> *Couple additional questions:*
> >>>>>>>>
> >>>>>>>> *1. indexing enhancements:*
> >>>>>>>> What settings/properties can one tweak to gain the most out of the
> >>>>>>>> indexing?
> >>>>>>>
> >>>>>>> In general you only want the information needed for your application
> >>>>>>> case in the index.
> >>>>>>> For EntityLinking only labels and types are required.
> >>>>>>> Additional properties will only be used for dereferencing Entities. So
> >>>>>>> this will depend on your application needs (your dereferencing
> >>>>>>> configuration).
> >>>>>>>
> >>>>>>> In general I try to exclude as much information as possible from the
> >>>>>>> index to keep the size of the Solr index as small as possible.
> >>>>>>>
> >>>>>>>> a. for ex. domain specific such as Pharmaceutical, Law etc... within
> >>>>>>>> freebase
> >>>>>>>> b. potential optimizations to speed up the overall indexing
> >>>>>>>
> >>>>>>> Most of the time will be needed to load the Freebase dump into Jena
> >>>>>>> TDB. Even with an SSD-equipped server this will take several days.
> >>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
> >>>>>>> more things in RAM.
> >>>>>>>
> >>>>>>> Usually it is a good idea to cancel the indexing process after the
> >>>>>>> importing of the RDF data has finished (and the indexing of the
> >>>>>>> Entities has started). This is because after the import all the RAM will
> >>>>>>> be used by Jena TDB for caching stuff that is no longer needed in the
> >>>>>>> read-only operations during indexing. So a fresh start can speed up
> >>>>>>> the indexing part of the process.
> >>>>>>>
> >>>>>>> Also have a look at the Freebase Indexing Tool README.
> >>>>>>>
> >>>>>>>> *2. demo:*
> >>>>>>>> I see that in recent GitHub commit(s) the eHealth and other demos have
> >>>>>>>> been commented out. How can I get the demo source code and other
> >>>>>>>> components for these demos? I prefer to build it myself to see the
> >>>>>>>> power of Stanbol.
> >>>>>>>
> >>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >>>>>>> compatible with the trunk version.
> >>>>>>>
> >>>>>>>> *3. custom vocabulary:*
> >>>>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred
> >>>>>>>> way to upload it to Stanbol and have it recognize my entities?
> >>>>>>>
> >>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
> >>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> >>>>>>> but AFAIK this combination is not so stable and might not work at all.
> >>>>>>>
> >>>>>>> * Google Refine allows you to import your CSV file.
> >>>>>>> * Clean it up (if necessary)
> >>>>>>> * The RDF extension allows you to map your CSV data to RDF
> >>>>>>> * based on this mapping you can save your data as RDF
> >>>>>>> * after that you can import the RDF data to Apache Stanbol
> >>>>>>>
> >>>>>>> hope this helps
> >>>>>>> best
> >>>>>>> Rupert
> >>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>> Rajan
> >>>>>>>
> >>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >>>>>>> [2] https://code.google.com/p/google-refine/
> >>>>>>> [3] http://refine.deri.ie/
> >>>>>>> [4] http://openrefine.org/
> >>>>>>>
> >>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Rajan,
> >>>>>>>>>
> >>>>>>>>> I think this is because you named your file
> >>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >>>>>>>>> is not provided by the file extension. Renaming the file to
> >>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >>>>>>>>>
> >>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
> >>>>>>>>>
> >>>>>>>>> best
> >>>>>>>>> Rupert
> >>>>>>>>>
> >>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales <ape...@zaizi.com> wrote:
> >>>>>>>>>> Hi Rajan
> >>>>>>>>>>
> >>>>>>>>>> The Freebase dump contains some things that do not fit very well with
> >>>>>>>>>> the indexer.
> >>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
> >>>>>>>>>> which is a curated Freebase dump.
> >>>>>>>>>> I did not have any problem indexing it using that dump.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I am working on indexing Freebase data within EntityHub and observed
> >>>>>>>>>>> the following issue:
> >>>>>>>>>>>
> >>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
> >>>>>>>>>>>
> >>>>>>>>>>> I would appreciate any help pertaining to this issue.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Rajan
> >>>>>>>>>>>
> >>>>>>>>>>> *Steps followed:*
> >>>>>>>>>>>
> >>>>>>>>>>> *1. Initialization:*
> >>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
> >>>>>>>>>>>
> >>>>>>>>>>> *2. Download the data:*
> >>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
> >>>>>>>>>>>
> >>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
> >>>>>>>>>>> It generated incoming_links.txt under the resources directory as follows
> >>>>>>>>>>>
> >>>>>>>>>>> 10888430 m.0kpv11
> >>>>>>>>>>> 3741261 m.019h
> >>>>>>>>>>> 2667858 m.0775xx5
> >>>>>>>>>>> 2667804 m.0775xvm
> >>>>>>>>>>> 1875352 m.01xryvm
> >>>>>>>>>>> 1739262 m.05zppz
> >>>>>>>>>>> 1369590 m.01xrzlb
> >>>>>>>>>>>
> >>>>>>>>>>> *4. Performed execution of fixit script*
> >>>>>>>>>>>
> >>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >>>>>>>>>>>
> >>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it* to indexing/resources/rdfdata
> >>>>>>>>>>>
> >>>>>>>>>>> *6. config/iditer.properties file has following setting*
> >>>>>>>>>>> #id-namespace=http://freebase.com/
> >>>>>>>>>>> ns-prefix-state=false
> >>>>>>>>>>>
> >>>>>>>>>>> *7. Performed run of following command:*
> >>>>>>>>>>> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
> >>>>>>>>>>>
> >>>>>>>>>>> The error dump on stdout is as follows:
> >>>>>>>>>>>
> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples data phase
> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Load empty triples table
> >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase
> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load
> >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
> >>>>>>>>>>>
> >>>>>>>>>>> Additional Reference Point:
> >>>>>>>>>>>
> >>>>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 freebase-rdf-latest.gz
> >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
> >>>>>>>>>>> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> | Rupert Westenthaler rupert.westentha...@gmail.com
> >>>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
> >>>>>>>>> | A-5500 Bischofshofen
> >>>>>>>>> | REDLINK.CO ..........................................................................
> >>>>>>>>> | http://redlink.co/
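
For reference, the multi-part procedure Rupert outlines in this thread (index the first part, link its indexing/destination/indexes folder into the remaining parts, then index those) could be scripted roughly as below. This is only a sketch under stated assumptions: the part directory names, the JAR and XMX variables, and the helper names link_indexes and run_part are hypothetical and do not come from the Stanbol tooling. It also assumes every part directory was already set up with `init`, holds its slice of the dump plus the full incoming_links.txt, and that parts 2..n do not yet contain an indexes folder.

```shell
#!/bin/sh
# Sketch only: share one Solr index folder across several indexing-tool
# working directories. Directory layout and variable names are assumptions.

# link_indexes FIRST_PART OTHER_PART...
# Symlinks FIRST_PART/indexing/destination/indexes into every other part,
# so later indexing runs add to the same Solr index.
link_indexes() {
    first=$1
    shift
    # Resolve to an absolute path so the symlinks stay valid from any directory.
    src=$(cd "$first/indexing/destination/indexes" && pwd) || return 1
    for part in "$@"; do
        mkdir -p "$part/indexing/destination" || return 1
        # Assumes "$part" has no indexes folder yet; otherwise ln would
        # create the link inside the existing folder.
        ln -s "$src" "$part/indexing/destination/indexes" || return 1
    done
}

# run_part PART_DIR
# Runs the indexing tool inside one part directory.
# JAR must point to the indexing tool jar; XMX defaults to 32g (placeholder).
run_part() {
    (cd "$1" && java -Xmx"${XMX:-32g}" -jar "${JAR:?set JAR to the indexing tool jar}" index)
}
```

A run over three hypothetical parts would then be: `run_part part1`, followed by `link_indexes part1 part2 part3`, followed by `run_part part2` and `run_part part3`. Keeping the linking step in its own function makes it easy to check the symlinks before starting the long-running indexing calls.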