Hi Rupert, After the last failure, I am only using language=en and it still fails.
Thanks for the timely answer. Just to double-confirm: since I re-started the index command this morning with a higher -Xmx option, is it now too late to run finalise? With best regards, Rajan Sent from my iPhone > On May 26, 2015, at 7:47 AM, Rupert Westenthaler > <rupert.westentha...@gmail.com> wrote: > > Hi Rajan > >> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com> wrote: >> Hi Rupert, >> >> Thanks for the reply. >> >> As per your suggestion, I made the necessary changes, however it failed with >> "OutOfMemory" errors. At present, I am running with -Xmx48g, however at this >> point it's a trial-and-error approach with several days' effort being >> wasted. > > I guess you are getting the OutOfMemory while optimizing the Solr > Index (right?). The README [1] explicitly notes that a high amount of > memory is needed by exactly this step of the indexing process. > > If the indexing fails at this step you can call the indexing tool with > the `finalise` command (instead of `indexing`; see STANBOL-1047 [2] > for details). This will prevent the indexing from being repeated and only > execute the finalization steps (optimizing the Solr Index and creating > the freebase.solrindex.zip file). > > >> I am just throwing out an idea, but wanted to see >> >> a. Is it possible to publish a set of constraints and required parameters, >> i.e. with a minimal set of entities within mappings.txt, one needs to set >> these parameters? > > I do not understand this question. Do you want to filter entities > based on their information? If so you might want to have a look at the > `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`. > The generic RDF indexing tool has an example of how to use this > processor to filter entities based on their rdf:type values. > > See also the "Entity Filters" section of [3]. > >> >> b. Is it possible to split the file based on subject? Generate a smaller >> index for each subject and merge afterwards? > > Yes. 
You can split up the dump (by subject). Import those parts in > different Indexing Tool instances (meaning different Jena TDB > instances). Importing 4*500 million triples to Jena TDB is supposed to > be much faster than 1*2 billion. > > If you still want to have all data in a single Entityhub Site you need > to script the indexing process: > > * call indexing for the first part > * after this finishes link the {part1}/indexing/destination/indexes > folder to {part2..n}/indexing/destination/indexes > * call indexing for the 2..n parts. > > As the indexing tool only adds additional information to the Solr > Index you will get the union over all parts at the end of the process. > All parts need to use the full incoming_links.txt file because > otherwise the rankings would not be correct. > > The "Indexing Datasets separately" section of [3] describes a similar > trick of creating a union index over multiple datasets. > > > best > Rupert > >> c. Work with the BaseKB guys to also make it available at a nominal charge? >> >> d. Maybe apply some Map/Reduce - an extension of idea b >> >> With best regards, >> Rajan > > > > [1] > http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md > [2] https://issues.apache.org/jira/browse/STANBOL-1047 > [3] > http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md > >> >> >> >> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler < >> rupert.westentha...@gmail.com> wrote: >> >>> Hi Rajan, >>> >>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): >>> >>> You have not indexed a single entity. So something in your indexing >>> configuration is wrong. Most likely you are not correctly building the >>> URIs of the entities from the incoming_links.txt file. Can you provide >>> me an example line of the 'incoming_links.txt' file and the contents >>> of the 'iditerator.properties' file. 
Those specify how Entity URIs are >>> built. >>> >>> Short answers to the other questions: >>> >>> >>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote: >>>> it ran for almost 3 days and generated the index. >>> >>> That's good. It means you now have the Freebase dump in your Jena >>> TDB triple store. You will not need to repeat this (until you want to >>> use a newer dump). On the next call to the indexing tool it will >>> immediately start with the indexing step. >>> >>> >>>> >>>> A couple of questions come to mind: >>>> >>>> a. Is there any particular log/error file the process generates besides >>>> printing out on stdout/stderr? >>> >>> The indexer writes a zip archive with the IDs of all the indexed >>> entities. It's in the indexing/destination folder. >>> >>>> b. Is it a must-have to have the Stanbol full launcher running all the time >>>> while indexing is going on? >>> >>> No Stanbol instance is needed by the indexing process. >>> >>>> c. Is it possible that the machine not being connected to the internet for a >>>> couple of minutes could cause some issues? >>> >>> No internet connectivity is needed during indexing. Only if you want >>> to use the namespace prefix mappings of prefix.cc do you need to have >>> internet connectivity when starting the indexing tool. >>> >>> best >>> Rupert >>> >>>> >>>> I would really appreciate it if you could shed some light on "what could be >>>> wrong" or a "potential approach to nail down this issue". If needed, I am >>>> happy to share any additional logs/properties. >>>> >>>> With best regards, >>>> Rajan >>>> >>>> *1. Configuration changes* >>>> >>>> a. set ns-prefix-state=false* >>>> [within /indexing/config/iditerator.properties]* >>>> b. add empty space mapping to http://rdf.freebase.com/ns/* >>>> [within namespaceprefix.mappings]* >>>> c. enable a bunch of properties within mappings.txt such as the following >>>> >>>> fb:music.artist.genre >>>> fb:music.artist.label >>>> fb:music.artist.album >>>> >>>> *2. 
Contents of indexing/dist directory* >>>> >>>> -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip >>>> -rw-r--r-- 3457 May 22 05:11 >>>> org.apache.stanbol.data.site.freebase-1.0.0.jar >>>> >>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory* >>>> >>>> -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz >>>> >>>> *4. Contents of /tmp/freebase/indexing/resources directory* >>>> >>>> -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt >>>> >>>> *5. The indexer log* >>>> >>>> *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: >>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)* >>>> *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: >>>> 2429800000 triples (80.97554853864854%)* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>> triples data phase* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: >>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>> second]* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start >>>> triples index phase* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>> triples index phase* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>> triples load* >>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>> Completed: >>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>> second]* >>>> 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving >>>> imported file freebase.nt.gz to imported/freebase.nt.gz >>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in >>>> 157675 seconds >>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading >>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ... 
>>>> 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File >>> {} >>>> because of unknown extension >>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 >>>> seconds >>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files >>> imported >>>> in 157675 seconds >>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ... >>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files >>> imported >>>> in 0 seconds >>>> 04:32:56,971 [main] INFO impl.IndexerImpl - ... delete existing >>>> IndexedEntityId file >>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip >>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed >>>> 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation >>> completed >>>> 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ... >>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ... >>>> >>>> >>>> >>>> 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace ' >>>> http://prefix.cc/nsogi:' invalid -> mapping ignored! >>>> 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'category' valid , namespace ' >>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace ' >>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored! >>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace ' >>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored! 
>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace ' >>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace ' >>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace ' >>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored! >>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace ' >>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace ' >>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'call' valid , namespace ' >>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored! >>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace ' >>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace ' >>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored! >>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace ' >>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored! 
>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace ' >>>> http://www.kinjal.com/condition:' invalid -> mapping ignored! >>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO >>> impl.IndexerImpl >>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ... >>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>> impl.IndexerImpl >>>> - > current sequence : 0 >>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>> impl.IndexerImpl >>>> - > new sequence: 1 >>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>> impl.IndexerImpl >>>> - Send end-of-queue to Deamons with Sequence 1 >>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>> Indexing: Entity Processor Deamon completed (sequence=1) ... >>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>> current sequence : 1 >>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>> new sequence: 2 >>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>> Send end-of-queue to Deamons with Sequence 2 >>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>> impl.IndexerImpl - >>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ... 
>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>> impl.IndexerImpl - >>>>> current sequence : 2 >>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>> impl.IndexerImpl - >>>>> new sequence: 3 >>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>> impl.IndexerImpl - >>>> Send end-of-queue to Deamons with Sequence 3 >>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): >>>> processing: -1.000ms/item | queue: -1.000ms* >>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - - source : -1.000ms/item >>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - - processing: -1.000ms/item >>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - - store : -1.000ms/item >>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed >>>> (sequence=3) ... >>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - > current sequence : 3 >>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - > new sequence: 4 >>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4 >>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>> impl.IndexerImpl >>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ... >>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>> impl.IndexerImpl >>>> - > current sequence : 4 >>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed >>>> 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ... >>>> 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ... >>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing >>> finished >>>> ... 
>>>> 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation.... >>>> >>>> >>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler < >>>> rupert.westentha...@gmail.com> wrote: >>>> >>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote: >>>>>> Hi Rupert and Antonio, >>>>>> >>>>>> Thanks a lot for the reply. >>>>>> >>>>>> I started to follow Rupert's suggestion, however it failed again at >>>>>> >>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] >>>>> illegal >>>>>> escape sequence value: $ (0x24) -- Is there any way it can be resolved >>> for >>>>>> the entire file? >>>>> >>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors. >>>>> So the Jena mailing lists would be the better place to look for >>>>> answers. >>>>> This specific issue looks like an invalid URI that is not fixed by the >>>>> fixit script. >>>>> >>>>> >>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to >>> be >>>>>> open. >>>>>> >>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ >>>>>> --add-header="x-amz-request-payer: requester" >>>>>> ERROR: Access to bucket 'basekb-now' was denied >>>>>> >>>>>> >>>>>> *A couple of additional questions:* >>>>>> >>>>>> *1. indexing enhancements:* >>>>>> What settings/properties can one tweak to get the most out of the >>> indexing? >>>>> >>>>> In general you only want the information needed for your application >>>>> case in the index. >>>>> For EntityLinking only labels and type are required. >>>>> Additional properties will only be used for dereferencing Entities. So >>>>> this will depend on your application needs (your dereferencing >>>>> configuration). >>>>> >>>>> In general I try to exclude as much information as possible from the >>>>> index to keep the size of the Solr Index as small as possible. >>>>>> a. for ex. domain-specific such as Pharmaceutical, Law etc... within >>>>>> freebase >>>>>> b. 
potential optimizations to speed up the overall indexing >>>>> >>>>> Most of the time will be needed to load the Freebase dump into Jena >>>>> TDB. Even with an SSD-equipped server this will take several days. >>>>> Assigning more RAM will speed up this process as Jena TDB can cache >>>>> more things in RAM. >>>>> >>>>> Usually it is a good idea to cancel the indexing process after the >>>>> importing of the RDF data has finished (and the indexing of the >>>>> Entities has started). This is because after the import all the RAM will >>>>> be used by Jena TDB for caching data that is no longer needed in the >>>>> read-only operations during indexing. So a fresh start can speed up >>>>> the indexing part of the process. >>>>> >>>>> Also have a look at the Freebase Indexing Tool Readme >>>>> >>>>>> >>>>>> *2. demo:* >>>>>> I see that, in recent github commit(s), the eHealth and other demos >>> have >>>>>> been commented out. How can I get the demo source code and other >>> components >>>>> for >>>>>> these demos? I prefer to build it myself to see the power of stanbol. >>>>> >>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully >>>>> compatible with the trunk version. >>>>> >>>>>> *3. custom vocabulary:* >>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred >>> way >>>>>> to upload it to Stanbol and have it recognize my entities? >>>>> >>>>> Google Refine [2] with the RDF extension [3]. You can also try to use >>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version, >>>>> but AFAIK this combination is not so stable and might not work at all. >>>>> >>>>> * Google Refine allows you to import your CSV file. 
>>>>> * Clean it up (if necessary) >>>>> * The RDF extension allows you to map your CSV data to RDF >>>>> * based on this mapping you can save your data as RDF >>>>> * after that you can import the RDF data to Apache Stanbol >>>>> >>>>> hope this helps >>>>> best >>>>> Rupert >>>>> >>>>>> >>>>>> Thanks in advance, >>>>>> Rajan >>>>> >>>>> >>>>> >>>>> [1] >>> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/ >>>>> [2] https://code.google.com/p/google-refine/ >>>>> [3] http://refine.deri.ie/ >>>>> [4] http://openrefine.org/ >>>>> >>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler < >>>>>> rupert.westentha...@gmail.com> wrote: >>>>>> >>>>>>> Hi Rajan, >>>>>>> >>>>>>> I think this is because you named your file >>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF >>> format >>>>>>> is not provided by the file extension. Renaming the file to >>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue. >>>>>>> >>>>>>> The suggestion of Antonio to use BaseKB is also a valid option. >>>>>>> >>>>>>> best >>>>>>> Rupert >>>>>>> >>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales >>>>>>> <ape...@zaizi.com> wrote: >>>>>>>> Hi Rajan >>>>>>>> >>>>>>>> The Freebase dump contains some things that do not fit very well with >>>>> the >>>>>>>> indexer. >>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com >>> ) >>>>>>> which >>>>>>>> is a curated Freebase dump. >>>>>>>> I did not have any problem indexing it using that dump. >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> >>>>> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> I am working on indexing Freebase data within EntityHub and >>> observed >>>>>>>>> the following issue: >>>>>>>>> >>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] >>> Element >>>>> or >>>>>>>>> attribute do not match QName production: >>> QName::=(NCName':')?NCName. 
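Rupert's extension fix can be sketched as a small shell step (the filename is the one used in this thread; treat it as illustrative). Jena picks its parser from the file extension, so a plain ".gz" name makes it fall back to RDF/XML, which fails on the N-Triples dump with the QName error quoted above:

```shell
# Ensure the dump filename carries ".nt.gz" so Jena parses it as N-Triples.
DUMP=freebase-rdf-latest-fixed.gz
case "$DUMP" in
  *.nt.gz) FIXED="$DUMP" ;;                # extension already correct
  *.gz)    FIXED="${DUMP%.gz}.nt.gz" ;;    # insert the ".nt" marker
esac
echo "$FIXED"
# mv "$DUMP" "$FIXED"   # uncomment once the real dump file is in place
```

Renaming the file before it is dropped into indexing/resources/rdfdata avoids the failed import-and-ignore cycle shown in the logs below.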
>>>>>>>>> >>>>>>>>> I would appreciate any help pertaining to this issue. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Rajan >>>>>>>>> >>>>>>>>> *Steps followed:* >>>>>>>>> >>>>>>>>> *1. Initialization: * >>>>>>>>> java -jar >>>>>>> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar >>>>>>>>> init >>>>>>>>> >>>>>>>>> *2. Download the data:* >>>>>>>>> Download the data from >>>>>>> https://developers.google.com/freebase/data >>>>>>>>> >>>>>>>>> *3. Performed execution of fbrankings-uri.sh* >>>>>>>>> It generated incoming_links.txt under the resources directory as >>> follows >>>>>>>>> >>>>>>>>> 10888430 m.0kpv11 >>>>>>>>> 3741261 m.019h >>>>>>>>> 2667858 m.0775xx5 >>>>>>>>> 2667804 m.0775xvm >>>>>>>>> 1875352 m.01xryvm >>>>>>>>> 1739262 m.05zppz >>>>>>>>> 1369590 m.01xrzlb >>>>>>>>> >>>>>>>>> *4. Performed execution of fixit script* >>>>>>>>> >>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed} >>>>>>>>> >>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it * >>>>>>>>> to indexing/resources/rdfdata >>>>>>>>> >>>>>>>>> *6. config/iditer.properties file has the following settings* >>>>>>>>> #id-namespace=http://freebase.com/ >>>>>>>>> ns-prefix-state=false >>>>>>>>> >>>>>>>>> *7. Performed run of the following command:* >>>>>>>>> java -jar -Xmx32g >>>>>>>>> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar >>>>> index >>>>>>>>> >>>>>>>>> The error dump on stdout is as follows: >>>>>>>>> >>>>>>>>> 01:37:32,884 [Thread-0] INFO >>> solryard.SolrYardIndexingDestination - >>>>>>> ... 
>>>>>>>>> copy Solr Configuration form >>>>>>> /private/tmp/freebase/indexing/config/freebase >>>>>>>>> to >>>>> /private/tmp/freebase/indexing/destination/indexes/default/freebase >>>>>>>>> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - >>>>> bulk >>>>>>>>> loading File freebase.rdf.gz using Format Lang:RDF/XML >>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>> Start >>>>>>>>> triples data phase >>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>> Load >>>>>>> empty >>>>>>>>> triples table >>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] >>>>> Element or >>>>>>>>> attribute do not match QName production: >>> QName::=(NCName':')?NCName.* >>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>> Finish >>>>>>>>> triples data phase >>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>> Finish >>>>>>>>> triples load >>>>>>>>> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error >>>>> for >>>>>>> File >>>>>>>>> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz >>> and >>>>>>>>> continue >>>>>>>>> >>>>>>>>> Additional Reference Point: >>>>>>>>> >>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 >>>>>>>>> freebase-rdf-latest.gz >>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 >>>>>>>>> freebase-rdf-latest-fixed.gz >>>>>>>>> *Incoming Links size: *1206745360 May 17 00:42 incoming_links.txt >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> | Rupert Westenthaler rupert.westentha...@gmail.com >>>>>>> | Bodenlehenstraße 11 ++43-699-11108907 >>>>>>> | A-5500 Bischofshofen >>>>>>> | REDLINK.CO >>> .......................................................................... >>>>>>> | http://redlink.co/
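For quick reference, the recovery path discussed at the top of the thread (re-running only the finalisation after an OutOfMemory during the Solr optimize, per STANBOL-1047) boils down to one command. The jar name and heap size below are the ones mentioned in this thread; the working directory containing ./indexing is an assumption:

```shell
# Sketch: rerun only the finalisation steps (Solr optimize +
# freebase.solrindex.zip creation) instead of repeating the whole run.
JAR=org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
HEAP=48g
CMD="java -Xmx$HEAP -jar $JAR finalise"
echo "$CMD"
# eval "$CMD"   # run from the directory that contains ./indexing
```

Because `finalise` skips the entity-indexing phase, the several-days import into Jena TDB is not repeated.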