Re: Entityhub indexing for Freebase data

Rupert Westenthaler Thu, 28 May 2015 02:44:28 -0700

Hi,

Please have a look at the Stanbol log file (./stanbol/log/error.log).
The schema.xml of the Freebase indexing tool uses Solr analyzers that
are not included in all Stanbol launchers. If any of them are missing,
you will see the corresponding exceptions in the log.
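
One way to scan the log for such exceptions is sketched below. The function
name `find_log_exceptions` and the search pattern are illustrative
assumptions, not part of Stanbol; the actual exception class names depend on
which analyzer is missing.

```shell
#!/bin/sh
# Print the lines of a log file that mention an exception or error class,
# prefixed with their line numbers.
# Assumption: missing Solr analyzers surface as Java exceptions or errors
# whose class name ends in "Exception" or "Error"; adjust the pattern as needed.
find_log_exceptions() {
    log_file="$1"
    if [ ! -f "$log_file" ]; then
        echo "log file not found: $log_file" >&2
        return 1
    fi
    grep -n -E '[A-Za-z]+(Exception|Error)' "$log_file"
}

# Example (path taken from the message above):
#   find_log_exceptions ./stanbol/log/error.log | tail -n 20
```

Piping the result through `tail` keeps the output to the most recent entries,
which are usually the ones written during the last installation attempt.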


Installation extracts the index from the archive and copies it to
./stanbol/indexes, so depending on the size of the index the
installation may take some time.
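
To see whether the extraction is still running, you could watch the size of
the extracted index. A minimal sketch, assuming each index ends up in its own
subdirectory of ./stanbol/indexes (the layout below that folder and the
function name `index_sizes` are assumptions):

```shell
#!/bin/sh
# Print the disk usage (in kilobytes) of every subdirectory of an index folder.
# Assumption: each installed index is extracted into its own subdirectory.
index_sizes() {
    index_root="$1"
    if [ ! -d "$index_root" ]; then
        echo "index folder not found: $index_root" >&2
        return 1
    fi
    for dir in "$index_root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        du -sk "$dir"
    done
}

# Example (path taken from the message above):
#   index_sizes ./stanbol/indexes
```

Running it twice a few minutes apart shows whether the size is still changing.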

best
Rupert


On Wed, May 27, 2015 at 3:01 PM, Rajan Shah <[email protected]> wrote:
> Hi Rupert,
>
> Finally, I got the freebase index after 2 days run. For english language
> only, the size is roughly 28G.
>
> Surprisingly, after I installed it via OSGI console it created Referenced
> Site and Solr Yard. However, it's not visible within entityhub sites. I did
> configure following parameters within SolrYard
>
> a. "Allow Initialization" - checked
> b. Index configuration: freebase.solrindex.zip
>
> I also re-started couple times but no luck.
>
> Does it require any additional special configuration? i.e. do I need to
> have higher -Xmx parameter setting or something else
>
> With best regards,
> Rajan
>
> On Tue, May 26, 2015 at 9:06 AM, <[email protected]> wrote:
>
>> Hi,
>>
>> Accidentally, I wiped out logs for a clean start. At the same time, I am
>> planning to run on a higher end AWS instance as well, so will keep you
>> posted.
>>
>> Thanks again for your continuous help.
>>
>> With best regards,
>> Rajan
>>
>> > On May 26, 2015, at 8:47 AM, Rupert Westenthaler <
>> [email protected]> wrote:
>> >
>> > Hi
>> >
>> >> On Tue, May 26, 2015 at 2:13 PM,  <[email protected]> wrote:
>> >> Hi Rupert,
>> >>
>> >> After last failure, I am only using language=en and it still fails.
>> >
>> > Can you provide some lines of the log from just before the OOM? I would
>> > like to be sure that it really happens during the Solr optimization phase.
>> >
>> >> Thanks for the timely answer. Just to confirm: I restarted the index
>> >> command this morning with a higher -Xmx option. Is it too late to run
>> >> finalise?
>> >
>> > If the OOM exception really happened during the Solr optimization, calling
>> >
>> >   java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
>> >
>> > will use the data of the previous indexing call and just repeat the
>> > finalization steps.
>> >
>> > best
>> > Rupert
>> >
>> >
>> >> With best regards,
>> >> Rajan
>> >>
>> >>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <
>> [email protected]> wrote:
>> >>>
>> >>> Hi Rajan
>> >>>
>> >>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <[email protected]>
>> wrote:
>> >>>> Hi Rupert,
>> >>>>
>> >>>> Thanks for the reply.
>> >>>>
>> >>>> As per your suggestion, I made the necessary changes; however, it failed
>> >>>> with "OutOfMemory" errors. At present, I am running with -Xmx48g, but at
>> >>>> this point it is a trial-and-error approach with several days of effort
>> >>>> being wasted.
>> >>>
>> >>> I guess you are getting the OutOfMemory error while optimizing the Solr
>> >>> Index (right?). The README [1] explicitly notes that a large amount of
>> >>> memory is needed by exactly this step of the indexing process.
>> >>>
>> >>> If the indexing fails at this step you can call the indexing tool with
>> >>> the `finalise` command (instead of `index`) (see STANBOL-1047 [2]
>> >>> for details). This prevents the indexing from being repeated and only
>> >>> executes the finalization steps (optimizing the Solr Index and creating
>> >>> the freebase.solrindex.zip file).
>> >>>
>> >>>
>> >>>> I am just throwing out an idea, but wanted to see
>> >>>>
>> >>>> a. Is it possible to publish a set of constraints and required parameters,
>> >>>> i.e. with a minimal set of entities within mappings.txt, one needs to set
>> >>>> these parameters?
>> >>>
>> >>> I do not understand this question. Do you want to filter entities
>> >>> based on their information? If so you might want to have a look at the
>> >>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
>> >>> The generic RDF indexing tool has an example of how to use this
>> >>> processor to filter entities based on their rdf:type values.
>> >>>
>> >>> See also the "Entity Filters" section of [3]
>> >>>
>> >>>>
>> >>>> b. Is it possible to split the file based on subject? generate smaller
>> >>>> index for each subject and merge afterwards?
>> >>>
>> >>> Yes. You can split up the dump (by subject) and import those parts in
>> >>> different Indexing Tool instances (meaning different Jena TDB
>> >>> instances). Importing 4*500 million triples to Jena TDB is supposed to
>> >>> be much faster than 1*2 billion.
>> >>>
>> >>> If you still want to have all data in a single Entityhub Site you need
>> >>> to script the indexing process.
>> >>>
>> >>> * call indexing for the first part
>> >>> * after this finishes link the {part1}/indexing/destination/indexes
>> >>> folder to {part2..n}/indexing/destination/indexes
>> >>> * call indexing for the 2..n parts.
>> >>>
>> >>> As the indexing tool only adds additional information to the Solr
>> >>> Index you will get the union over all parts at the end of the process.
>> >>> All parts need to use the full incoming_links.txt file because
>> >>> otherwise the rankings would not be correct.
>> >>>
>> >>> The "Indexing Datasets separately" section of [3] describes a similar
>> >>> trick of creating a union index over multiple datasets.
>> >>>
>> >>>
>> >>> best
>> >>> Rupert
>> >>>
>> >>>> c. Work with BaseKB guys to also make it available at nominal charge?
>> >>>>
>> >>>> d. Maybe apply some Map/Reduce - extension of idea b
>> >>>>
>> >>>> With best regards,
>> >>>> Rajan
>> >>>
>> >>>
>> >>>
>> >>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
>> >>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
>> >>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
>> >>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <
>> >>>> [email protected]> wrote:
>> >>>>
>> >>>>> Hi Rajan,
>> >>>>>
>> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
>> >>>>>
>> >>>>> You have not indexed a single entity, so something in your indexing
>> >>>>> configuration is wrong. Most likely you are not correctly building the
>> >>>>> URIs of the entities from the incoming_links.txt file. Can you provide
>> >>>>> me an example line of the 'incoming_links.txt' file and the contents
>> >>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
>> >>>>> built.
>> >>>>>
>> >>>>> Short answers to the other questions
>> >>>>>
>> >>>>>
>> >>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <[email protected]>
>> wrote:
>> >>>>>> it ran for almost 3 days and generated index.
>> >>>>>
>> >>>>> That's good. It means you now have the Freebase dump in your Jena
>> >>>>> TDB triple store. You will not need to repeat this (until you want to
>> >>>>> use a newer dump). On the next call, the indexing tool will
>> >>>>> immediately start with the indexing step.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>> Couple questions come to mind:
>> >>>>>>
>> >>>>>> a. Is there any particular log/error file the process generates besides
>> >>>>>> printing out on stdout/stderr?
>> >>>>>
>> >>>>> The indexer writes a zip archive with the IDs of all the indexed
>> >>>>> entities. It is in the indexing/destination folder.
>> >>>>>
>> >>>>>> b. Is it a must-have to have stanbol full launcher running all the time
>> >>>>>> while indexing is going on?
>> >>>>>
>> >>>>> No Stanbol instance is needed by the indexing process.
>> >>>>>
>> >>>>>> c. Is it possible that, if the machine is not connected to internet for
>> >>>>>> couple minutes could cause some issues?
>> >>>>>
>> >>>>> No Internet connectivity is needed during indexing. Only if you want
>> >>>>> to use the namespace prefix mappings of prefix.cc you need to have
>> >>>>> internet connectivity when starting the indexing tool.
>> >>>>>
>> >>>>> best
>> >>>>> Rupert
>> >>>>>
>> >>>>>>
>> >>>>>> I would really appreciate, if you can shed some light on "what could be
>> >>>>>> wrong" or "potential approach to nail down this issue"? If you need, I am
>> >>>>>> happy to share any additional logs/properties.
>> >>>>>>
>> >>>>>> With best regards,
>> >>>>>> Rajan
>> >>>>>>
>> >>>>>> *1. Configuration changes*
>> >>>>>>
>> >>>>>> a. set ns-prefix-state=false
>> >>>>>> *[within /indexing/config/iditerator.properties]*
>> >>>>>> b. add empty space mapping to   http://rdf.freebase.com/ns/
>> >>>>>> *[within namespaceprefix.mappings]*
>> >>>>>> c. enable bunch of properties within mappings.txt such as following
>> >>>>>>
>> >>>>>> fb:music.artist.genre
>> >>>>>> fb:music.artist.label
>> >>>>>> fb:music.artist.album
>> >>>>>>
>> >>>>>> *2. Contents of indexing/dist directory*
>> >>>>>>
>> >>>>>> -rw-r--r--  108899 May 22 05:11 freebase.solrindex.zip
>> >>>>>> -rw-r--r--  3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
>> >>>>>>
>> >>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
>> >>>>>>
>> >>>>>> -rw-r--r--  1 31026810858 May 20 07:32 freebase.nt.gz
>> >>>>>>
>> >>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
>> >>>>>>
>> >>>>>> -rw-r--r--   1 1206745360 May 19 09:38 incoming_links.txt
>> >>>>>>
>> >>>>>> *5. The indexer log*
>> >>>>>>
>> >>>>>> *04:31:57,236 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
>> >>>>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
>> >>>>>> *04:32:00,727 [Thread-3] INFO  jenatdb.RdfResourceImporter -
>> Filtered:
>> >>>>>> 2429800000 triples (80.97554853864854%)*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
>> Finish
>> >>>>>> triples data phase*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
>> Data:
>> >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
>> >>>>>> second]*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
>> Start
>> >>>>>> triples index phase*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
>> Finish
>> >>>>>> triples index phase*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
>> Finish
>> >>>>>> triples load*
>> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
>> >>>>> Completed:
>> >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
>> >>>>>> second]*
>> >>>>>> 04:32:56,880 [Thread-3] INFO  source.ResourceLoader -    ... moving
>> >>>>>> imported file freebase.nt.gz to imported/freebase.nt.gz
>> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -    -
>> completed in
>> >>>>>> 157675 seconds
>> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -  > loading
>> >>>>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
>> >>>>>> 04:32:56,944 [Thread-3] WARN  jenatdb.RdfResourceImporter - ignore
>> File
>> >>>>> {}
>> >>>>>> because of unknown extension
>> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -    -
>> completed in 0
>> >>>>>> seconds
>> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 2 files
>> >>>>> imported
>> >>>>>> in 157675 seconds
>> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader - Loding 0 File
>> ...
>> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 0 files
>> >>>>> imported
>> >>>>>> in 0 seconds
>> >>>>>> 04:32:56,971 [main] INFO  impl.IndexerImpl -  ... delete existing
>> >>>>>> IndexedEntityId file
>> >>>>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
>> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Initialisation
>> completed
>> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl -   ... initialisation
>> >>>>> completed
>> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - start indexing ...
>> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Indexing started ...
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace '
>> >>>>>> http://prefix.cc/nsogi:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,076 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'category' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace '
>> >>>>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace '
>> >>>>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace '
>> >>>>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'call' valid , namespace '
>> >>>>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace '
>> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
>> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace '
>> >>>>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping
>> ignored!
>> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace '
>> >>>>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping
>> ignored!
>> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
>> impl.NamespacePrefixProviderImpl -
>> >>>>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace '
>> >>>>>> http://www.kinjal.com/condition:' invalid -> mapping ignored!
>> >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
>> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> -  > current sequence : 0
>> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> -  > new sequence: 1
>> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> - Send end-of-queue to Deamons with Sequence 1
>> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
>> impl.IndexerImpl -
>> >>>>>> Indexing: Entity Processor Deamon completed (sequence=1) ...
>> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
>> impl.IndexerImpl -
>> >>>>>>> current sequence : 1
>> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
>> impl.IndexerImpl -
>> >>>>>>> new sequence: 2
>> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
>> impl.IndexerImpl -
>> >>>>>> Send end-of-queue to Deamons with Sequence 2
>> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
>> >>>>> impl.IndexerImpl -
>> >>>>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ...
>> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
>> >>>>> impl.IndexerImpl -
>> >>>>>>> current sequence : 2
>> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
>> >>>>> impl.IndexerImpl -
>> >>>>>>> new sequence: 3
>> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
>> >>>>> impl.IndexerImpl -
>> >>>>>> Send end-of-queue to Deamons with Sequence 3
>> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
>> >>>>>> processing:  -1.000ms/item | queue:  -1.000ms*
>> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl -   - source   :  -1.000ms/item
>> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl -   - processing:  -1.000ms/item
>> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl -   - store     :  -1.000ms/item
>> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed
>> >>>>>> (sequence=3) ...
>> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl -  > current sequence : 3
>> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl -  > new sequence: 4
>> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
>> >>>>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
>> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
>> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
>> >>>>> impl.IndexerImpl
>> >>>>>> -  > current sequence : 4
>> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... indexing
>> completed
>> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - start post-processing
>> ...
>> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - PostProcessing started
>> ...
>> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... post-processing
>> >>>>> finished
>> >>>>>> ...
>> >>>>>> 05:11:41,911 [main] INFO  impl.IndexerImpl - start finalisation....
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <
>> >>>>>> [email protected]> wrote:
>> >>>>>>
>> >>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <[email protected]>
>> wrote:
>> >>>>>>>> Hi Rupert and Antonio,
>> >>>>>>>>
>> >>>>>>>> Thanks a lot for the reply.
>> >>>>>>>>
>> >>>>>>>> I start to follow Rupert's suggestion, however it failed again at
>> >>>>>>>>
>> >>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal
>> >>>>>>>> escape sequence value: $ (0x24) -- Is there any way it can be resolved for
>> >>>>>>>> the entire file?
>> >>>>>>>
>> >>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors.
>> >>>>>>> So the Jena mailing lists would be the better place to look for
>> >>>>>>> answers.
>> >>>>>>> This specific issue looks like an invalid URI that is not fixed by the
>> >>>>>>> fixit script.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to be
>> >>>>>>>> open.
>> >>>>>>>>
>> >>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/
>> >>>>>>>> --add-header="x-amz-request-payer: requester"
>> >>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> *Couple additional questions:*
>> >>>>>>>>
>> >>>>>>>> *1. indexing enhancements:*
>> >>>>>>>> What settings/properties one can tweak to gain most out of the indexing.
>> >>>>>>>
>> >>>>>>> In general you only want the information needed for your application
>> >>>>>>> case in the index.
>> >>>>>>> For EntityLinking only labels and type are required.
>> >>>>>>> Additional properties will only be used for dereferencing Entities. So
>> >>>>>>> this will depend on your application needs (your dereferencing
>> >>>>>>> configuration).
>> >>>>>>>
>> >>>>>>> In general I try to exclude as much information as possible from the
>> >>>>>>> index to keep the size of the Solr Index as small as possible.
>> >>>>>>>
>> >>>>>>>> a. for ex. domain specific such as Pharmaceutical, Law etc... within
>> >>>>>>>> freebase
>> >>>>>>>> b. potential optimizations to speed up the overall indexing
>> >>>>>>>
>> >>>>>>> Most of the time will be needed to load the Freebase dump into Jena
>> >>>>>>> TDB. Even with an SSD-equipped server this will take several days.
>> >>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
>> >>>>>>> more things in RAM.
>> >>>>>>>
>> >>>>>>> Usually it is a good idea to cancel the indexing process after the
>> >>>>>>> importing of the RDF data has finished (and the indexing of the
>> >>>>>>> Entities has started). This is because after importing all the RAM will
>> >>>>>>> be used by Jena TDB for caching stuff that is no longer needed in the
>> >>>>>>> read-only operations during indexing. So a fresh start can speed up
>> >>>>>>> the indexing part of the process.
>> >>>>>>>
>> >>>>>>> Also have a look at the Freebase Indexing Tool Readme
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> *2. demo:*
>> >>>>>>>> I see that, in recent github commit(s) the eHealth and other demos have
>> >>>>>>>> been commented out. How can I get demo source code and other components
>> >>>>>>>> for these demos. I prefer to build it myself to see the power of stanbol.
>> >>>>>>>
>> >>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
>> >>>>>>> compatible with the trunk version.
>> >>>>>>>
>> >>>>>>>> *3. custom vocabulary:*
>> >>>>>>>> Suppose, I have custom vocabulary in CSV format. Is there a preferred way
>> >>>>>>>> to upload it to Stanbol and have it recognize my entities?
>> >>>>>>>
>> >>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
>> >>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
>> >>>>>>> but AFAIK this combination is not so stable and might not work at all.
>> >>>>>>>
>> >>>>>>> * Google Refine allows you to import your CSV file.
>> >>>>>>> * Clean it up (if necessary)
>> >>>>>>> * The RDF extension allows you to map your CSV data to RDF
>> >>>>>>> * based on this mapping you can save your data as RDF
>> >>>>>>> * after that you can import the RDF data to Apache Stanbol
>> >>>>>>>
>> >>>>>>> hope this helps
>> >>>>>>> best
>> >>>>>>> Rupert
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> Thanks in advance,
>> >>>>>>>> Rajan
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
>> >>>>>>> [2] https://code.google.com/p/google-refine/
>> >>>>>>> [3] http://refine.deri.ie/
>> >>>>>>> [4] http://openrefine.org/
>> >>>>>>>
>> >>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <
>> >>>>>>>> [email protected]> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hi Rajan,
>> >>>>>>>>>
>> >>>>>>>>> I think this is because you named your file
>> >>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
>> >>>>>>>>> is not provided by the file extension. Renaming the file to
>> >>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
>> >>>>>>>>>
>> >>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
>> >>>>>>>>>
>> >>>>>>>>> best
>> >>>>>>>>> Rupert
>> >>>>>>>>>
>> >>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
>> >>>>>>>>> <[email protected]> wrote:
>> >>>>>>>>>> Hi Rajan
>> >>>>>>>>>>
>> >>>>>>>>>> The Freebase dump contains some things that do not fit very well with the
>> >>>>>>>>>> indexer.
>> >>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com), which
>> >>>>>>>>>> is a curated Freebase dump.
>> >>>>>>>>>> I did not have any problem indexing it using that dump.
>> >>>>>>>>>>
>> >>>>>>>>>> Regards
>> >>>>>>>>>>
>> >>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <[email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hi,
>> >>>>>>>>>>>
>> >>>>>>>>>>> I am working on indexing Freebase data within EntityHub and observed
>> >>>>>>>>>>> the following issue:
>> >>>>>>>>>>>
>> >>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I would appreciate any help pertaining to this issue.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>> Rajan
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Steps followed:*
>> >>>>>>>>>>>
>> >>>>>>>>>>> *1. Initialization: *
>> >>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
>> >>>>>>>>>>>
>> >>>>>>>>>>> *2. Download the data:*
>> >>>>>>>>>>> Download data from https://developers.google.com/freebase/data
>> >>>>>>>>>>>
>> >>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
>> >>>>>>>>>>> It generated incoming_links.txt under resources directory as follows
>> >>>>>>>>>>>
>> >>>>>>>>>>> 10888430 m.0kpv11
>> >>>>>>>>>>> 3741261 m.019h
>> >>>>>>>>>>> 2667858 m.0775xx5
>> >>>>>>>>>>> 2667804 m.0775xvm
>> >>>>>>>>>>> 1875352 m.01xryvm
>> >>>>>>>>>>> 1739262 m.05zppz
>> >>>>>>>>>>> 1369590 m.01xrzlb
>> >>>>>>>>>>>
>> >>>>>>>>>>> *4. Performed execution of fixit script*
>> >>>>>>>>>>>
>> >>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
>> >>>>>>>>>>>
>> >>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it *
>> >>>>>>>>>>> to indexing/resources/rdfdata
>> >>>>>>>>>>>
>> >>>>>>>>>>> *6. config/iditer.properties file has following setting*
>> >>>>>>>>>>> #id-namespace=http://freebase.com/
>> >>>>>>>>>>> ns-prefix-state=false
>> >>>>>>>>>>>
>> >>>>>>>>>>> *7. Performed run of following command:*
>> >>>>>>>>>>> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
>> >>>>>>>>>>>
>> >>>>>>>>>>> The error dump on stdout is as follows:
>> >>>>>>>>>>>
>> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO  solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
>> >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO  jenatdb.RdfResourceImporter -  - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
>> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Start triples data phase
>> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - ** Load empty triples table
>> >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
>> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples data phase
>> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - -- Finish triples load
>> >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO  source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
>> >>>>>>>>>>>
>> >>>>>>>>>>> Additional Reference Point:
>> >>>>>>>>>>>
>> >>>>>>>>>>> *Original Freebase dump size:*  31025015397 May 14 18:10 freebase-rdf-latest.gz
>> >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
>> >>>>>>>>>>> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> | Rupert Westenthaler             [email protected]
>> >>>>>>>>> | Bodenlehenstraße 11                              ++43-699-11108907
>> >>>>>>>>> | A-5500 Bischofshofen
>> >>>>>>>>> | REDLINK.CO
>> >>>>>>>>> ..........................................................................
>> >>>>>>>>> | http://redlink.co/
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>> >>>
>> >
>> >
>> >
>>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO 
..........................................................................
| http://redlink.co/
