Re: Entityhub indexing for Freebase data

Rajan Shah Wed, 27 May 2015 06:01:55 -0700

Hi Rupert,

I finally got the Freebase index after a two-day run. For English only,
the size is roughly 28 GB.


Surprisingly, after I installed it via the OSGi console, it created the
Referenced Site and the SolrYard. However, the site is not visible among the
Entityhub sites. I configured the following parameters for the SolrYard:

a. "Allow Initialization" - checked
b. Index configuration: freebase.solrindex.zip

I also restarted a couple of times, but no luck.
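
A quick check of whether the site is registered at all could look roughly like the sketch below. It assumes the launcher listens on http://localhost:8080, that the Entityhub REST endpoint /entityhub/sites/referenced is reachable there, and that the site ID is "freebase"; all three are assumptions to adjust for the actual setup.

```shell
#!/bin/sh
# Assumption: base URL of the running Stanbol launcher.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080}"

# Print the list of referenced sites reported by the Entityhub.
list_referenced_sites() {
    curl -s -H "Accept: application/json" \
        "${STANBOL_URL}/entityhub/sites/referenced"
}

# Succeed if the given site ID shows up in that list.
# This is a loose substring match, good enough for a quick sanity check.
site_is_registered() {
    list_referenced_sites | grep -q -- "$1"
}
```

Usage: `site_is_registered freebase && echo "site is registered"`. If the ID is missing from the list, the problem is more likely in the site or yard configuration than in the heap size.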

Does it require any additional configuration? For example, do I need a
higher -Xmx setting, or something else?

With best regards,
Rajan

On Tue, May 26, 2015 at 9:06 AM, <[email protected]> wrote:

> Hi,
>
> I accidentally wiped the logs for a clean start. I am also planning to run
> on a higher-end AWS instance, so I will keep you posted.
>
> Thanks again for your continuous help.
>
> With best regards,
> Rajan
>
> Sent from my iPhone
>
> > On May 26, 2015, at 8:47 AM, Rupert Westenthaler <
> [email protected]> wrote:
> >
> > Hi
> >
> >> On Tue, May 26, 2015 at 2:13 PM,  <[email protected]> wrote:
> >> Hi Rupert,
> >>
> >> After the last failure, I am only using language=en and it still fails.
> >
> > Can you provide some lines of logging from just before the OOM? I would like
> > to be sure that it really happens during the Solr optimization phase.
> >
> >> Thanks for the timely answer. Just to confirm: if I restarted the index
> >> command this morning with a higher -Xmx option, is it too late to run
> >> finalise?
> >
> > If the OOM exception really happened during the Solr optimization, calling
> >
> >   java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
> >
> > will use the data of the previous indexing call and just repeat the
> > finalization steps.
> >
> > best
> > Rupert
> >
> >
> >> With best regards,
> >> Rajan
> >>
> >> Sent from my iPhone
> >>
> >>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <
> [email protected]> wrote:
> >>>
> >>> Hi Rajan
> >>>
> >>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <[email protected]>
> wrote:
> >>>> Hi Rupert,
> >>>>
> >>>> Thanks for the reply.
> >>>>
> >>>> As per your suggestion, I made the necessary changes; however, it failed
> >>>> with "OutOfMemory" errors. At present, I am running with -Xmx48g, but at
> >>>> this point it's a trial-and-error approach, with several days of effort
> >>>> being wasted.
> >>>
> >>> I guess you are getting the OutOfMemory while optimizing the Solr
> >>> Index (right?). The README [1] explicitly notes that a high amount of
> >>> memory is needed by exactly this step of the indexing process.
> >>>
> >>> If the indexing fails at this step you can call the indexing tool with
> >>> the `finalise` command instead of `index` (see STANBOL-1047 [2] for
> >>> details). This prevents the indexing from being repeated and only
> >>> executes the finalization steps (optimizing the Solr Index and creating
> >>> the freebase.solrindex.zip file).
> >>>
> >>>
> >>>> I am just throwing out an idea, but wanted to see
> >>>>
> >>>> a. Is it possible to publish a set of constraints and required
> >>>> parameters, i.e. with a minimal set of entities within mappings.txt, one
> >>>> needs to set these parameters?
> >>>
> >>> I do not understand this question. Do you want to filter entities
> >>> based on their information? If so you might want to have a look at the
> >>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
> >>> The generic RDF indexing tool has an example of how to use this
> >>> processor to filter entities based on their rdf:type values.
> >>>
> >>> See also the "Entity Filters" section of [3]
> >>>
> >>>>
> >>>> b. Is it possible to split the file based on subject? generate smaller
> >>>> index for each subject and merge afterwards?
> >>>
> >>> Yes. You can split up the dump (by subject) and import those parts in
> >>> different Indexing Tool instances (meaning different Jena TDB
> >>> instances). Importing 4 * 500 million triples into Jena TDB is supposed
> >>> to be much faster than 1 * 2 billion.
> >>>
> >>> If you still want to have all data in a single Entityhub Site you need
> >>> to script the indexing process.
> >>>
> >>> * call indexing for the first part
> >>> * after this finishes link the {part1}/indexing/destination/indexes
> >>> folder to {part2..n}/indexing/destination/indexes
> >>> * call indexing for the 2..n parts.
> >>>
> >>> As the indexing tool only adds additional information to the Solr
> >>> Index you will get the union over all parts at the end of the process.
> >>> All parts need to use the full incoming_links.txt file because
> >>> otherwise the rankings would not be correct.
> >>>
> >>> The "Indexing Datasets separately" section of [3] describes a similar
> >>> trick for creating a union index over multiple datasets.
> >>>
> >>>
> >>> best
> >>> Rupert
> >>>
> >>>> c. Work with BaseKB guys to also make it available at nominal charge?
> >>>>
> >>>> d. Maybe apply some Map/Reduce - extension of idea b
> >>>>
> >>>> With best regards,
> >>>> Rajan
> >>>
> >>>
> >>>
> >>> [1]
> http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
> >>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
> >>> [3]
> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
> >>>
> >>>>
> >>>>
> >>>>
> >>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> Hi Rajan,
> >>>>>
> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >>>>>
> >>>>> You have not indexed a single entity, so something in your indexing
> >>>>> configuration is wrong. Most likely you are not correctly building the
> >>>>> URIs of the entities from the incoming_links.txt file. Can you provide
> >>>>> me an example line of the 'incoming_links.txt' file and the contents
> >>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
> >>>>> built.
> >>>>>
> >>>>> Short answers to the other questions
> >>>>>
> >>>>>
> >>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <[email protected]>
> wrote:
> >>>>>> it ran for almost 3 days and generated index.
> >>>>>
> >>>>> That's good. It means you now have the Freebase dump in your Jena
> >>>>> TDB triple store. You will not need to repeat this (until you want to
> >>>>> use a newer dump). On the next call, the indexing tool will
> >>>>> immediately start with the indexing step.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Couple questions come to mind:
> >>>>>>
> >>>>>> a. Is there any particular log/error file the process generates besides
> >>>>>> printing out on stdout/stderr?
> >>>>>
> >>>>> The indexer writes a zip archive with the IDs of all the indexed
> >>>>> entities. It's in the indexing/destination folder.
> >>>>>
> >>>>>> b. Does the Stanbol full launcher have to be running all the time
> >>>>>> while indexing is going on?
> >>>>>
> >>>>> No Stanbol instance is needed by the indexing process.
> >>>>>
> >>>>>> c. Could it cause issues if the machine is disconnected from the
> >>>>>> internet for a couple of minutes?
> >>>>>
> >>>>> No Internet connectivity is needed during indexing. Only if you want
> >>>>> to use the namespace prefix mappings of prefix.cc you need to have
> >>>>> internet connectivity when starting the indexing tool.
> >>>>>
> >>>>> best
> >>>>> Rupert
> >>>>>
> >>>>>>
> >>>>>> I would really appreciate it if you could shed some light on "what could
> >>>>>> be wrong" or a "potential approach to nail down this issue". If needed, I
> >>>>>> am happy to share any additional logs/properties.
> >>>>>>
> >>>>>> With best regards,
> >>>>>> Rajan
> >>>>>>
> >>>>>> *1. Configuration changes*
> >>>>>>
> >>>>>> a. set ns-prefix-state=false*
> >>>>>> [within /indexing/config/iditerator.properties]*
> >>>>>> b. add empty space mapping to   http://rdf.freebase.com/ns/*
> >>>>>> [within namespaceprefix.mappings]*
> >>>>>> c. enable bunch of properties within mappings.txt such as following
> >>>>>>
> >>>>>> fb:music.artist.genre
> >>>>>> fb:music.artist.label
> >>>>>> fb:music.artist.album
> >>>>>>
> >>>>>> *2. Contents of indexing/dist directory*
> >>>>>>
> >>>>>> -rw-r--r--  108899 May 22 05:11 freebase.solrindex.zip
> >>>>>> -rw-r--r--  3457 May 22 05:11
> >>>>>> org.apache.stanbol.data.site.freebase-1.0.0.jar
> >>>>>>
> >>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >>>>>>
> >>>>>> -rw-r--r--  1 31026810858 May 20 07:32 freebase.nt.gz
> >>>>>>
> >>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
> >>>>>>
> >>>>>> -rw-r--r--   1 1206745360 May 19 09:38 incoming_links.txt
> >>>>>>
> >>>>>> *5. The indexer log*
> >>>>>>
> >>>>>> *04:31:57,236 [Thread-3] INFO  jenatdb.RdfResourceImporter - Add:
> >>>>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >>>>>> *04:32:00,727 [Thread-3] INFO  jenatdb.RdfResourceImporter -
> Filtered:
> >>>>>> 2429800000 triples (80.97554853864854%)*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> Finish
> >>>>>> triples data phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
> Data:
> >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
> >>>>>> second]*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> Start
> >>>>>> triples index phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> Finish
> >>>>>> triples index phase*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> Finish
> >>>>>> triples load*
> >>>>>> *04:32:01,157 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
> >>>>> Completed:
> >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per
> >>>>>> second]*
> >>>>>> 04:32:56,880 [Thread-3] INFO  source.ResourceLoader -    ... moving
> >>>>>> imported file freebase.nt.gz to imported/freebase.nt.gz
> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -    -
> completed in
> >>>>>> 157675 seconds
> >>>>>> 04:32:56,883 [Thread-3] INFO  source.ResourceLoader -  > loading
> >>>>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >>>>>> 04:32:56,944 [Thread-3] WARN  jenatdb.RdfResourceImporter - ignore
> File
> >>>>> {}
> >>>>>> because of unknown extension
> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -    -
> completed in 0
> >>>>>> seconds
> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 2 files
> >>>>> imported
> >>>>>> in 157675 seconds
> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader - Loding 0 File
> ...
> >>>>>> 04:32:56,958 [Thread-3] INFO  source.ResourceLoader -  ... 0 files
> >>>>> imported
> >>>>>> in 0 seconds
> >>>>>> 04:32:56,971 [main] INFO  impl.IndexerImpl -  ... delete existing
> >>>>>> IndexedEntityId file
> >>>>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Initialisation
> completed
> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl -   ... initialisation
> >>>>> completed
> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - start indexing ...
> >>>>>> 04:32:56,982 [main] INFO  impl.IndexerImpl - Indexing started ...
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace '
> >>>>>> http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >>>>>> 04:45:48,076 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'category' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace '
> >>>>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace '
> >>>>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace '
> >>>>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'call' valid , namespace '
> >>>>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace '
> >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace '
> >>>>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping
> ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace '
> >>>>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping
> ignored!
> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN
> impl.NamespacePrefixProviderImpl -
> >>>>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace '
> >>>>>> http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> -  > current sequence : 0
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> -  > new sequence: 1
> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> - Send end-of-queue to Deamons with Sequence 1
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >>>>>> Indexing: Entity Processor Deamon completed (sequence=1) ...
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >>>>>>> current sequence : 1
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >>>>>>> new sequence: 2
> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO
> impl.IndexerImpl -
> >>>>>> Send end-of-queue to Deamons with Sequence 2
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >>>>> impl.IndexerImpl -
> >>>>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >>>>> impl.IndexerImpl -
> >>>>>>> current sequence : 2
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >>>>> impl.IndexerImpl -
> >>>>>>> new sequence: 3
> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO
> >>>>> impl.IndexerImpl -
> >>>>>> Send end-of-queue to Deamons with Sequence 3
> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >>>>>> processing:  -1.000ms/item | queue:  -1.000ms*
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl -   - source   :  -1.000ms/item
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl -   - processing:  -1.000ms/item
> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl -   - store     :  -1.000ms/item
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed
> >>>>>> (sequence=3) ...
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl -  > current sequence : 3
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl -  > new sequence: 4
> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO
> >>>>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO
> >>>>> impl.IndexerImpl
> >>>>>> -  > current sequence : 4
> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... indexing
> completed
> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - start post-processing
> ...
> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl - PostProcessing started
> ...
> >>>>>> 05:11:41,910 [main] INFO  impl.IndexerImpl -   ... post-processing
> >>>>> finished
> >>>>>> ...
> >>>>>> 05:11:41,911 [main] INFO  impl.IndexerImpl - start finalisation....
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <[email protected]>
> wrote:
> >>>>>>>> Hi Rupert and Antonio,
> >>>>>>>>
> >>>>>>>> Thanks a lot for the reply.
> >>>>>>>>
> >>>>>>>> I start to follow Rupert's suggestion, however it failed again at
> >>>>>>>>
> >>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88]
> >>>>>>> illegal
> >>>>>>>> escape sequence value: $ (0x24) -- Is there anyway it can be
> resolved
> >>>>> for
> >>>>>>>> the entire file?
> >>>>>>>
> >>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors,
> >>>>>>> so the Jena mailing lists would be the better place to look for
> >>>>>>> answers.
> >>>>>>> This specific issue looks like an invalid URI that is not fixed by the
> >>>>>>> fixit script.
> >>>>>>>
> >>>>>>>
> >>>>>>>> I requested an access to latest BaseKB bucket, as it doesn't seem
> to
> >>>>> be
> >>>>>>>> open.
> >>>>>>>>
> >>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/
> >>>>>>>> --add-header="x-amz-request-payer: requester"
> >>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> *Couple additional questions:*
> >>>>>>>>
> >>>>>>>> *1. indexing enhancements:*
> >>>>>>>> What settings/properties one can tweak to gain most out of the
> >>>>> indexing.
> >>>>>>>
> >>>>>>> In general you only want the information needed for your application
> >>>>>>> case in the index.
> >>>>>>> For EntityLinking only labels and type are required.
> >>>>>>> Additional properties will only be used for dereferencing Entities, so
> >>>>>>> this will depend on your application needs (your dereferencing
> >>>>>>> configuration).
> >>>>>>>
> >>>>>>> In general I try to exclude as much information as possible from the
> >>>>>>> index to keep the size of the Solr Index as small as possible.
> >>>>>>>
> >>>>>>>> a. for ex. domain specific such as Pharmaceutical, Law etc...
> within
> >>>>>>>> freebase
> >>>>>>>> b. potential optimizations to speed up the overall indexing
> >>>>>>>
> >>>>>>> Most of the time will be needed to load the Freebase dump into Jena
> >>>>>>> TDB. Even with an SSD equipped Server this will take several days.
> >>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
> >>>>>>> more things in RAM.
> >>>>>>>
> >>>>>>> Usually it is a good idea to cancel the indexing process after the
> >>>>>>> importing of the RDF data has finished (and the indexing of the
> >>>>>>> Entities has started). This is because after importing, all the RAM will
> >>>>>>> be used by Jena TDB for caching data that is no longer needed in the
> >>>>>>> read-only operations during indexing. So a fresh start can speed up
> >>>>>>> the indexing part of the process.
> >>>>>>>
> >>>>>>> Also have a look at the Freebase Indexing Tool Readme
> >>>>>>>
> >>>>>>>>
> >>>>>>>> *2. demo:*
> >>>>>>>> I see that, in recent github commit(s) the eHealth and other demos
> >>>>> have
> >>>>>>>> been commented out. How can I get demo source code and other
> >>>>> components
> >>>>>>> for
> >>>>>>>> these demos. I prefer to build it myself to see the power of
> stanbol.
> >>>>>>>
> >>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >>>>>>> compatible to the trunk version.
> >>>>>>>
> >>>>>>>> *3. custom vocabulary:*
> >>>>>>>> Suppose, I have custom vocabulary in CSV format. Is there a
> preferred
> >>>>> way
> >>>>>>>> to upload it to Stanbol and have it recognize my entities?
> >>>>>>>
> >>>>>>> Google Refine[2] with the RDF extension [3]. You can also try to
> use
> >>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> >>>>>>> but AFAIK this combination is not so stable and might not work at
> all.
> >>>>>>>
> >>>>>>> * Google Refine allows you to import your CSV file.
> >>>>>>> * Clean it up (if necessary)
> >>>>>>> * The RDF extension allows you to map your CSV data to RDF
> >>>>>>> * based on this mapping you can save your data as RDF
> >>>>>>> * after that you can import the RDF data to Apache Stanbol
> >>>>>>>
> >>>>>>> hope this helps
> >>>>>>> best
> >>>>>>> Rupert
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks in advance,
> >>>>>>>> Rajan
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> [1]
> >>>>>
> http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >>>>>>> [2] https://code.google.com/p/google-refine/
> >>>>>>> [3] http://refine.deri.ie/
> >>>>>>> [4] http://openrefine.org/
> >>>>>>>
> >>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Rajan,
> >>>>>>>>>
> >>>>>>>>> I think this is because you named your file
> >>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >>>>>>>>> is not provided by the file extension. Renaming the file to
> >>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >>>>>>>>>
> >>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
> >>>>>>>>>
> >>>>>>>>> best
> >>>>>>>>> Rupert
> >>>>>>>>>
> >>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> Hi Rajan
> >>>>>>>>>>
> >>>>>>>>>> The Freebase dump contains some things that do not fit very well with
> >>>>>>>>>> the indexer.
> >>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
> >>>>>>>>>> which is a curated Freebase dump.
> >>>>>>>>>> I did not have any problems indexing that dump.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>>
> >>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I am working on indexing Freebase data within EntityHub and
> >>>>> observed
> >>>>>>>>>>> following issue:
> >>>>>>>>>>>
> >>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ]
> >>>>> Element
> >>>>>>> or
> >>>>>>>>>>> attribute do not match QName production:
> >>>>> QName::=(NCName':')?NCName.
> >>>>>>>>>>>
> >>>>>>>>>>> I would appreciate any help pertaining to this issue.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Rajan
> >>>>>>>>>>>
> >>>>>>>>>>> *Steps followed:*
> >>>>>>>>>>>
> >>>>>>>>>>> *1. Initialization: *
> >>>>>>>>>>> java -jar
> >>>>>>>>> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
> >>>>>>>>>>> init
> >>>>>>>>>>>
> >>>>>>>>>>> *2. Download the data:*
> >>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
> >>>>>>>>>>>
> >>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
> >>>>>>>>>>> It generated incoming_links.txt under resources directory as
> >>>>> follows
> >>>>>>>>>>>
> >>>>>>>>>>> 10888430 m.0kpv11
> >>>>>>>>>>> 3741261 m.019h
> >>>>>>>>>>> 2667858 m.0775xx5
> >>>>>>>>>>> 2667804 m.0775xvm
> >>>>>>>>>>> 1875352 m.01xryvm
> >>>>>>>>>>> 1739262 m.05zppz
> >>>>>>>>>>> 1369590 m.01xrzlb
> >>>>>>>>>>>
> >>>>>>>>>>> *4. Performed execution of fixit script*
> >>>>>>>>>>>
> >>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >>>>>>>>>>>
> >>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it *
> >>>>>>>>>>> to indexing/resources/rdfdata
> >>>>>>>>>>>
> >>>>>>>>>>> *6. config/iditer.properties file has following setting*
> >>>>>>>>>>> #id-namespace=http://freebase.com/
> >>>>>>>>>>> ns-prefix-state=false
> >>>>>>>>>>>
> >>>>>>>>>>> *7. Performed run of following command:*
> >>>>>>>>>>> java -jar -Xmx32g
> >>>>>>>>>>>
> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
> >>>>>>> index
> >>>>>>>>>>>
> >>>>>>>>>>> The error dump on stdout is as follows:
> >>>>>>>>>>>
> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO
> >>>>> solryard.SolrYardIndexingDestination -
> >>>>>>>>> ...
> >>>>>>>>>>> copy Solr Configuration form
> >>>>>>>>> /private/tmp/freebase/indexing/config/freebase
> >>>>>>>>>>> to
> >>>>>>> /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO  jenatdb.RdfResourceImporter -
>  -
> >>>>>>> bulk
> >>>>>>>>>>> loading File freebase.rdf.gz using Format Lang:RDF/XML
> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >>>>> Start
> >>>>>>>>>>> triples data phase
> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO  jenatdb.RdfResourceImporter - **
> >>>>> Load
> >>>>>>>>> empty
> >>>>>>>>>>> triples table
> >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ]
> >>>>>>> Element or
> >>>>>>>>>>> attribute do not match QName production:
> >>>>> QName::=(NCName':')?NCName.*
> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >>>>> Finish
> >>>>>>>>>>> triples data phase
> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO  jenatdb.RdfResourceImporter - --
> >>>>> Finish
> >>>>>>>>>>> triples load
> >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO  source.ResourceLoader - Ignore
> Error
> >>>>>>> for
> >>>>>>>>> File
> >>>>>>>>>>>
> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz
> >>>>> and
> >>>>>>>>>>> continue
> >>>>>>>>>>>
> >>>>>>>>>>> Additional Reference Point:
> >>>>>>>>>>>
> >>>>>>>>>>> *Original Freebase dump size:*  31025015397 May 14 18:10
> >>>>>>>>>>> freebase-rdf-latest.gz
> >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45
> >>>>>>>>>>> freebase-rdf-latest-fixed.gz
> >>>>>>>>>>> *Incoming Links size: *1206745360 May 17 00:42
> incoming_links.txt
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> | Rupert Westenthaler             [email protected]
> >>>>>>>>> | Bodenlehenstraße 11
> ++43-699-11108907
> >>>>>>>>> | A-5500 Bischofshofen
> >>>>>>>>> | REDLINK.CO
> >>>>>
> ..........................................................................
> >>>>>>>>> | http://redlink.co/
>
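
The scripted multi-part run Rupert outlines earlier in the thread (index the first part, link its indexes folder into the remaining parts, then index those) could be sketched roughly as follows. The directory layout, the jar location under /opt/stanbol, the DRY_RUN switch and the -Xmx32g value are hypothetical placeholders, not something the thread or the tool prescribes; each part directory is assumed to have been prepared with `init` and to contain its slice of the dump plus the full incoming_links.txt.

```shell
#!/bin/sh
# Sketch of the scripted multi-part run. Assumptions (not from the thread):
# every part directory is given as an absolute path and was prepared with
# `init`; JAR is an absolute path to the indexing tool (hypothetical location);
# the heap size is only a placeholder.
JAR="${JAR:-/opt/stanbol/org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar}"
XMX="${XMX:-32g}"

# Run the `index` command inside one part directory.
# With DRY_RUN set, only print what would be executed.
index_in() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "(cd $1 && java -Xmx$XMX -jar $JAR index)"
    else
        (cd "$1" && java -Xmx"$XMX" -jar "$JAR" index)
    fi
}

# Link the Solr indexes folder of the first part into another part, so that
# the next run adds to the same index. The link target must not exist yet.
link_indexes() {
    src="$1/indexing/destination/indexes"
    dst="$2/indexing/destination/indexes"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ln -s $src $dst"
    else
        ln -s "$src" "$dst"
    fi
}

# Index the first part, then link and index every remaining part in turn.
index_all_parts() {
    first="$1"
    shift
    index_in "$first" || return 1
    for part in "$@"; do
        link_indexes "$first" "$part" || return 1
        index_in "$part" || return 1
    done
}
```

Setting `DRY_RUN=1` and calling `index_all_parts /data/part1 /data/part2 /data/part3` only prints the planned commands, which is a cheap way to review the sequence before committing to runs that take days.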
