Dear Rupert,

Thanks for the quick reply, the detailed instructions, and the hard work behind this! I will try the guides later and get back to you with feedback. BTW, I am building a custom CMS (database) for music research and am very interested in integrating semantic technology. Happy New Year!
Sawyer Chen (From Beijing, China)

2012/12/31 Rupert Westenthaler <[email protected]>:
> Hi Sawyer,
>
> Yes, the implementation of STANBOL-855 is finished and processing of
> Chinese texts does work. But I'm still working on some parts, so I
> have not announced this on the mailing lists yet. In the following I
> will provide information for those who want to give it a try.
>
> Feedback very welcome!
>
> On Mon, Dec 31, 2012 at 3:42 AM, Sawyer Chen <[email protected]> wrote:
> > Dear all,
> >
> > I have seen that STANBOL-855 has been resolved. Does this mean that
> > basic Chinese support is now possible? Do I need to do any
> > configuration to enable this feature (Chinese support)?
>
> For enhancing Chinese text you need to do the following:
>
> 1. Include the bundles referenced by the smartcn bundlelist [1]. The
> best way is to add this bundlelist to your launcher configuration as
> explained in [2]. However, if you like, you can also manually install
> the three bundles referenced by the list.xml file [3].
>
> 2. Ensure that the Solr index is configured to use the smartcn
> analyzers for indexing Chinese text. The README.md file within the [1]
> directory provides details on that.
>
> 3. Configure the EnhancementChain to include the "smartcn-token"
> engine. In addition you should configure the 'opennlp-token' engine
> (search for "OpenNLP Tokenizer" in
> http://localhost:8080/system/console/configMgr) to ignore Chinese
> texts by adding "!zh" as an additional line of the "Language
> configuration" property.
>
> A typical EnhancementChain could look like:
>
>     tika;optional
>     langdetect
>     opennlp-sentence
>     opennlp-token
>     smartcn-token
>     opennlp-pos
>     opennlp-chunker
>     {entityhublinking}
>
> If you want to process only Chinese texts you can skip all the
> "opennlp-*" engines. The {entityhublinking} refers to an
> EntityhubLinkingEngine [4] configured for your vocabulary managed in
> an Entityhub Site.
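Once such a chain is configured, a quick way to check that Chinese text is actually being enhanced is to POST a sample directly to the enhancer REST endpoint. This is only a sketch: it assumes the default launcher on localhost:8080 and that the chain was registered under the name "chinese" — substitute your own host and chain name.

```shell
# Send a short Chinese sample to the enhancement chain and ask for the
# enhancement results as RDF/XML (Turtle or JSON-LD work via Accept too).
curl -X POST \
  -H "Content-Type: text/plain; charset=UTF-8" \
  -H "Accept: application/rdf+xml" \
  --data "北京是中华人民共和国的首都。" \
  "http://localhost:8080/enhancer/chain/chinese"
```

If the smartcn-token engine is active, the returned enhancement structure should contain token-based TextAnnotations for the Chinese input rather than treating the whole sentence as a single token.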
> You will just need to configure the name and the site; for the rest
> the default values should be fine.
>
> I would recommend using the "Weighted Chain" implementation for
> configuring this chain.
>
> BTW: I am also testing an alternative way of processing Chinese based
> on paoding [5]. But this framework is implemented in a way that makes
> it really hard to get running within an OSGi environment. So while I
> have found several resources claiming that paoding gives better
> results than smartcn, this might take some more time to get working.
>
> > Or do I need to download an additional dbpedia index including
> > Chinese info?
>
> Sorry, for now there is no Chinese dbpedia index available. I am still
> working on that part (e.g. just yesterday I fixed STANBOL-869, which
> was really hurting the indexing process for the Chinese dbpedia). If
> you want to try building your own Chinese dbpedia index, you should
> have a look at the utilities in [6]:
>
> 0. Copy the dbpedia indexing tool (see [7] for how to build it) to an
> indexing working directory and then initialize the default
> configuration by calling 'java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
> init'.
>
> 1. Call './entityrankings.sh zh' (part of [6]), as this will create a
> file with the incoming links for the Chinese dbpedia. You will need to
> rename and copy the resulting file to
> 'indexing/resources/incoming_links.txt'.
>
> 2. Adapt 'fetch_data_de.sh' for Chinese: basically keep all the
> English stuff and replace 'de' with 'zh'. Some Chinese files will be
> missing, because dbpedia lacks some information for the Chinese
> version; just exclude such files. Make sure to execute this script in
> the indexing workspace directory, because this ensures that the
> downloaded and pre-processed files are copied to the
> 'indexing/resources/rdfdata' directory.
>
> 3. Add the LDpath source processor to the Entityhub indexing tool
> configuration and configure it to use 'copy_en_values.ldpath'. This
> will ensure that knowledge present in the English dbpedia version is
> copied for those Chinese dbpedia entities that do define an
> interlanguage link to the English version. To add the LDpath source
> processor you will need to change the value of the "entityProcessor"
> parameter in the 'indexing/config/indexing.properties' file. The
> following value should be fine:
>
> entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:copy_en_values.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
>
> Make sure to copy the 'copy_en_values.ldpath' file into the
> 'indexing/config' directory.
>
> 4. Make sure to use the smartcn analyzers for indexing Chinese labels
> and comments. How to do this is explained in the "README.md" in [1],
> section "Usage with the EntityhubIndexing Tool".
>
> 5. Now you can start the indexing process by calling 'java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
> index'.
>
> The resulting index will require the smartcn bundlelist to be
> installed; otherwise, during initialization you will see an error in
> the log noting that the smartcn analyzers cannot be instantiated.
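Put together, the indexing steps above amount to a shell session roughly like the following. This is a sketch only: the working-directory name, the ranking file's output name, and the adapted script name 'fetch_data_zh.sh' are assumptions, and steps 3 and 4 are manual edits to the configuration rather than commands; consult [6] and [7] for the authoritative details.

```shell
# 0. set up an indexing working directory with the default configuration
mkdir dbpedia-zh-index && cd dbpedia-zh-index
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init

# 1. build the incoming-links ranking for the Chinese dbpedia, then rename
#    and copy the result (output file name here is an assumption)
./entityrankings.sh zh
cp zh_entityrankings.txt indexing/resources/incoming_links.txt

# 2. download and pre-process the Chinese dbpedia dumps
#    (fetch_data_de.sh adapted as described: 'de' replaced by 'zh')
./fetch_data_zh.sh

# 3. + 4. edit indexing/config/indexing.properties (entityProcessor value,
#    smartcn analyzers per the README.md) and copy the LDpath file in place
cp copy_en_values.ldpath indexing/config/

# 5. run the actual indexing
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
```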
>
> best
> Rupert Westenthaler
>
> [1] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/
> [2] http://stanbol.apache.org/production/your-launcher#dependencies-to-bundlelist
> [3] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/src/main/bundles/list.xml
> [4] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
> [5] http://code.google.com/p/paoding/
> [6] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/
> [7] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
