Hi Sawyer,

Yes, the implementation of STANBOL-855 is finished and the processing
of Chinese texts does work. But I'm still working on some parts, which
is why I have not announced this on the mailing lists yet. In the
following I will provide information for those who want to give it a
try.

Feedback very welcome!

On Mon, Dec 31, 2012 at 3:42 AM, Sawyer Chen <[email protected]> wrote:
> Dear all,
>
> I have seen that STANBOL-855 has been resolved and does it means basic
> chinese support now is possible? Do I need to do any configures to enable
> this feature(chinese support)?

To enhance Chinese texts you need to do the following:

1. Include the bundles referenced by the smartcn bundlelist [1]. The
easiest way is to add this bundlelist to your launcher configuration
as explained in [2]. However, if you prefer, you can also manually
install the three bundles referenced by the list.xml file [3].

2. Ensure that the Solr Index is configured to use the smartcn
analyzers for indexing Chinese text. The README.md file within the [1]
directory provides details on that.
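For illustration, a smartcn-based field type in a Solr schema.xml
typically looks something like the following. This is only a sketch:
the factory names are the ones shipped with the Lucene/Solr
analysis-extras module of that era, but please follow the README.md
in [1] for the exact configuration used by the Stanbol indexes.

```xml
<!-- sketch: smartcn analyzer chain for Chinese text fields -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>
```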

3. Configure the EnhancementChain to include the "smartcn-token"
engine. In addition you should configure the 'opennlp-token' engine
(search for "OpenNLP Tokenizer" in
http://localhost:8080/system/console/configMgr) to ignore Chinese
texts by adding "!zh" as an additional line of the "Language
configuration" property.
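As a rough illustration of what adding "!zh" does, here is a small
Python sketch of the exclusion semantics as I understand them ("*"
matches any language, "!zh" explicitly excludes Chinese, a plain "de"
explicitly includes German). This is my reading of the behaviour, not
Stanbol's actual implementation:

```python
def is_language_processed(config, lang):
    """Sketch of include/exclude matching for a language configuration.

    config -- list of entries such as "*", "de", "!zh"
    lang   -- two-letter language code of the text
    """
    if "!" + lang in config:       # an explicit exclusion always wins
        return False
    return lang in config or "*" in config

# With ["*", "!zh"] the engine processes everything except Chinese,
# leaving Chinese texts to the smartcn-token engine.
```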

A typical EnhancementChain could look like

    tika;optional
    langdetect
    opennlp-sentence
    opennlp-token
    smartcn-token
    opennlp-pos
    opennlp-chunker
    {entityhublinking}

If you only want to process Chinese texts you can skip all "opennlp-*"
engines. The {entityhublinking} placeholder refers to an
EntityhubLinkingEngine [4] configured for your vocabulary managed in
an Entityhub Site. You will just need to configure the engine name and
the site; for the rest the default values should be fine.
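Once the chain is configured you can send Chinese text to the enhancer
over its RESTful interface. Below is a minimal sketch using Python's
standard library; the endpoint path and headers are the usual Stanbol
defaults (a named chain would be served under /enhancer/chain/<name>),
so adapt them to your setup:

```python
import urllib.request

def build_enhance_request(text, base="http://localhost:8080"):
    """Build a POST request against the Stanbol enhancer endpoint.

    The default chain is served at /enhancer; the body is the plain
    text to enhance.
    """
    return urllib.request.Request(
        base + "/enhancer",
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain;charset=UTF-8",
            "Accept": "application/json",
        },
    )

req = build_enhance_request("北京是中国的首都。")
# urllib.request.urlopen(req) would return the enhancement results;
# it is not executed here so the sketch stays self-contained.
```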

I would recommend using the "Weighted Chain" implementation for
configuring this chain.


BTW: I am also testing an alternative way of processing Chinese based
on paoding [5]. However, this framework is implemented in a way that
makes it really hard to get running within an OSGi environment. So
while I have found several resources claiming that paoding gives
better results than smartcn, it might take some more time to get it
running.

> Or do I need to download additional dbpeida index including chinese info?

Sorry, for now there is no Chinese DBpedia index available. I am still
working on that part (e.g. just yesterday I fixed STANBOL-869, which
was really hurting the indexing process for the Chinese DBpedia). If
you want to try building your own Chinese DBpedia index you should
have a look at the utilities in [6].

0. Copy the DBpedia indexing tool (see [7] for how to build it) to an
indexing working directory and then initialize the default
configuration by calling

    java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
1. Call './entityrankings.sh zh' (part of [6]), as this will create a
file with the incoming links for the Chinese DBpedia. You will need to
rename and copy the resulting file to
'indexing/resources/incoming_links.txt'.
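For orientation, the ranking is essentially a count of incoming wiki
links per entity. The toy Python sketch below illustrates that idea on
N-Triples input; the real script works on the full DBpedia page-links
dump and produces the exact format the indexer expects, so use
entityrankings.sh for actual indexing:

```python
from collections import Counter

PAGE_LINK = "<http://dbpedia.org/ontology/wikiPageWikiLink>"

def count_incoming_links(ntriples_lines):
    """Count how often each entity appears as the *object* of a
    wikiPageWikiLink triple, i.e. its number of incoming links."""
    counts = Counter()
    for line in ntriples_lines:
        parts = line.strip().split(None, 3)  # subject predicate object .
        if len(parts) >= 3 and parts[1] == PAGE_LINK:
            counts[parts[2]] += 1
    return counts
```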
2. Adapt the 'fetch_data_de.sh' script for Chinese. Basically keep all
the English parts and replace 'de' with 'zh'. Some Chinese files will
be missing because DBpedia does not provide all datasets for the
Chinese version; just exclude such files. Make sure to execute this
script in the indexing workspace directory, because this ensures that
the downloaded and pre-processed files are copied to the
'indexing/resources/rdfdata' directory.
3. Add the LDpath source processor to the Entityhub Indexing Tool
configuration and configure it to use 'copy_en_values.ldpath'. This
ensures that knowledge present in the English DBpedia version is
copied for those Chinese DBpedia entities that define an interlanguage
link to the English version. To add the LDpath source processor you
will need to change the value of the "entityProcessor" parameter in
the 'indexing/config/indexing.properties' file. The following value
should be fine:

    entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:copy_en_values.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

Make sure to copy the 'copy_en_values.ldpath' file into the
'indexing/config' directory.

4. Make sure to use the smartcn analyzers for indexing Chinese labels
and comments. How to do this is explained in the "README.md" in [1],
section "Usage with the Entityhub Indexing Tool".

5. Now you can start the indexing process by calling

    java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index

The resulting index will require the smartcn bundlelist to be
installed. Otherwise you will see an error in the log during
initialization noting that the smartcn analyzers cannot be
instantiated.

best
Rupert Westenthaler


[1] 
http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/
[2] 
http://stanbol.apache.org/production/your-launcher#dependencies-to-bundlelist
[3] 
http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/src/main/bundles/list.xml
[4] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[5] http://code.google.com/p/paoding/
[6] 
http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/
[7] 
http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
