Dear Rupert,

Thanks for the quick reply, the detailed instructions, and the hard work behind this! I will try the guides later and get back to you with feedback. BTW, I am building a custom CMS (database) for music research and am very interested in integrating semantic technology. Happy New Year!
Sawyer Chen (From Beijing, China)

2012/12/31 Rupert Westenthaler <[email protected]>:
> Hi Sawyer,
>
> Yes, the implementation of STANBOL-855 is finished and processing of
> Chinese texts does work. But I'm still working on some parts, so I
> have not announced this on the mailing lists yet. In the following I
> will provide information for those who want to give it a try.
>
> Feedback very welcome!
>
> On Mon, Dec 31, 2012 at 3:42 AM, Sawyer Chen <[email protected]> wrote:
> > Dear all,
> >
> > I have seen that STANBOL-855 has been resolved. Does this mean that
> > basic Chinese support is now possible? Do I need to do any
> > configuration to enable this feature (Chinese support)?
>
> For enhancing Chinese text you need to do the following:
>
> 1. Include the bundles referenced by the smartcn bundlelist [1]. The
> best way is to add this bundlelist to your launcher configuration as
> explained in [2]. However, if you like, you can also manually install
> the three bundles referenced by the list.xml file [3].
>
> 2. Ensure that the Solr index is configured to use the smartcn
> analyzers for indexing Chinese text. The README.md file within the [1]
> directory provides details on that.
>
> 3. Configure the EnhancementChain to include the "smartcn-token"
> engine. In addition you should configure the 'opennlp-token' engine
> (search for "OpenNLP Tokenizer" in
> http://localhost:8080/system/console/configMgr) to ignore Chinese
> texts by adding "!zh" as an additional line of the "Language
> configuration" property.
>
> A typical EnhancementChain could look like:
>
>     tika;optional
>     langdetect
>     opennlp-sentence
>     opennlp-token
>     smartcn-token
>     opennlp-pos
>     opennlp-chunker
>     {entityhublinking}
>
> If you want to process only Chinese texts you can skip all the
> "opennlp-*" engines. The {entityhublinking} refers to an
> EntityhubLinkingEngine [4] configured for your vocabulary managed in
> an Entityhub Site.
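Once such a chain is configured, a quick way to check that Chinese text is actually being enhanced is to POST a sample directly to the enhancer REST endpoint. This is only a sketch: it assumes the default launcher on localhost:8080 and that the chain was registered under the name "chinese" — substitute your own host and chain name.

```shell
# Send a short Chinese sample to the enhancement chain and ask for the
# enhancement results as RDF/XML (Turtle or JSON-LD work via Accept too).
curl -X POST \
  -H "Content-Type: text/plain; charset=UTF-8" \
  -H "Accept: application/rdf+xml" \
  --data "北京是中华人民共和国的首都。" \
  "http://localhost:8080/enhancer/chain/chinese"
```

If the smartcn-token engine is active, the returned enhancement structure should contain token-based TextAnnotations for the Chinese input rather than treating the whole sentence as a single token.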
> You will just need to configure the name and the site; for the rest
> the default values should be fine.
>
> I would recommend using the "Weighted Chain" implementation for
> configuring this chain.
>
> BTW: I am also testing an alternative way of processing Chinese based
> on paoding [5]. But this framework is implemented in a way that makes
> it really hard to get running within an OSGi environment. So while I
> have found several resources claiming that paoding gives better
> results than smartcn, this might take some more time to get working.
>
> > Or do I need to download an additional dbpedia index including
> > Chinese info?
>
> Sorry, for now there is no Chinese dbpedia index available. I am still
> working on that part (e.g. just yesterday I fixed STANBOL-869, which
> was really hurting the indexing process for the Chinese dbpedia). If
> you want to try building your own Chinese dbpedia index, you should
> have a look at the utilities in [6]:
>
> 0. Copy the dbpedia indexing tool (see [7] for how to build it) to an
> indexing working directory and then initialize the default
> configuration by calling 'java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
> init'.
>
> 1. Call './entityrankings.sh zh' (part of [6]), as this will create a
> file with the incoming links for the Chinese dbpedia. You will need to
> rename and copy the resulting file to
> 'indexing/resources/incoming_links.txt'.
>
> 2. Adapt 'fetch_data_de.sh' for Chinese: basically keep all the
> English stuff and replace 'de' with 'zh'. Some Chinese files will be
> missing, because dbpedia lacks some information for the Chinese
> version; just exclude such files. Make sure to execute this script in
> the indexing workspace directory, because this ensures that the
> downloaded and pre-processed files are copied to the
> 'indexing/resources/rdfdata' directory.
>
> 3. Add the LDpath source processor to the Entityhub indexing tool
> configuration and configure it to use 'copy_en_values.ldpath'. This
> will ensure that knowledge present in the English dbpedia version is
> copied for those Chinese dbpedia entities that do define an
> interlanguage link to the English version. To add the LDpath source
> processor you will need to change the value of the "entityProcessor"
> parameter in the 'indexing/config/indexing.properties' file. The
> following value should be fine:
>
> entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:copy_en_values.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor
>
> Make sure to copy the 'copy_en_values.ldpath' file into the
> 'indexing/config' directory.
>
> 4. Make sure to use the smartcn analyzers for indexing Chinese labels
> and comments. How to do this is explained in the "README.md" in [1],
> section "Usage with the EntityhubIndexing Tool".
>
> 5. Now you can start the indexing process by calling 'java -jar
> org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar
> index'.
>
> The resulting index will require the smartcn bundlelist to be
> installed; otherwise, during initialization you will see an error in
> the log noting that the smartcn analyzers cannot be instantiated.
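Put together, the indexing steps above amount to a shell session roughly like the following. This is a sketch only: the working-directory name, the ranking file's output name, and the adapted script name 'fetch_data_zh.sh' are assumptions, and steps 3 and 4 are manual edits to the configuration rather than commands; consult [6] and [7] for the authoritative details.

```shell
# 0. set up an indexing working directory with the default configuration
mkdir dbpedia-zh-index && cd dbpedia-zh-index
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init

# 1. build the incoming-links ranking for the Chinese dbpedia, then rename
#    and copy the result (output file name here is an assumption)
./entityrankings.sh zh
cp zh_entityrankings.txt indexing/resources/incoming_links.txt

# 2. download and pre-process the Chinese dbpedia dumps
#    (fetch_data_de.sh adapted as described: 'de' replaced by 'zh')
./fetch_data_zh.sh

# 3. + 4. edit indexing/config/indexing.properties (entityProcessor value,
#    smartcn analyzers per the README.md) and copy the LDpath file in place
cp copy_en_values.ldpath indexing/config/

# 5. run the actual indexing
java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
```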
>
> best
> Rupert Westenthaler
>
> [1] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/
> [2] http://stanbol.apache.org/production/your-launcher#dependencies-to-bundlelist
> [3] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/src/main/bundles/list.xml
> [4] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
> [5] http://code.google.com/p/paoding/
> [6] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/
> [7] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
