Hi Sawyer,

yes, the implementation of STANBOL-855 is finished and the processing of
Chinese texts does work. But I'm still working on some parts, which is why
I have not announced this on the mailing lists yet. In the following I
will provide information for those who want to give it a try.
Feedback very welcome!

On Mon, Dec 31, 2012 at 3:42 AM, Sawyer Chen <[email protected]> wrote:
> Dear all,
>
> I have seen that STANBOL-855 has been resolved and does it means basic
> chinese support now is possible? Do I need to do any configures to enable
> this feature(chinese support)?

For enhancing Chinese text you need to do the following:

1. Include the bundles referenced by the smartcn bundlelist [1]. The best
way is to add this bundlelist to your launcher configuration as explained
in [2]. However, if you like, you can also manually install the three
bundles referenced by the list.xml file [3].

2. Ensure that the Solr index is configured to use the smartcn analyzers
for indexing Chinese text. The README.md file within the [1] directory
provides details on that.

3. Configure the EnhancementChain to include the "smartcn-token" engine.
In addition you should configure the "opennlp-token" engine (search for
"OpenNLP Tokenizer" in http://localhost:8080/system/console/configMgr) to
ignore Chinese texts by adding "!zh" as an additional line of the
"Language configuration" property.

A typical EnhancementChain could look like:

    tika;optional
    langdetect
    opennlp-sentence
    opennlp-token
    smartcn-token
    opennlp-pos
    opennlp-chunker
    {entityhublinking}

If you only want to process Chinese texts you can skip all "opennlp-*"
engines. {entityhublinking} refers to an EntityhubLinkingEngine [4]
configured for your vocabulary managed in an Entityhub Site. You will just
need to configure the name and the site; for the rest the default values
should be fine. I would recommend using the "Weighted Chain"
implementation for configuring this chain.

BTW: I am also testing an alternative way of processing Chinese based on
paoding [5]. But this framework is implemented in a way that makes it
really hard to get it running within an OSGi environment.
So while I have found several resources claiming that paoding gives better
results than smartcn, it might take some more time to get it running.

> Or do I need to download additional dbpeida index including chinese info?

Sorry, for now there is no Chinese dbpedia index available. I am still
working on that part (e.g. just yesterday I fixed STANBOL-869, which was
really hurting the indexing process for the Chinese dbpedia). If you want
to try building your own Chinese dbpedia index you should have a look at
the utilities in [6]:

0. Copy the dbpedia indexing tool (see [7] for how to build it) to an
indexing working directory and then initialize the default configuration
by calling

    java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init

1. Call './entityrankings.sh zh' (part of [6]), as this will create a file
with the incoming links for the Chinese dbpedia. You will need to rename
the resulting file and copy it to 'indexing/resources/incoming_links.txt'.

2. Adapt 'fetch_data_de.sh' for Chinese: basically keep all the English
stuff and replace 'de' with 'zh'. Some Chinese files will be missing
because dbpedia lacks some information for the Chinese version; just
exclude such files. Make sure to execute this script in the indexing
workspace directory, because this ensures that the downloaded and
pre-processed files are copied to the 'indexing/resources/rdfdata'
directory.

3. Add the LDpath source processor to the Entityhub indexing tool
configuration and configure it to use 'copy_en_values.ldpath'. This
ensures that knowledge present in the English dbpedia version is copied
for those Chinese dbpedia entities that define an interlanguage link to
the English version. To add the LDpath source processor you will need to
change the value of the "entityProcessor" parameter in the
'indexing/config/indexing.properties' file. The following value should be
fine:
    entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:copy_en_values.ldpath;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

Make sure to copy the 'copy_en_values.ldpath' file into the
'indexing/config' directory.

4. Make sure to use the smartcn analyzers for indexing the Chinese labels
and comments. How to do this is explained in the "README.md" of [1],
section "Usage with the EntityhubIndexing Tool".

5. Now you can start the indexing process by calling

    java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index

The resulting index will require the smartcn bundlelist to be installed.
Otherwise, during initialization, you will see an error in the log noting
that the smartcn analyzers cannot be instantiated.

best
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/
[2] http://stanbol.apache.org/production/your-launcher#dependencies-to-bundlelist
[3] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/src/main/bundles/list.xml
[4] http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[5] http://code.google.com/p/paoding/
[6] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/
[7] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
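PS: The 'de' -> 'zh' adaptation of the fetch script (step 2 of the
indexing walkthrough) can be done with a simple sed substitution. Below is
a minimal sketch; the sample wget line and the download URL in it are
illustrative assumptions, not copied from the actual fetch_data_de.sh:

```shell
# Create a one-line stand-in for fetch_data_de.sh (assumed content, for
# illustration only; the real script contains many such download calls).
cat > fetch_data_de.sh <<'EOF'
wget http://downloads.dbpedia.org/3.8/de/labels_de.nt.bz2
EOF

# Replace the 'de' language code in the URL path and in the file name.
sed -e 's|/de/|/zh/|g' -e 's/_de\./_zh./g' fetch_data_de.sh > fetch_data_zh.sh

cat fetch_data_zh.sh
# -> wget http://downloads.dbpedia.org/3.8/zh/labels_zh.nt.bz2
```

Download calls that turn out to have no Chinese counterpart on the dbpedia
server should then simply be removed from the generated script.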
