I have created a tool in Python to extract and transform the UNIHAN database. It’s open source (MIT-licensed) and lets users customize the output. It’s documented extensively at https://unihan-etl.git-pull.com, and the source code is at https://github.com/cihai/unihan-etl.
I split this tool out on its own because of the time and effort it took to study the fields and extract the information correctly. The hope is that one day a traveller going down the same path will find it useful.

The need has come up on this list at least once before, back in 2004:

http://unicode.org/mail-arch/unicode-ml/y2004-m04/0255.html

> I'm trying to pare Unihan.txt down to a less unwieldy size for my own use by eliminating properties that are of no interest to me and would like to be certain that eliminating the four properties containing the actual values for those dictionaries can be done safely because the information can be reconstituted if necessary from the kIRG* properties since I'm not certain if those properties are of interest to me.

There are developers who may only want to extract a pre-determined set of fields.

To install:

    $ pip install --user unihan-etl

And export the values to a CSV (UNIHAN downloads automatically):

    $ unihan-etl

To pull only custom fields (once downloaded, Unihan.zip is cached for reuse):

    $ unihan-etl -f kMandarin kNelson kMorohashi

For structured output in JSON (empty values are pruned automatically):

    $ unihan-etl -f kMandarin kNelson kMorohashi -F json

With pyyaml installed, you can use -F yaml as well:

    $ pip install pyyaml
    $ unihan-etl -f kMandarin kNelson kMorohashi -F yaml

To see all the command line options: http://unihan-etl.git-pull.com/en/latest/cli.html

Container format: To keep the data exports as portable as possible, they follow the Data Packages standard (http://frictionlessdata.io/data-packages/). This is a trickier data set, since its fields pack quite a bit of detail into them. Other data sets such as CEDict will also be made available as data packages.

Backstory: I am trying to create a spiritual successor to cjklib (https://pypi.python.org/pypi/cjklib). The project aims to pull in CJK datasets and make them accessible under one library.
Datasets will also be available a la carte via a consistent data standard (Data Packages). I am opting to use the UNIHAN database as the core of the CJK data sources. The project’s homepage is https://cihai.git-pull.com.
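
P.S. For anyone curious what pulling a pre-determined set of fields amounts to, here is a rough Python sketch that works directly on the raw Unihan data file layout (code point, field name, and value separated by tabs, with # comment lines). This is not how unihan-etl is implemented; the filter_fields helper and the sample lines are made up for illustration, and the "char"/"ucn" keys in the output are an assumed record shape rather than a documented one.

```python
import json

# Illustrative sample in the raw Unihan data file layout:
# code point <TAB> field name <TAB> value, with '#' comment lines.
SAMPLE = (
    "# sample lines only, not a full Unihan file\n"
    "U+3400\tkCantonese\tjau1\n"
    "U+4E00\tkMandarin\tyī\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+597D\tkMandarin\thǎo\n"
    "U+597D\tkTotalStrokes\t6\n"
)


def filter_fields(text, wanted):
    """Hypothetical helper: one record per character, keeping only wanted fields."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ucn, field, value = line.split("\t", 2)
        if field not in wanted:
            continue  # drop fields the caller did not ask for
        char = chr(int(ucn[2:], 16))  # "U+4E00" -> the character itself
        record = records.setdefault(ucn, {"char": char, "ucn": ucn})
        record[field] = value
    # Characters with none of the wanted fields never get a record at all.
    return list(records.values())


rows = filter_fields(SAMPLE, {"kMandarin"})
print(json.dumps(rows, ensure_ascii=False, indent=2))
```

In this sketch a character that has none of the requested fields (U+3400 here) simply never appears in the result, which is one simple way to avoid carrying empty values around.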