Adam Flinton wrote:
> Byte Order Marks
>
> What is the default CharSet for OpenEHR ADL?
>
> ASCII? UTF-8?
>
Hi Adam,

UTF-8 is the preferred encoding. See section 3 of
http://www.openehr.org/releases/1.0.1/architecture/am/adl.pdf

> I ask because ADL itself does not anywhere declare a character set & we
> have had a number of adl files which have failed (either to be opened or
> to be transformed into XML) & in each occasion the reason has been the
> presence of a byte order mark (hex bytes EF BB BF) e.g.
>
> Exception in thread "main" se.acode.openehr.parser.TokenMgrError:
> Lexical error at line 1, column 1. Encountered: "\u00ef" (239), after : ""
> at
> se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)
>
>
> Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM
> then the Archetype editor dies in addition to the Java ADL parser & the
> Windows ADL2XML converter.
>

BOMs should not be used in UTF-8 files; they are meant for UTF-16 (and
UTF-32) files, where byte order actually matters. Lots of broken programs
unfortunately add a BOM to UTF-8 anyway, and it breaks things on unix or
unix-like environments.

The ADL parser in the ADL Workbench detects BOMs and ignores them. To see
Unicode working, you can try the unicode/family_history archetype (in the
dev/test area on SVN) in the ADL Workbench; it is in Farsi. It works in all
languages we have tested.

If you really want proof (;-) you can see how the ADL parser (used inside
the Archetype Editor and ADL Workbench) does its character matching - see

http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/structures/syntax/dadl/parser/dadl_scanner.l

and

http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/components/adl_parser/src/syntax/cadl/parser/cadl_scanner.l

These are the dADL and cADL lexers respectively. They parse all strings in
a UTF-8 aware way.

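For anyone whose own tool chokes on a BOM, the "detect and ignore" handling
mentioned above can be sketched roughly as follows. This is only an
illustration in Python, not the Workbench's Eiffel code; the function name
read_adl_text is made up for the example, and it assumes the file really is
UTF-8.

```python
import codecs


def read_adl_text(raw: bytes) -> str:
    """Decode the bytes of an ADL file as UTF-8, dropping a leading BOM.

    Hypothetical helper for illustration only; it is not part of any
    openEHR tool.
    """
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Raises UnicodeDecodeError if the remaining bytes are not valid UTF-8.
    return raw.decode("utf-8")
```

Python's built-in "utf-8-sig" codec does the same thing on decode, so
raw.decode("utf-8-sig") would be an equivalent one-liner.
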
In addition, you can see here where the BOMs are removed from ADL files, if
found -
http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/file_system/file_context.e

>
> Our standard for XML is UTF-8 & I am wondering that if the std is in ADL
> ASCII then how does/will adl support extended character sets?
>
> e.g. in one of our adl there is some Dutch including
>
> "Een pati?nt in rolstoel moet zonder hulp met hoeken en deuren kunnen
> omgaan" where "pati?nt" is mis-rendered (though given mail clients are
> pretty good this will probably be correctly rendered)
>

See above: there is no problem in ADL files. They support UTF-8 and have
been tested extensively. If there are problems in other tools, we need to
look carefully at the actions that lead to the problems. I can't say off
the top of my head whether there are cut and paste problems in the
Archetype Editor, but I don't believe there should be any problems in its
ADL serialisation. Perhaps there are inconsistencies in the XML serialiser -
it is a separate piece of code. We should certainly get it fixed at source
if there are any problems, but I am sure they will be in the code, not the
archetype files themselves.

- thomas beale

