Byte Order Marks

Thomas Beale Sat, 25 Oct 2008 12:05:41 +0100

Adam Flinton wrote:
> Byte Order Marks
>
> What is the default CharSet for OpenEHR ADL?
>
> ASCII? UTF-8?
>


Hi Adam,

UTF-8 is the preferred encoding. See section 3 of
http://www.openehr.org/releases/1.0.1/architecture/am/adl.pdf

> I ask because ADL itself does not anywhere declare a character set & we 
> have had a number of ADL files which have failed (either to be opened or 
> to be transformed into XML) & on each occasion the reason has been the 
> presence of a byte order mark (hex bytes EF BB BF) e.g.
>
> Exception in thread "main" se.acode.openehr.parser.TokenMgrError: 
> Lexical error at line 1, column 1.  Encountered: "\u00ef" (239), after : ""
>     at 
> se.acode.openehr.parser.ADLParserTokenManager.getNextToken(ADLParserTokenManager.java:27554)
>
>
> Equally if a text editor opens an ADL, assumes UTF-8 & puts on a BOM 
> then the Archetype editor dies in addition to the Java ADL parser & the 
> Windows ADL2XML converter.
>   

BOMs should not be used in UTF-8 files. UTF-8 has no byte-order 
ambiguity, so a BOM is only needed for UTF-16 (and UTF-32) files. 
Lots of programs unfortunately add one to UTF-8 files anyway, and 
it breaks things on Unix and Unix-like environments. The ADL parser in 
the ADL Workbench detects BOMs and ignores them. To see Unicode working, 
you can try the unicode/family_history archetype (it is in Farsi) in the 
dev/test area on SVN in the ADL Workbench. It works in all 
languages we have tested.
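
For what it's worth, the "\u00ef" (239) in the lexical error quoted above 
is consistent with the first byte of a UTF-8 BOM (EF) being read as a 
character of its own. Here is a minimal Python sketch of that effect. The 
sample text and the choice of Latin-1 as the byte-per-character decoding 
are assumptions for illustration, not something taken from the Java parser.

```python
import codecs

# Illustrative stand-in for the start of an ADL file (assumed content).
text = "archetype\n"

# Hypothetical editor output: the same text saved as "UTF-8 with BOM".
raw = codecs.BOM_UTF8 + text.encode("utf-8")

# The UTF-8 BOM is the three bytes EF BB BF.
assert raw[:3] == b"\xef\xbb\xbf"

# A reader that maps each byte to one character (Latin-1 here) sees
# U+00EF, decimal 239, as the very first character of the file.
first_char = raw.decode("latin-1")[0]
print(repr(first_char), ord(first_char))

# A UTF-8 decoder turns the same three bytes into one character, U+FEFF,
# which a lexer still has to skip before the real content starts.
bom_char = raw.decode("utf-8")[0]
print(hex(ord(bom_char)))
```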

If you really want proof (;-) you can see how the ADL parser (used 
inside the Archetype Editor and ADL Workbench) does its character 
matching. See
http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/structures/syntax/dadl/parser/dadl_scanner.l
and
http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/components/adl_parser/src/syntax/cadl/parser/cadl_scanner.l

These are the dADL and cADL lexers respectively. They parse all strings 
in a UTF-8 aware way. In addition, you can see here where the BOMs are 
removed from ADL files, if found:
http://www.openehr.org/svn/ref_impl_eiffel/BRANCHES/specialisation/libraries/common_libs/src/file_system/file_context.e
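
The BOM removal is simple in principle. The following is a Python sketch 
of the idea, not a transcription of the Eiffel code in file_context.e. 
The function names strip_utf8_bom and read_adl_text are made up for the 
example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_adl_text(path: str) -> str:
    """Read a file as UTF-8 text, tolerating an optional leading BOM."""
    with open(path, "rb") as f:
        return strip_utf8_bom(f.read()).decode("utf-8")
```

Python's built-in "utf-8-sig" codec does the same job when decoding, so 
open(path, encoding="utf-8-sig") would be an alternative to the helper 
above. Either way, the parser behind it never sees the BOM.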

>   
> Our standard for XML is UTF-8 & I am wondering, if the standard in ADL is 
> ASCII, then how does/will ADL support extended character sets?
>
> e.g. in one of our adl there is some Dutch including     
>
> "Een patiënt in rolstoel moet zonder hulp met hoeken en deuren kunnen 
> omgaan" where "patiënt" is mis-rendered (though given mail clients are 
> pretty good this will probably be correctly rendered)
>   

See above, there is no problem in ADL files - they support UTF-8 and 
have been tested extensively. If there are problems in other tools, we 
need to look carefully at the actions that lead to the problems. I can't 
answer off the top of my head whether there are cut and paste problems 
in the Archetype Editor, but I don't believe there should be any 
problems in its ADL serialisation. Perhaps there are inconsistencies in 
the XML serialiser - it is a separate piece of code. We should certainly 
get it fixed at source if there are any problems, but I am sure they 
will be in the code, not the archetype files themselves.
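
As an illustration of how a tool, rather than the file, can damage a 
character such as "ë", here is a small Python sketch. It assumes two 
common failure modes (forcing text into ASCII, and reading UTF-8 bytes as 
Latin-1). I am not claiming either is what actually happens in the 
Archetype Editor or the XML serialiser.

```python
word = "pati\u00ebnt"  # "patiënt", written with an escape to keep this source ASCII

# UTF-8 stores "ë" as the two bytes C3 AB; decoding as UTF-8 restores it.
utf8_bytes = word.encode("utf-8")
assert utf8_bytes == b"pati\xc3\xabnt"
assert utf8_bytes.decode("utf-8") == word

# Forcing the text into ASCII loses the character: it is replaced by "?".
ascii_bytes = word.encode("ascii", errors="replace")
print(ascii_bytes)

# Reading the UTF-8 bytes as Latin-1 yields two wrong characters instead.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)
```

If a file's bytes decode cleanly as UTF-8, as in the first step, the 
damage is being introduced by whichever tool reads or re-encodes them.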

- thomas beale
