Hi Jose

On 07/09/12 02:06, Jose Joao Dias de Almeida wrote:


On 09/05/2012 11:25 PM, Ron Savage wrote:
Hi Jose

On 05/09/12 22:07, Jose Joao Dias de Almeida wrote:
Dear Gedcom-ers,
I just started with Gedcom.pm and things are beginning to work!

But I have problems with files in Unicode.

When a file is in UTF-8 with a BOM, it returns an error on the first
line (the BOM).

If I remove the BOM and try again, apparently it does not pay attention
to "1 CHAR UTF-8"

Is there anything extra I need to specify in these cases?
Best regards
J.Joao

0 HEAD
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED

1 LANG Portuguese
1 SOUR MYHERITAGE
...

I just checked the source code of Gedcom.pm V 1.16, and the only
reference to utf8 is on line 390, where it is writing an XML file.

So, you're right, of course.

What do you think should happen?

Perhaps if the code detects 1 CHAR UTF-8 the input file should be closed
and re-opened in utf-8 mode, yes?

I think that would solve the problem.

Probably a simple
if(/1 CHAR UTF-8/){ binmode(...) }
would also work.

I believe binmode does not work after the file has been read. That is, it must be executed immediately after opening the file, before the 1st read.

But see File::BOM. Gedcom needs to use that.

The string 1 CHAR UTF-8 obviously must be read in before knowing the file is intended to be utf8, so File::BOM won't solve that problem.
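
One way around the ordering problem described above is to read the file as raw bytes, look for the CHAR line (the header tags are plain ASCII, so they can be matched before any decoding), and only then decode. This is a minimal sketch, not what Gedcom.pm does; decode_gedcom is a hypothetical helper name, and the sketch assumes the whole file fits in memory and that only UTF-8 needs handling.

```perl
use strict;
use warnings;
use Encode qw(decode);

# decode_gedcom (hypothetical helper, not part of Gedcom.pm):
# takes the raw bytes of a GEDCOM file and returns a character string
# if the header declares UTF-8, otherwise returns the bytes unchanged.
sub decode_gedcom {
    my ($bytes) = @_;

    # Drop a leading UTF-8 BOM so the first line parses as "0 HEAD".
    $bytes =~ s/^\xEF\xBB\xBF//;

    # "1 CHAR UTF-8" is pure ASCII, so it can be matched on raw bytes.
    if ($bytes =~ /^\s*1\s+CHAR\s+UTF-?8\s*$/m) {
        return decode('UTF-8', $bytes);
    }
    return $bytes;
}
```

The input would come from a handle opened in raw mode, for example open(my $fh, '<:raw', $path), and slurped into a scalar before calling decode_gedcom. Because the decoding happens on the string rather than on the file handle, the question of whether binmode still works after the first read does not arise.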

One extra thing:
Some Unicode files start with a byte order mark (BOM) that signals
which Unicode encoding is used.

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

It would be nice if we could at least skip/ignore them (for example,
Myheritage tools are generating GEDCOM files with BOMs).

eg:

$ged =~ s/^(\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\xFF\xFE|\xFE\xFF|\xEF\xBB\xBF)//;  ## remove BOM !
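
If the encoding implied by the BOM is also useful, the same byte patterns can be checked one at a time rather than in a single substitution. The following is only a sketch under that assumption; strip_bom and @BOMS are hypothetical names, not part of Gedcom.pm, and the input is assumed to be raw (undecoded) bytes.

```perl
use strict;
use warnings;

# BOM signatures from the table above. The 4-byte marks come first so
# that UTF-32 little-endian (FF FE 00 00) is not mistaken for
# UTF-16 little-endian (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# strip_bom (hypothetical helper): takes raw bytes and returns a pair:
# the encoding name implied by the BOM (undef if there is none) and
# the bytes with the BOM removed.
sub strip_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($mark, $name) = @$entry;
        if (substr($bytes, 0, length($mark)) eq $mark) {
            return ($name, substr($bytes, length($mark)));
        }
    }
    return (undef, $bytes);
}

my ($enc, $rest) = strip_bom("\xEF\xBB\xBF0 HEAD\n");
print defined $enc ? $enc : 'none', "\n";
```

A file without a BOM comes back untouched with an undefined encoding, so the caller can still fall back on the 1 CHAR line.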

Best regards
J.Joao





--
Ron Savage
http://savage.net.au/
Ph: 0421 920 622
