Re: [antlr-dev] UNICODE file input for C Runtime

Jim Idle Thu, 18 Mar 2010 14:23:15 -0700

This is not a released version - I have not finished that stuff yet. However, 
if you are not using these things yourself, you should not need to worry about 
it. There should not be any direct dependence on the STRINGs even if something 
tries to set one up. I sometimes wish I had never written them to be honest ;-) 
They only appear if you ask for $T.text.




You can just copy the 8 bit methods for UTF-8 and so on so that things will 
work. The filestream will have the file name wrong perhaps but that should not 
really matter.



Jim



From: Goins, John C (IS) [mailto:[email protected]]
Sent: Thursday, March 18, 2010 2:09 PM
To: Jim Idle; [email protected]
Subject: RE: [antlr-dev] UNICODE file input for C Runtime



Jim -



I'm not sure how to proceed.  Internally the string functions seem to be used 
in various places (I'm not using them in any of my code).  Do you think I 
should just make UTF8 functions and attach them?  If this isn't fixed, no one 
will be able to read in UTF8 files with this version, since these methods are 
NULL when you read in a UTF8 file and yet are called internally.



Thanks



From: [email protected] [mailto:[email protected]] On 
Behalf Of Jim Idle
Sent: Thursday, March 18, 2010 4:55 PM
To: [email protected]
Subject: Re: [antlr-dev] UNICODE file input for C Runtime



I have not supplied string methods for those encodings I am afraid, I did not 
have time. But the string stuff is just a convenience method - for performance 
you should just use the pointers in the tokens.
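
As a rough illustration of working from token pointers rather than string 
objects, the sketch below compares a token's text against a keyword without 
allocating anything. The sketch_token struct and both function names are 
hypothetical stand-ins, not the ANTLR C runtime's API, and the sketch assumes 
an 8-bit input where the stop pointer addresses the last byte of the token 
(inclusive).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a lexer token (not the ANTLR struct):
 * start and stop point into the original input buffer, and stop
 * addresses the last byte of the token text (inclusive). */
typedef struct {
    const unsigned char *start;
    const unsigned char *stop;
} sketch_token;

/* Number of bytes the token spans in the input buffer. */
size_t sketch_token_length(const sketch_token *tok)
{
    return (size_t)(tok->stop - tok->start) + 1;
}

/* Compare the token text with a NUL-terminated keyword in place,
 * without copying the text or building a string object.
 * Returns 1 on an exact match, 0 otherwise. */
int sketch_token_equals(const sketch_token *tok, const char *word)
{
    size_t n = sketch_token_length(tok);
    return strlen(word) == n && memcmp(tok->start, word, n) == 0;
}
```

With a real token, the two pointers would come from whatever start and stop 
values the runtime records for it; for wider encodings the comparison would 
have to be done per code unit rather than per byte.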



Jim







From: Goins, John C (IS) [mailto:[email protected]]
Sent: Thursday, March 18, 2010 1:45 PM
To: Jim Idle; [email protected]
Subject: RE: [antlr-dev] UNICODE file input for C Runtime



Jim -



Thanks, I've integrated this release and used it successfully with UTF 16 and 
ASCII (8 bit) files so far in limited testing.  However, I'm having problems 
with UTF8.  I tracked the problem down to the function antlr3StringFactoryNew() 
inside antlr3string.c.  The case statement only sets the API for UTF16 and 
8BIT.  I can make some more API functions for the rest, if that's all that's 
missing.  I suspect you may have already done so, though.  I believe the case 
statement will need to be filled for all the various types before releasing 
this version, unless I am missing something.  An error occurs in 
antlr3filestream.c line 81 when loading UTF8 files because the newStr8 function 
is null for the input stream.



John



From: [email protected] [mailto:[email protected]] On 
Behalf Of Jim Idle
Sent: Monday, February 22, 2010 4:18 PM
To: [email protected]
Subject: Re: [antlr-dev] UNICODE file input for C Runtime



I think you mean 'standard Unicode encodings' rather than Unicode ;-)



This is built in to the next release, I just have not had time to get to doing 
the actual release. You can get the new sources from



http://fisheye2.atlassian.com/browse/antlr



though they are not well tested as of yet. You can also get a perforce login 
from Terence, or use the git mirror at: http://github.com/antlr
You will need to read through the new source to use it, as I have not had time 
to update the docs yet either.





Jim



From: Goins, John C (IS) [mailto:[email protected]]
Sent: Monday, February 22, 2010 12:40 PM
To: Jim Idle; [email protected]
Subject: RE: [antlr-dev] UNICODE file input for C Runtime



I was wondering if there is source code or a C-Runtime update available yet 
that handles loading UNICODE files in the C-Runtime.  If so, where can I grab 
it from?  Is the next release a 3.x version or will it be 4.x?  TIA





From: [email protected] [mailto:[email protected]] On 
Behalf Of Jim Idle
Sent: Wednesday, January 06, 2010 7:43 PM
To: [email protected]
Subject: Re: [antlr-dev] UNICODE file input for C Runtime



You should find sample C code by searching antlr.markmail.org



However if you can wait a few weeks then the next release will support a 
universal input stream that processes BOM and supports UTF8, UTF16, UTF32, 
ASCII/8bit and EBCDIC.



Jim



From: [email protected] [mailto:[email protected]] On 
Behalf Of Goins, John C (IS)
Sent: Wednesday, January 06, 2010 3:38 PM
To: [email protected]
Subject: [antlr-dev] UNICODE file input for C Runtime



I've found ANTLR very useful as a language parser for my application, but I now 
have a requirement to use UNICODE files as input.  I'm using the C runtime 
since my application is written in C. I hope someone can help me with a couple 
of questions.

There are two bytes at the beginning of a UNICODE file. My application will be 
run on multiple platforms (Java wasn't an option) and I will need to interpret 
the UNICODE BOM (byte order mark) since I don't think ANTLR uses this, is that 
correct?  I can write a function to always set the order to one particular way 
(the input files could come from different architecture machines) by reading 
the BOM myself. I think that is a correct approach, unless there is something 
in the ANTLR C Runtime that can help.
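
Reading the BOM by hand, as described above, could look roughly like the 
following sketch. It is independent of ANTLR: the bom_kind enum and the 
detect_bom function are hypothetical names made up for illustration. Note that 
the BOM is two bytes only for UTF-16; it is three bytes for UTF-8 and four for 
UTF-32, so the sketch also reports how many bytes to skip.

```c
#include <stddef.h>

/* Encodings this sketch can report. Hypothetical names, not ANTLR constants. */
typedef enum {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
} bom_kind;

/* Inspect the first bytes of buf and report which BOM, if any, is present.
 * On return *bom_len holds the number of bytes the BOM occupies (0 if none),
 * so the caller can skip it before handing the data to a lexer. */
bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    *bom_len = 0;

    /* UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with
     * the UTF-16 LE mark FF FE. (Such a file could in theory be UTF-16 LE
     * starting with U+0000; this sketch treats it as UTF-32 LE.) */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        *bom_len = 4;
        return BOM_UTF32_LE;
    }
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) {
        *bom_len = 4;
        return BOM_UTF32_BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;
    }
    return BOM_NONE; /* no BOM: the encoding must be known some other way */
}
```

A caller would read the first four bytes of the file (or fewer for a very 
short file), call detect_bom on them, and then either byte-swap or pick the 
matching input routine based on the result.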

I've read about how I need to convert a UNICODE file to UTF-32 and use the UCS2 
input functions, but I've had little to no success in doing so.  I get lots of 
errors or things just don't parse. Does anyone have sample C code that 
accomplishes this? Or even the functions that I should use and order in which 
to call them?

TIA

_______________________________________________
antlr-dev mailing list
[email protected]
http://www.antlr.org/mailman/listinfo/antlr-dev
