On what platform do you *compile* code?  I'm hoping you compile on an ASCII
platform and only need the code to run under EBCDIC.  If that's the case,
then you do not need to make protoc itself run on EBCDIC, only the code it
generates.

tokenizer.cc is only used by protoc (to parse proto files) and TextFormat.
 If you build on an ASCII machine and don't need to parse TextFormat then
you can ignore tokenizer.cc.  This makes the problem a lot simpler.  I think
all you have to do is fix this code:

http://code.google.com/p/protobuf/source/browse/tags/2.1.0/src/google/protobuf/compiler/cpp/cpp_file.cc#445

The code takes the FileDescriptorProto for the proto file, encodes it, and
embeds the bytes directly into the generated code as a giant string literal.
Since these bytes are actually binary data, you need to make sure no
character set conversion occurs on them.  I'm guessing you could do that by
escaping every character as an octal sequence, rather than calling CEscape(),
which only escapes unprintable characters and quotes.
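
For example, a helper along these lines (just a sketch of the idea, not the
actual cpp_file.cc code) would escape every byte unconditionally:

  // Sketch only: escape *every* byte as a three-digit octal sequence so
  // that no character set conversion can corrupt the embedded descriptor.
  #include <stdio.h>
  #include <string>

  std::string EscapeAllOctal(const std::string& data) {
    std::string result;
    result.reserve(data.size() * 4);  // each byte becomes four characters
    for (size_t i = 0; i < data.size(); i++) {
      char buf[8];
      sprintf(buf, "\\%03o", (unsigned char) data[i]);  // e.g. "\012"
      result += buf;
    }
    return result;
  }

The output would then go between quotes in the generated .pb.cc file, the
same place CEscape()'s output goes today.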

Unfortunately, without TextFormat, many of the unit tests won't work.

** alternative strategy **

Note that tokenizer.cc does not use ctype.h because its behavior differs in
different locales.  If we used ctype.h, then text which is parseable when
the locale is English might not be parseable when the locale is, say,
Russian.  This would be confusing, because in both cases the text being
parsed is written in its own language and should have nothing to do with the
user's spoken language.
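
To make that concrete (hypothetical snippet, not code from tokenizer.cc):

  #include <ctype.h>

  // isalpha()'s answer depends on the current locale, so it can change out
  // from under you at run time.
  bool IsLetterLocale(char c) {
    return isalpha((unsigned char) c) != 0;
  }

  // Explicit comparisons like the ones tokenizer.cc uses depend only on the
  // character values themselves, so the result is the same in every locale.
  bool IsLetterAscii(char c) {
    return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
  }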

So what if we extend the same reasoning to character sets?  Let's say that
.proto files and TextFormat are ASCII formats and are not defined for any
other character set.  If you want to parse text that is in some other
character set, you must convert it to ASCII first, then feed it to the
tokenizer.  One could easily write a ZeroCopyInputStream implementation
which accomplishes this conversion.
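
Something along these lines, say (rough sketch; the class name is made up
and the 256-entry translation table is something you would fill in for your
code page, e.g. IBM-1047):

  #include <stddef.h>
  #include <vector>
  #include <google/protobuf/io/zero_copy_stream.h>

  // Hypothetical EBCDIC-to-ASCII translation table you would supply.
  extern const unsigned char kEbcdicToAscii[256];

  // Wraps another stream and translates each byte from EBCDIC to ASCII
  // before the tokenizer sees it.
  class EbcdicToAsciiInputStream
      : public google::protobuf::io::ZeroCopyInputStream {
   public:
    explicit EbcdicToAsciiInputStream(
        google::protobuf::io::ZeroCopyInputStream* input)
        : input_(input) {}

    bool Next(const void** data, int* size) {
      const void* raw;
      if (!input_->Next(&raw, size)) return false;
      // Copy the block and translate it; the size is unchanged because this
      // is a byte-for-byte conversion.
      const unsigned char* bytes =
          reinterpret_cast<const unsigned char*>(raw);
      buffer_.assign(bytes, bytes + *size);
      for (int i = 0; i < *size; i++) {
        buffer_[i] = kEbcdicToAscii[buffer_[i]];
      }
      *data = buffer_.empty() ? NULL : &buffer_[0];
      return true;
    }

    // One byte in, one byte out, so these can simply delegate.
    void BackUp(int count) { input_->BackUp(count); }
    bool Skip(int count) { return input_->Skip(count); }
    google::protobuf::int64 ByteCount() const { return input_->ByteCount(); }

   private:
    google::protobuf::io::ZeroCopyInputStream* input_;
    std::vector<unsigned char> buffer_;
  };

You would wrap whatever stream you feed to the tokenizer or to
TextFormat::Parse() in one of these.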

So now the problem is -- conceptually, at least -- much easier.  What you
want to do is set your C++ compiler so that all string and character
literals in the protobuf code are kept in ASCII format, not converted to
EBCDIC.  Thus, all the code operates on ASCII, even when running on a system
where EBCDIC is standard, so code like ('a' <= c && c <= 'z') will still
work, so long as c is an ASCII character.  Note that you would only have to
compile the protobuf library and generated .pb.cc files this way -- you can
compile your own code using normal settings, since the protobuf headers do
not contain any string or character literals (I think).
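
As a cheap way to verify that the compiler settings took effect, you could
assert a few literal values somewhere in a test (just a sketch):

  #include <assert.h>

  // If the compiler converted literals to EBCDIC these fail: in EBCDIC-1047
  // 'A' is 0xC1, 'a' is 0x81, and '0' is 0xF0.
  void CheckAsciiLiterals() {
    assert('A' == 0x41);
    assert('a' == 0x61);
    assert('0' == 0x30);
  }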

Now all you have to do is transform text on its way into or out of the
protobuf library.  Fortunately you will use the library primarily on binary
input and output, so not much transformation will be needed.  The only cases
I can think of are TextFormat (which is an entirely optional feature, mostly
for debugging) and error messages.  You can use
SetLogHandler() to catch error messages and convert them:

http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.common.html#SetLogHandler.details
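
For instance (sketch; AsciiToEbcdic() is a hypothetical helper you would
write yourself):

  #include <stdio.h>
  #include <string>
  #include <google/protobuf/stubs/common.h>

  // Hypothetical conversion routine you would supply.
  std::string AsciiToEbcdic(const std::string& ascii);

  // Matches the google::protobuf::LogHandler signature.  The library hands
  // us ASCII text (including __FILE__ names), so convert before printing on
  // an EBCDIC console.
  void MyLogHandler(google::protobuf::LogLevel level, const char* filename,
                    int line, const std::string& message) {
    (void) level;  // severity ignored in this sketch
    fprintf(stderr, "[libprotobuf %s:%d] %s\n",
            AsciiToEbcdic(filename).c_str(), line,
            AsciiToEbcdic(message).c_str());
  }

  // Somewhere during startup:
  //   google::protobuf::SetLogHandler(&MyLogHandler);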

With this strategy you should be able to run all the unit tests -- so long
as you also compile them with ASCII literals.

==========================

Note that regardless of which solution you choose above, the "string" field
type in the .proto language is only allowed to contain UTF-8 text.  If you
intend to store EBCDIC text in your protocol messages, you need to use the
"bytes" type instead.  Or, alternatively, you can transform character sets
whenever you read/write a "string" field -- this approach would be more
tedious but would make it easier for ASCII systems to read/write the same
message format.
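
For instance, if you go the conversion route (sketch; the message type and
the conversion helpers are hypothetical):

  #include <string>
  #include "my_message.pb.h"  // hypothetical generated header containing a
                              // message with:  optional string name = 1;

  // Hypothetical conversion routines you would supply.
  std::string EbcdicToUtf8(const std::string& ebcdic);
  std::string Utf8ToEbcdic(const std::string& utf8);

  void SetName(MyMessage* msg, const std::string& ebcdic_name) {
    // "string" fields must hold UTF-8, so convert on the way in...
    msg->set_name(EbcdicToUtf8(ebcdic_name));
  }

  std::string GetName(const MyMessage& msg) {
    // ...and convert back on the way out.
    return Utf8ToEbcdic(msg.name());
  }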

On Tue, May 19, 2009 at 8:46 PM, David Crayford <dcrayf...@gmail.com> wrote:

> David Crayford wrote:
>
>> http://en.wikipedia.org/wiki/EBCDIC_1047
>>
>> Note the non-contiguous letter sequences. EBCDIC sucks, but it is used on
>> enterprise machines and is still used for the majority of the world's
>> production data. Protocol Buffers seem like a fantastic
>> alternative to XML, which is proving to be a huge bottleneck for mainframe
>> applications trying to web enable legacy applications.
>>
>> What modules/classes contain the ASCII specific code?
>>
>>
> ok, it looks like the problem is tokenizer.cc - the character classes ain't
> gonna cut it for EBCDIC. I think I could patch that by using cctype, but
> there are some comments in there warning against its use?
>
>
>  Kenton Varda wrote:
>>
>>> If your compiler transforms string literals to a non-ASCII character set,
>>> then the code generated by the protocol compiler won't work.  We could
>>> perhaps fix this by escaping every character in the embedded descriptor, but
>>> other problems might come up.  I don't have enough experience with EBCDIC to
>>> know what to expect.
>>>
>>> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>>>
>>>
>>>    I built Protocol Buffers on a z/OS mainframe. Build was fine but the
>>>    unit test choked:
>>>
>>>    libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
>>>    proto.ParseFromArray(data,
>>>    size):
>>>
>>>    Without digging too deep in the code, is the parser capable of
>>>    handling EBCDIC?
>>>
>>>
>>
>>
>
