Re: Suggestions?
On 22.02.2018 05:01, David Starner via Unicode wrote: > On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode wrote: > > Where can I post suggestions and feedback for Unicode? > Here is as good as any place. There are specific places for a few specific things, but likely if you do have something that's likely to get changed, you'll need the help of someone here to get through the process. It is a quarter-century old technical standard embedded in most electronics, so I would temper any expectations for major changes; it works the way it works because that's the way previous versions worked, and nobody is interested in the trouble changing them would involve. Yes and no. This list is for informal discussion, so someone unsure about things may start here, but posting on this list does not count as feedback or suggestions to Unicode. So by all means post here some of your ideas and understand more. Regards John Knightley
Re: Suggestions?
On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode < unicode@unicode.org> wrote: > Where can I post suggestions and feedback for Unicode? > Here is as good as any place. There are specific places for a few specific things, but likely if you do have something that's likely to get changed, you'll need the help of someone here to get through the process. It is a quarter-century old technical standard embedded in most electronics, so I would temper any expectations for major changes; it works the way it works because that's the way previous versions worked, and nobody is interested in the trouble changing them would involve.
Re: Suggestions?
http://www.unicode.org/faq/faq_on_faqs.html#34
Re: Suggestions?
The Unicode website has a section for feedback in its menu, but in separate projects for TUS and for CLDR. Feedback is also requested for every proposed amendment to the standard, annexes, and data. First search for the relevant topic on the website, then look at the sidebar if there's no specific feedback link in the main page content. Feedback or proposals are submitted through an online form, and are then forwarded by email to interested subcommittees and possible subscribers. For data submission to CLDR, this is done with the survey tool, when it is open. For reference implementations that have an open-source repository, feedback is submitted via the links given in the repository itself. Basically, you need to look for the most relevant topic, and then use the appropriate link so that your feedback can be sorted and sent to the correct people. There's also a feedback link for questions related to Unicode memberships, or for legal requests. There's also a general feedback link, but don't expect a quick response: it may take time to reach the right people to get an answer, and unsorted or unqualified feedback takes time to be classified and extracted from the fog of incoming spam or irrelevant submissions. If you don't know where to post, this mailing list can guide you, but this is not the place to submit a formal request. Various people (including me) may reply to you, and any reply you receive from this list is not officially endorsed by Unicode. This is more a "community" list used to connect interested people and discuss how to improve proposals, get guidance before submitting a qualified formal request, or ask for peer review before submitting it. 2018-02-21 16:23 GMT+01:00 Jeb Eldridge via Unicode: > Where can I post suggestions and feedback for Unicode?
Re: Suggestions?
On 2/21/2018 7:23 AM, Jeb Eldridge via Unicode wrote: Where can I post suggestions and feedback for Unicode? What kinds of suggestions / what kind of feedback are we talking about? A./
Suggestions?
Where can I post suggestions and feedback for Unicode?
Re: Suggestions in Unicode Indic FAQ
Kent Karlsson wrote: > Consider English. If I write "nnnn", that may well be a spell error. Or even "Ŋŋŋŋ!", as Michael Everson wrote in WG2 N2306. -Doug Ewell Fullerton, California
RE: Suggestions in Unicode Indic FAQ
> > No, with proper reordering (and "normal" display mode), the e-matra at > > the beginning of the second word would appear to be last glyph of the > > first "word". Similarly, for the second case, the e-matra glyph would > > have come to the left of the pa. The fluent reader (ok, not me...) > > would then see those errors anyway, just like I can find spelling > > errors in Swedish, most often without any kind of special marking. (I'm > > assuming through-out that reordrant combining characters > are reordered.) > > Illegal sequences There are no illegal sequences. > are not reordered as you indicated. Then that is a problem with the display software you are using. > Also, as far as I > know there is no mention of reordering of illegal input sequence (or > invalid combining mark) in Unicode standard. Again, there are no "illegal input sequences". > Consider the last set of glyphs (left-to-right, top-to-bottom) in the > attached image. It is the rendering effect of illegal input sequence See above. > "Devanagari Vowel Sign I" [U+093F] + "Devanagari Letter Ka" > [U+0915] and without any dotted circle. Let's see if I understand you. <093F, 0915> is the input. Since 093F is a combining character, one should (not must, but should) treat this *as if* the input was <0020, 093F, 0915>. Since 093F is also reordrant, one must reorder it before the preceding base character (at least, more for consonant clusters), so the output glyphs would be <>. (But your image does not show that.) > As you might be knowing the correct input > sequence should be U+0915 followed by U+093F. That would be a different input (whether that is correct or not depends on the authors intent). > In that case the result would > have been similar to what appears right now. Similar ONLY if you disregard the space "glyph" that should have been there. 
> (Though some more > sophisticated font/application may want to replace the > appearing glyph for > U+093F to be substituted by some other glyph with proper > attachment point). That may be. > Now there is no way that user can identify this illegal input sequence > without dotted circle. Yes, there is. Don't disregard the space "glyph". > In the worst case even this rendered glyph is > attached to the character from a class (for example, > consonant cluster of > "Ka" "Virama" "Ma") for which the glyph has been designed to > render with. > In such case even a fluent reader can not identify the error. > > > > > There are spelling errors, yes. But there are other ways > of indicating > > spelling errors, that are (by now) fairly conventional for > any language > > (as long as there is an appropriate dictionary installed), > and that also > > are more general (in catching more spelling errors) and > less obtrusive > > (the author really wants to write it that way, for some reason). > > > > > Apparently, Michka used a non-OpenType Bengali Unicode font when > > > he embedded the fonts into the page. As long as you are looking > > > at the page on-line, with the embedded fonts, these errors are > > > invisible. > > > > > > It may be typographically horrible. It *should* be > typographically > > > horrible in order to illustrate bad sequences clearly. > > > > I'd prefer little red wiggly lines under the word, or > yellow background > > or some such (just for screen display, not for printing; > screen grabs > > not counted). And that for any spelling "error". > > Spelling mistakes can be categorized into two different classes. ??? > One > arising from illegal input sequence (e.g., Vowel Sign E as the first > character in a word) There are no illegal input sequences. > and the other one is legal input sequence with no > contextual meaning in the dictionary. A simple spell checker just checks if the word is in the dictionary or not (without worrying about the context). 
That would catch what you call "illegal input sequences" too. > While indication of the second type > of mistake is generally used only in sophisticated > applications like word processor, Why? There is nothing in principle preventing a spell checker from being used in a "plain text" editor. > everyone wants to know the first kind of mistake. Without a spell checker, but with proper rendering, spelling errors can be detected by a fluent reader, since they look different also without any dotted circles. For some ambiguous Indic cases, like a prefix matra, consonant, postfix matra, all possible character sequences for them are misspellings (as far as I know). > With your > explanation it seems that even plain text editor is not > useful at all to identify such common typing mistakes! Consider English. If I write "nnnn", that may well be a spell error. Do I deserve to get the rendering of that string to be littered by dotted circles just because a sequence of four n's "has to" be a spell error? /Kent K > - Keyur
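Kent's point that a plain dictionary lookup already catches a mis-ordered matra can be illustrated with a minimal sketch. The word list and the function name `misspelled_words` are hypothetical and chosen only for illustration; a real spell checker would load a full dictionary and segment words far more carefully.

```python
# Hypothetical word list; a real checker would load a full dictionary.
DICTIONARY = {
    "\u0915\u093f",        # KA + VOWEL SIGN I
    "\u0915\u093e\u092e",  # KA + VOWEL SIGN AA + MA
}


def misspelled_words(text, dictionary=DICTIONARY):
    """Return the whitespace-separated words that are not in the dictionary."""
    return [word for word in text.split() if word not in dictionary]


# The first word is KA + VOWEL SIGN I (listed); the second is the same two
# characters in the opposite order (not listed), so only the second is flagged.
print(misspelled_words("\u0915\u093f \u093f\u0915"))
```

No knowledge of "legal" or "illegal" sequences is needed here: the reversed sequence is flagged simply because it is not a listed word, exactly as "nnnn" would be for English.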
RE: Suggestions in Unicode Indic FAQ
> --- Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > > > > No fallback rendering is coming into picture with your explanation. > > > > Yes, there is. A character sequence (say) > > is very unlikely to have a ligature, specially adapted (and fitting) > > adjustment points, or similar. The rendering would in that sense > > need to use a fallback mechanism that renders an "approximation" > > for this rare combination. > > Do you mean to say that an application has to take care of combination of s/has to/should, also in display,/ > all other Unicode characters with each combining marks in the fallback Including multiple combining marks on one base character. > mechanism for such approximation? Can you count the number of combinations > which may result in millions!? Many, many more. Which is why you need a fallback mechanism (rather than ligatures, adjustment points, etc. which cannot handle that many combinations). In the case of Indic postfix and prefix matras, the general handling is in principle simple: for the postfix ones, nothing special need be done; for the prefix ones (i.e. the reordrant ones) do the reordering (before the preceding base character at least; for certain Indic combinations, move it even earlier). Then you have the "visual order". I'm ignoring ligature formation here, but that has to be done as well. For the superscript, subscript, and split matras (and other combining marks) the general approach is a bit more complicated. See http://www.unicode.org/notes/tn2/ for hints. /Kent K
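The simple case Kent describes, moving a prefix matra before the preceding base character to obtain the visual order, might be sketched as follows. This is an illustration under stated assumptions, not a shaping engine: `PREFIX_MATRAS` is a deliberately incomplete, hand-picked set, the name `to_visual_order` is made up for this example, and the sketch ignores consonant clusters, ligatures, and split matras.

```python
import unicodedata

# Deliberately incomplete, illustrative set of prefix (reordrant) matras.
PREFIX_MATRAS = {
    "\u093f",  # DEVANAGARI VOWEL SIGN I
    "\u09bf",  # BENGALI VOWEL SIGN I
    "\u09c7",  # BENGALI VOWEL SIGN E
}


def to_visual_order(text):
    """Move each prefix matra in front of the base character it follows."""
    out = []
    for ch in text:
        if ch in PREFIX_MATRAS and out:
            # Walk back over any combining marks (e.g. a nukta) to the base.
            j = len(out) - 1
            while j > 0 and unicodedata.category(out[j]).startswith("M"):
                j -= 1
            out.insert(j, ch)
        else:
            # Everything else, including a matra with nothing before it,
            # stays where it is.
            out.append(ch)
    return "".join(out)
```

With this sketch, KA + VOWEL SIGN I comes out as matra then KA, while SPACE + VOWEL SIGN I + KA comes out as matra, space, KA, so the space between the matra glyph and the KA glyph remains visible to a reader.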
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > > No fallback rendering is coming into picture with your explanation. > > Yes, there is. A character sequence (say) > is very unlikely to have a ligature, specially adapted (and fitting) > adjustment points, or similar. The rendering would in that sense > need to use a fallback mechanism that renders an "approximation" > for this rare combination. Do you mean to say that an application has to take care of combination of all other Unicode characters with each combining marks in the fallback mechanism for such approximation? Can you count the number of combinations which may result in millions!? - Keyur
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > > Without that dotted circle appearing, the e-matra would appear to > > have been properly encoded, > > No, with proper reordering (and "normal" display mode), the e-matra at > the beginning of the second word would appear to be last glyph of the > first "word". Similarly, for the second case, the e-matra glyph would > have come to the left of the pa. The fluent reader (ok, not me...) > would then see those errors anyway, just like I can find spelling > errors in Swedish, most often without any kind of special marking. (I'm > assuming through-out that reordrant combining characters are reordered.) Illegal sequences are not reordered as you indicated. Also, as far as I know there is no mention of reordering of illegal input sequence (or invalid combining mark) in Unicode standard. Consider the last set of glyphs (left-to-right, top-to-bottom) in the attached image. It is the rendering effect of illegal input sequence "Devanagari Vowel Sign I" [U+093F] + "Devanagari Letter Ka" [U+0915] and without any dotted circle. As you might be knowing the correct input sequence should be U+0915 followed by U+093F. In that case the result would have been similar to what appears right now. (Though some more sophisticated font/application may want to replace the appearing glyph for U+093F to be substituted by some other glyph with proper attachment point). Now there is no way that user can identify this illegal input sequence without dotted circle. In the worst case even this rendered glyph is attached to the character from a class (for example, consonant cluster of "Ka" "Virama" "Ma") for which the glyph has been designed to render with. In such case even a fluent reader can not identify the error. > > There are spelling errors, yes. 
But there are other ways of indicating > spelling errors, that are (by now) fairly conventional for any language > (as long as there is an appropriate dictionary installed), and that also > are more general (in catching more spelling errors) and less obtrusive > (the author really wants to write it that way, for some reason). > > > Apparently, Michka used a non-OpenType Bengali Unicode font when > > he embedded the fonts into the page. As long as you are looking > > at the page on-line, with the embedded fonts, these errors are > > invisible. > > > > It may be typographically horrible. It *should* be typographically > > horrible in order to illustrate bad sequences clearly. > > I'd prefer little red wiggly lines under the word, or yellow background > or some such (just for screen display, not for printing; screen grabs > not counted). And that for any spelling "error". Spelling mistakes can be categorized into two different classes. One arising from illegal input sequence (e.g., Vowel Sign E as the first character in a word) and the other one is legal input sequence with no contextual meaning in the dictionary. While indication of the second type of mistake is generally used only in sophisticated applications like word processor, everyone wants to know the first kind of mistake. With your explanation it seems that even plain text editor is not useful at all to identify such common typing mistakes! - Keyur
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote: ... > > No fallback rendering is coming into picture with your explanation. Yes, there is. A character sequence (say) is very unlikely to have a ligature, specially adapted (and fitting) adjustment points, or similar. The rendering would in that sense need to use a fallback mechanism that renders an "approximation" for this rare combination. ... > Here is the para you are talking about. > > [Quote] [...] > should be rendered as if they had a space as a base character." > [/Quote] > > In the text there is no mention of explicitly inputting space character > before any combining mark that is defective combining character. The text says "as if". Which I also emphasised before. > Also, the word "should be rendered" implies that it is recommendation. Yes. A rather good one. > > By removing that particular fallback mechanism from implementations [inserting dotted circle glyphs for allegedly "invalid" combinations] > > as well as the TUS text! (I'm serious!) This particular fallback > > mechanism is NOT recommended as it stands. > > Note that the text has been written in the section "Implementation > Guidelines". Can't it be considered as recommendation? That particular one, no. Just an example [that isn't very good, outside of a general "show invisibles" mode]. > > But since its mention is erroneously taken as a recommendation, I'd > > suggest removing also its mention. > > This is disastrous! What will happen to the systems which already > implemented this recommendations!? It's not a recommendation. > Will they be considered invalid > implementation afterwards? What is about stability? They are ugly implementations as they are. And will stay ugly implementations. Stability is good ;-). /Kent K
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > Clearly, since in this case the sign is not > > preceded by any consonant base, it has to be rendered using one of the > > mechanisms specified in fallback rendering of non-spacing marks. > > If it is preceded by a SPACE (or is first in a string/paragraph/similar) > it should be rendered as a "freestanding" glyph (no dotted circle). If it > is preceded, in the source string, by, say, FULL STOP, a typographically > acceptable rendering would be to have the vowel sign E glyph float on > top of the glyph for the FULL STOP (no dotted circle). No fallback rendering is coming into picture with your explanation. > > I add that this is a good way of displaying a combining mark that has no > > base character, i.e. one occurring at the begin of a line or paragraph. > No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page > 121). Now here you are really talking about fallback rendering :-). Here is the para you are talking about. [Quote] "In a degenerate case, a nonspacing mark occurs as the first character in the text or is separated from its base character by a line separator, paragraph separator, or other formatting character that causes a positional separation. This result is called a defective combining character sequence (see chapter 3.5, Combinations). Defective combining character sequences should be rendered as if they had a space as a base character." [/Quote] In the text there is no mention of explicitly inputting space character before any combining mark that is defective combining character. Also, the word "should be rendered" implies that it is recommendation. > > > > Then how can we take care of fallback mechanism? > > By removing that particular fallback mechanism from implementations > as well as the TUS text! (I'm serious!) This particular fallback > mechanism is NOT recommended as it stands. Note that the text has been written in the section "Implementation Guidelines".
Can't it be considered as recommendation? (although not necessary for implementation) > But since its mention is erroneously taken as a recommendation, I'd > suggest removing also its mention. This is disastrous! What will happen to the systems which already implemented this recommendations!? Will they be considered invalid implementation afterwards? What is about stability? - Keyur
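The definition quoted from TUS 3.0 in this message (a mark that starts the text, or that is cut off from its base by a line or paragraph separator) can be turned into a small detector. The following is a rough sketch: the function name `defective_mark_positions` is made up, and "formatting character that causes a positional separation" is approximated, as an assumption, by the general categories Zl, Zp, and Cc. Note that the test uses the general category rather than the canonical combining class, because most Indic matras have combining class 0.

```python
import unicodedata


def defective_mark_positions(text):
    """Return the indices of combining marks that have no base before them."""
    positions = []
    has_base = False
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category.startswith("M"):
            # Mn, Mc, Me: a combining mark. It is defective if no base
            # character has been seen since the start or the last separator.
            if not has_base:
                positions.append(i)
        elif category in ("Zl", "Zp", "Cc"):
            # Assumption: line/paragraph separators and control characters
            # (such as "\n") cut a mark off from any earlier base.
            has_base = False
        else:
            # Any other character, including SPACE, can serve as a base.
            has_base = True
    return positions
```

A renderer could use such positions to decide where to behave *as if* a space were present, without ever inserting one into the backing store.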
RE: Suggestions in Unicode Indic FAQ
Kent Karlsson wrote, > > I add that this is a good way of displaying a combining mark that has no > > base character, i.e. one occurring at the begin of a line or paragraph. > > No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121). So it says. But, the 'space method' could be interpreted as being under the "Simple Overlap" fallback rendering method, since the paragraph from which you are quoting appears immediately after the paragraph dealing with the "Simple Overlap" method. In the first paragraph under "Fallback Rendering" (bottom of page 120), the "Show Hidden" method is described. It would have been redundant to say that degenerate cases **under the "Show Hidden" method** need to be displayed as if on a dotted circle, since the "Show Hidden" method is **all about displaying on dotted circles** whenever there is an inability to draw the sequence. Like, in the event of an invalid sequence. Whether to use "Show Hidden" or "Simple Overlap" method to display invalid or degenerate sequences should be left up to the various software developers. It does seem to be very easy to spot bad input when the dotted circle appears in the display. Stands out like a sore thumb, which was probably the intention. Perhaps this section of T.U.S. could stand some clarification. Best regards, James Kass
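The choice James describes, leaving it to the developer whether a base-less mark is shown on a dotted circle or as if on a space, can be sketched as a display-time switch. This is only an illustration: the function `with_fallback_base` and its `show_hidden` flag are hypothetical names, the sketch handles only a mark at the very start of the text, and the returned string is meant for display, not for storage.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # U+25CC DOTTED CIRCLE
SPACE = "\u0020"          # U+0020 SPACE


def with_fallback_base(text, show_hidden=False):
    """Return a display string with a fallback base before a leading mark."""
    if text and unicodedata.category(text[0]).startswith("M"):
        # Pick the display-only base according to the chosen mode.
        base = DOTTED_CIRCLE if show_hidden else SPACE
        return base + text
    # Text that starts with a base character (or is empty) is left alone.
    return text
```

Making the mode a parameter rather than a fixed behaviour also covers the use case raised elsewhere in the thread, where a single document wants to show a matra both with and without the dotted circle.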
RE: Suggestions in Unicode Indic FAQ
At 01:20 AM 1/30/2003, Marco Cimarosti wrote: However, I totally agree with Kent that this funny rendering is *not* a requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is just an example of many "several methods [that] are available to deal with" strange sequences. Perhaps there is some confusion here because the use of the dotted circle is an explicit recommendation of the MS Indic OpenType spec, which is what the majority of Indian font developers are now working with. So there may be some confusion between what is expected in the MS spec and what is required by Unicode. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] A book is a visitor whose visits may be rare, or frequent, or so continual that it haunts you like your shadow and becomes a part of you. - al-Jahiz, The Book of Animals
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote: > > However, I totally agree with Kent that this funny > rendering is *not* a > > requirement of the Unicode standard, as Keyur Shroff seems > to suggest. It > > is just an example of many "several methods [that] are > available to deal > > with" strange sequences. > > A sequence should not be treated as "strange" sequence if it has been > written intentionally. It may have some contextual meaning. I said "strange" in the sense of character sequences that are not part of the ordinary spelling of any language. In fact, a thing like a matra floating in the air or on a dotted circle is something that you'd only see in a text (not necessarily *in* an Indian language) which talks about spelling, character sets, and the like. > Also, what is good or bad is also subjective. It may also > vary from one script to another. Yes, but what is mandatory and what is not in Unicode should not be too subjective, else we could not call it a "standard". _ Marco
RE: Suggestions in Unicode Indic FAQ
> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947] > appearing after any non-consonant character. This sign is generally > attached to the consonants. It has zero advance width with negative left > side bearing in the font. Ok. > Clearly, since in this case the sign is not > preceded by any consonant base, it has to be rendered using one of the > mechanisms specified in fallback rendering of non-spacing marks. If it is preceded by a SPACE (or is first in a string/paragraph/similar) it should be rendered as a "freestanding" glyph (no dotted circle). If it is preceded, in the source string, by, say, FULL STOP, a typographically acceptable rendering would be to have the vowel sign E glyph float on top of the glyph for the FULL STOP (no dotted circle). Similarly for a vowel sign E that follows a LATIN CAPITAL LETTER A. (But I don't expect good positioning, just readable.) Again similarly, a vowel sign I that follows an EQUAL SIGN should be rendered as a vowel sign I glyph to the left of an EQUAL SIGN glyph. No dotted circle. (I know that the reordrant vowel signs may reorder over more than the preceding base character IF it is a (sub)string in an Indic script.) Again similarly, a string should be rendered as a KA + II + II glyph sequence (invoking any ligature for KA + II if there is one in the font; II + II is unlikely to have any ligature, since it is not used by any orthography). No dotted circle(s). The fallback hinted at in TUS 3.0 that uses dotted circles is 1) typographically horrible, and 2) cannot indicate that there is any error in the given character sequence. ... > the application. Now in order to render it with dotted circle if we > introduce the circle in the text before this sign then also > the circle is invalid base for this "Vowel Sign E". No base character is invalid for any combining character. ... > > Languages or syllable boundaries have nothing to do with this. 
These > > special > > sequences should *never* be part of any syllable or word in any language: > > they are just a way of showing the shape of a glyph, to be used when, > > e.g., talking about typography or spelling. > > Then how can we take care of fallback mechanism? By removing that particular fallback mechanism from implementations as well as the TUS text! (I'm serious!) This particular fallback mechanism is NOT recommended as it stands. But since its mention is erroneously taken as a recommendation, I'd suggest removing also its mention. That mechanism is as bad as misplacing glyphs for combining marks on the glyph(s) for the follow-on character, if not worse. ("Show invisibles" (for all of the text or a "user" selected run of the text) is an entirely different story.) /Kent K
RE: Suggestions in Unicode Indic FAQ
> > I don't know where you find support for that position in that text. > > Can you please quote? There are no "invalid base consonants" for > > any dependent vowel (for Indic scripts; similarly for any > > other script). > > Actually, there is a mention of displaying combining marks on dotted > circles: I know. But there is no mention (that I have found) of "invalid base characters" or any recommendation for using dotted circles especially for Indic scripts. > I add that this is a good way of displaying a combining mark that has no > base character, i.e. one occurring at the begin of a line or paragraph. No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121). /Kent K
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti <[EMAIL PROTECTED]> wrote: > > I add that this is a good way of displaying a combining mark that has no > base character, i.e. one occurring at the begin of a line or paragraph. > > However, I totally agree with Kent that this funny rendering is *not* a > requirement of the Unicode standard, as Keyur Shroff seems to suggest. It > is just an example of many "several methods [that] are available to deal > with" strange sequences. A sequence should not be treated as "strange" sequence if it has been written intentionally. It may have some contextual meaning. > > > Any combining characters can be placed on any base characters without > > there being any dotted circles displayed. Not only that, but it is also desirable. How can one write a vowel matra both with and without dotted circle in a single document if Unicode recommends to place it only on top of space character? Matra with dotted circle is sometimes useful as in the case of printing/explaining Unicode standard. A user may want to hide dotted circle in the same document in order to explain the actual shape of the matra character, i.e., without dotted circle. Both kinds of rendering behaviour are possible. There should be some mechanism either to turn on or off dotted circle depending on the default behaviour. Also, what is good or bad is also subjective. It may also vary from one script to another. - Keyur
RE: Suggestions in Unicode Indic FAQ
Kent Karlsson wrote: > Keyur Shroff wrote > [...] > > In Indic scripts any sign that appear in text not in > > conjunction with a > > valid consonant base may be rendered with dotted circle as fallback > > mechanism (Section 5.14 "Rendering Nonspacing Marks" > > http://www.unicode.org/uni2book/ch05.pdf). > > I don't know where you find support for that position in that text. > Can you please quote? There are no "invalid base consonants" for > any dependent vowel (for Indic scripts; similarly for any > other script). Actually, there is a mention of displaying combining marks on dotted circles: "Several methods are available to deal with an unknown composed character sequence that is outside of a fixed, renderable set [...]. One method (Show Hidden) indicates the inability to draw the sequence by drawing the base character first and then rendering the nonspacing mark as an individual unit - with the nonspacing mark positioned on a dotted circle." (The Unicode Standard 3.0, page 120 - 5.14 Rendering Nonspacing Marks - Fallback Rendering) I add that this is a good way of displaying a combining mark that has no base character, i.e. one occurring at the begin of a line or paragraph. However, I totally agree with Kent that this funny rendering is *not* a requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is just an example of many "several methods [that] are available to deal with" strange sequences. > > Any system implementing this as > > default behaviour should not be considered buggy. > > Indeed they are. And it should certainly not be default behaviour. In this case, I disagree with Kent: displaying these dotted circles is not mandatory, but certainly not a bug. > Any combining characters can be placed on any base characters without > there being any dotted circles displayed. True. But notice that Kent (against his own opinion) correctly wrote "can", not "must". > [...] _ Marco
Re: Suggestions in Unicode Indic FAQ
To support what Keyur has to say I will add a few more things. Take for instance a "vowel sign" (matra, as we call it here in India): say e (U+093F) is combined with a consonant like ka (U+0915); in that sequence it forms ke. (Please see the first image.) The repositioning of the shape happens automatically. If anyone puts the e matra (U+093F) first and then the consonant ka (U+0915), then they should be highlighted; putting a space can still make them look like ka. So to make this mistake very explicit, we have to put a dotted circle. If a wrong combination is stored, it will create a lot of problems in searching the data as well as in sorting. Please refer to the BIS (Bureau of Indian Standards) ISCII 91 / 88 documentation for this where in -Aditya - Original Message - From: "Keyur Shroff" <[EMAIL PROTECTED]> To: "'Unicode Mailing List'" <[EMAIL PROTECTED]> > --- Marco Cimarosti <[EMAIL PROTECTED]> wrote: > > Keyur Shroff wrote: > > > But sometimes a user may want visual representation of these > > > symbols in two different ways: with dotted circle and > > > without dotted circle. > > > > Why not using a dotted circle character explicitly, when you want to see > > one? > > Note that whenever I mention the word "combining mark" I am really talking > about "vowel signs (matras)" and other modifiers in Indic scripts which is > script dependent. I am sorry if I have confused you with the combining > diacritical marks in the block [U+0300-U+036F] which I really didn't mean. > > Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947] > appearing after any non-consonant character. This sign is generally > attached to the consonants. It has zero advance width with negative left > side bearing in the font. Clearly, since in this case the sign is not > preceded by any consonant base, it has to be rendered using one of the > mechanisms specified in fallback rendering of non-spacing marks.
If we > render it with space, as you said, then we have to insert "space" character > at the time of fallback rendering (which can be taken care in rendering > pipeline) even though space character is not present in backing store of > the application. Now in order to render it with dotted circle if we > introduce the circle in the text before this sign then also the circle is > invalid base for this "Vowel Sign E". As a result, again fallback rendering > will take place with rendering circle and the vowel sign positionally > separate. In this case first dotted circle will appear which will be > followed by vowel sign (matra) on top of space character. > > If you know any other way to solve this problem then please explain. Also > let me know if I have misinterpreted the text written in Unicode standard. > > > > Example of > > > this could be RAsup on top of dotted circle and RAsup on top of space > > > character. Current use of space character to eliminate dotted > > > circle is really painful and may create problems in determining > > > language and syllable boundaries. > > > > Languages or syllable boundaries have nothing to do with this. These > > special > > sequences should *never* be part of any syllable or word in any language: > > they are just a way of showing the shape of a glyph, to be used when, > > e.g., talking about typography or spelling. > > Then how can we take care of fallback mechanism? > > Thanks for taking pain for answering my queries :-) > > - Keyur
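Aditya's concern about searching and sorting can be made concrete with a short sketch. It assumes nothing beyond Python's standard `unicodedata` module; the helper name `found` is made up for this example. The two strings hold the same two characters in opposite orders, and because both characters have canonical combining class 0, Unicode normalization does not reorder them.

```python
import unicodedata

intended = "\u0915\u093f"  # KA followed by VOWEL SIGN I
mistyped = "\u093f\u0915"  # VOWEL SIGN I followed by KA


def found(needle, haystack):
    """Plain substring search after NFC-normalizing both strings."""
    normalized_needle = unicodedata.normalize("NFC", needle)
    normalized_haystack = unicodedata.normalize("NFC", haystack)
    return normalized_needle in normalized_haystack


# The two spellings are different strings, before and after normalization.
print(intended == mistyped)
# Searching for the intended spelling does not find the mistyped one.
print(found(intended, "x" + mistyped + "y"))
# A code-point sort places the two spellings at different positions.
print(sorted([mistyped, intended]) == [intended, mistyped])
```

So if a renderer happens to draw both sequences alike, stored text containing the mistyped order would silently fail to match searches for the intended word, which is the practical problem behind the request to make the error visible.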
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti <[EMAIL PROTECTED]> wrote:
> Keyur Shroff wrote:
> > But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle.
>
> Why not use a dotted circle character explicitly, when you want to see one?

Note that whenever I mention the word "combining mark" I am really talking about "vowel signs (matras)" and other modifiers in Indic scripts, which are script dependent. I am sorry if I have confused you with the combining diacritical marks in the block [U+0300-U+036F], which I really didn't mean.

Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947] appearing after any non-consonant character. This sign is generally attached to the consonants. It has zero advance width with negative left side bearing in the font. Clearly, since in this case the sign is not preceded by any consonant base, it has to be rendered using one of the mechanisms specified in fallback rendering of non-spacing marks. If we render it with space, as you said, then we have to insert a "space" character at the time of fallback rendering (which can be taken care of in the rendering pipeline) even though the space character is not present in the backing store of the application. Now if, in order to render it with a dotted circle, we introduce the circle in the text before this sign, then the circle is also an invalid base for this "Vowel Sign E". As a result, fallback rendering will again take place, rendering the circle and the vowel sign positionally separate. In this case the dotted circle will appear first, followed by the vowel sign (matra) on top of a space character.

If you know any other way to solve this problem then please explain. Also let me know if I have misinterpreted the text written in the Unicode standard.

> > Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.
>
> Languages or syllable boundaries have nothing to do with this. These special sequences should *never* be part of any syllable or word in any language: they are just a way of showing the shape of a glyph, to be used when, e.g., talking about typography or spelling.

Then how can we take care of the fallback mechanism?

Thanks for taking the pains to answer my queries :-)

- Keyur
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:
> Kent Karlsson <[EMAIL PROTECTED]> wrote:
> > A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a "show invisibles" mode, or a "character chart" display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!
>
> In Indic scripts any sign that appears in text not in conjunction with a valid consonant base may be rendered with dotted circle as fallback mechanism (Section 5.14 "Rendering Nonspacing Marks" http://www.unicode.org/uni2book/ch05.pdf).

I don't know where you find support for that position in that text. Can you please quote? There are no "invalid base consonants" for any dependent vowel (for Indic scripts; similarly for any other script).

> Any system implementing this as default behaviour should not be considered buggy.

Indeed they are. And it should certainly not be default behaviour. Any combining characters can be placed on any base characters without there being any dotted circles displayed. In particular, any combining Devanagari characters (note: including, in principle, several dependent vowels, even if that does not occur in any (existing) orthography) can be placed on any Devanagari base character as well as SPACE (and other punctuation). What should result is a reasonable composed glyph, no dotted circle in sight (except in show invisibles mode, which I'm not discussing here). Spelling errors should be indicated otherwise, since they are of a very different nature.

> For scripts other than Indic scripts, it may be useful to render the nonspacing mark without dotted circle because even after rendering it as an overlap glyph, the result is recognizable. However, for Indic scripts use of dotted circle is very useful as default behaviour since it gives immediate feedback to the user that there may be some defective combining character in the text. Most of the time such errors are unintentional rather than intentional.

No combination of base + combining characters is defective per se. Even if the scripts are different within the combining sequence. (Note also that the 0300 block of combining characters is script independent.) Spelling errors are something else entirely.

> Unicode has provision to remove this dotted circle.

I'm not sure what you are talking about here.

> Space character is used to give indication to fallback mechanism that no dotted circle should be used while rendering this stand alone sign which is normally attached to other characters. This is useful when sometimes user want to display the sign without any circle. Also, with this scheme it is possible to show some combining marks with dotted circle and some without dotted circle.

The fallback mechanisms talked about in section 5.14 of TUS 3.0 are the use of less than ideal (typographically!) mechanisms to display an *approximation* of the glyph(s) for the combining sequence. An exceedingly bad approximation is displaying a dotted circle as a fake base (again: disregarding "show invisibles", or "chart" modes, which, however, should be consistent and show a dotted circle fake base for ALL combining characters occurring in the text). The use of this exceedingly bad approximation (in normal display mode) does in no way indicate that the combining sequence is at all defective. It may indicate that the display engine (or the font) is defective...

/Kent K
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:
> But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle.

Why not use a dotted circle character explicitly, when you want to see one?

> Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.

Languages or syllable boundaries have nothing to do with this. These special sequences should *never* be part of any syllable or word in any language: they are just a way of showing the shape of a glyph, to be used when, e.g., talking about typography or spelling.

> The main problem with space character is that unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of other important script "Latin".

Plain wrong! White-space characters and punctuation do not belong to any script: characters such as " ", "!" and "?" are used for many scripts and languages. Even the "danda" punctuation, which is in the Devanagari range, does not belong to Devanagari: it is also used for other Indic scripts.

> Use of INV character in one shot can solve all these problems. We can put it in "consonant" class which can help text processing applications. [...]

How can calling a "consonant" something which has nothing to do with consonants help anybody do anything?

_ Marco
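The two presentations discussed in this exchange can be written down as plain character sequences: the sign after an explicit DOTTED CIRCLE character (U+25CC), or after SPACE. The sketch below only builds the sequences and lists the character names; the helper names are made up for illustration, and how a given font or display engine draws either sequence is outside what this code can show.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"   # the actual character DOTTED CIRCLE
VOWEL_SIGN_E = "\u0947"    # DEVANAGARI VOWEL SIGN E

def show_on_dotted_circle(mark):
    """Sequence for showing a mark on an explicit dotted circle."""
    return DOTTED_CIRCLE + mark

def show_on_space(mark):
    """Sequence for showing a mark in isolation, on a space."""
    return " " + mark

def names(text):
    """Character names, to make the stored sequence visible."""
    return [unicodedata.name(ch) for ch in text]

print(names(show_on_dotted_circle(VOWEL_SIGN_E)))
print(names(show_on_space(VOWEL_SIGN_E)))
```

With the circle stored as a character, whether it appears is a property of the text and not a decision made by the display engine.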
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson <[EMAIL PROTECTED]> wrote:
> A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a "show invisibles" mode, or a "character chart" display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!

In Indic scripts any sign that appears in text not in conjunction with a valid consonant base may be rendered with a dotted circle as a fallback mechanism (Section 5.14 "Rendering Nonspacing Marks" http://www.unicode.org/uni2book/ch05.pdf). Any system implementing this as default behaviour should not be considered buggy. What the default rendering behaviour should be (i.e., show hidden or not) may vary from one script to another and also depends on implementation policy.

For scripts other than Indic scripts, it may be useful to render the nonspacing mark without a dotted circle because even after rendering it as an overlap glyph, the result is recognizable. However, for Indic scripts use of the dotted circle is very useful as default behaviour since it gives immediate feedback to the user that there may be some defective combining character in the text. Most of the time such errors are unintentional rather than intentional.

Unicode has a provision to remove this dotted circle. The space character is used to indicate to the fallback mechanism that no dotted circle should be used while rendering this stand-alone sign, which is normally attached to other characters. This is useful when the user sometimes wants to display the sign without any circle. Also, with this scheme it is possible to show some combining marks with a dotted circle and some without.

- Keyur
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti <[EMAIL PROTECTED]> wrote:
> Why not representing INV with a double ZWJ? E.g.:
>
> ISCII              Unicode
> KA halant INV      KA virama ZWJ ZWJ
> RA halant INV      RA virama ZWJ ZWJ (i.e., repha)
> INV halant RA      ZWJ ZWJ virama RA (RAsub)
>
> This has the advantage that the most common sequences will work OK also on old display engines implemented *before* the double-ZWJ convention is introduced.
>
> E.g., sequence "KA virama ZWJ ZWJ" works well also on an old engine, for the simple reason that the first ZWJ is enough to do the work, and the second ZWJ is invisible.
>
> Of course, an old engine will still display a [...] for <[...] virama ZWJ ZWJ>, but that is not worse than displaying [...] followed by a white box, which is what would happen with your new INV character.

Certainly. This looks more promising because even RAsub has two alternate forms. One form is used with consonants KA, KHA, GHA, etc. and the other form is used with consonants TTA, TTHA, DDA, DDHA, etc. With your ZWJ based scheme we can insert as many ZWJ as we wish to produce all possible alternate forms!

But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle. Example of this could be RAsup on top of dotted circle and RAsup on top of space character. Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.

The main problem with space character is that unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of other important script "Latin". Finally it may affect all important text processing which uses Unicode characters to find language boundaries. Use of INV character in one shot can solve all these problems. We can put it in "consonant" class which can help text processing applications.

Moreover, it will be difficult to provide upward compatibility in all possible cases all the time, even though it is desirable. Implementations of Unicode will need to be upgraded with every introduction of new glyphs or rules. Otherwise applications have to explicitly declare the version of Unicode used in the implementation.

- Keyur
RE: Suggestions in Unicode Indic FAQ
> The [new] INV character in Unicode can also be used for displaying dependent vowel matras without dotted circle.

A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a "show invisibles" mode, or a "character chart" display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!

/Kent K

(B.t.w. the chart dotted circle glyph for combining characters looks a bit different from the (normal) glyph for DOTTED CIRCLE.)
RE: Suggestions in Unicode Indic FAQ
Keyur Shroff wrote:
> In the FAQ
> http://www.unicode.org/faq/indic.html#16
>
> It is mentioned that following are equivalent
>
> ISCII              Unicode
> KA halant INV      KA virama ZWJ
> RA halant INV      RAsup (i.e., repha)

The last line is really bizarre! I would agree that it is plain wrong... What is supposed to appear in column "Unicode" is the Unicode *encoding* equivalent to the sequence in the "ISCII" column. But "RAsup (i.e., repha)" is the description of a *glyph*.

> In fact there is no way in Unicode to produce RAsup directly, i.e., without using base consonant. [...]

I agree. This issue has been raised several times, and several viable solutions have been proposed, but I don't remember Unicode "officials" ever even acknowledging the problem. But probably this has been noted down and discussed. I hope to see an official solution in TUS 4.0.

> SUGGESTION-3:
>
> Use of SPACE character as consonant may create problem for state machine which finds language/syllable boundary. In fact we need a codepoint for one invisible consonant (similar to INV in ISCII) in Unicode which can solve this problem with Unicode.
>
> After inclusion of INV character the following can be recommended.
>
> ISCII              Unicode
> KA halant INV      KA virama INV
> RA halant INV      RA virama INV (i.e., repha)
> INV halant RA      INV virama RA (RAsub)

Why not representing INV with a double ZWJ? E.g.:

ISCII              Unicode
KA halant INV      KA virama ZWJ ZWJ
RA halant INV      RA virama ZWJ ZWJ (i.e., repha)
INV halant RA      ZWJ ZWJ virama RA (RAsub)

This has the advantage that the most common sequences will work OK also on old display engines implemented *before* the double-ZWJ convention is introduced.

E.g., sequence "KA virama ZWJ ZWJ" works well also on an old engine, for the simple reason that the first ZWJ is enough to do the work, and the second ZWJ is invisible.

Of course, an old engine will still display a [...] for [...], but that is not worse than displaying [...] followed by a white box, which is what would happen with your new INV character.

_ Marco
Suggestions in Unicode Indic FAQ
Hello,

There are a few discrepancies in the Indic FAQ. Though they were reported earlier by Andy White, I see they are still there in the FAQ. I also clarified them, but by mistake I sent the mail to Yahoo groups, where this mailing list is archived, and hence my mail never reached this mailing list. You can refer to the link http://groups.yahoo.com/group/unicode/message/16352

The following are the suggestions.

SUGGESTION-1:

In the FAQ http://www.unicode.org/faq/indic.html#2 it is mentioned that

ISCII:             Unicode:
Halant + Halant    Halant + ZWJ

produce similar results. This is wrong. In ISCII, Halant+Halant is known as explicit halant and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ should be replaced by ZWNJ.

SUGGESTION-2:

In the FAQ http://www.unicode.org/faq/indic.html#16 it is mentioned that the following are equivalent

ISCII              Unicode
KA halant INV      KA virama ZWJ
RA halant INV      RAsup (i.e., repha)

In fact there is no way in Unicode to produce RAsup directly, i.e., without using a base consonant. The sequence "RA virama ZWJ" will actually produce half-RA (or eyelash-RA), which is used commonly in Marathi. Eyelash-RA can also be produced with the sequence "RA Halant Nukta" both in ISCII (known as soft halant) and Unicode (just for conformance with ISCII).

Also, in the same answer the following sequence is recommended.

ISCII              Unicode
INV halant RA      SPACE virama RA (RAsub)

SUGGESTION-3:

Use of the SPACE character as a consonant may create problems for a state machine which finds language/syllable boundaries. In fact we need a codepoint for one invisible consonant (similar to INV in ISCII) in Unicode, which can solve this problem.

After inclusion of the INV character the following can be recommended.

ISCII              Unicode
KA halant INV      KA virama INV
RA halant INV      RA virama INV (i.e., repha)
INV halant RA      INV virama RA (RAsub)

The INV character in Unicode can also be used for displaying dependent vowel matras without dotted circle.

Unicode
INV Vowel sign O
INV Vowel sign AI
etc.

This can replace the existing definition of "SPACE" as an invisible consonant depending on the context.

Any other pointers!!?

- Keyur
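The sequences that exist in Unicode today and are named in these suggestions can be spelled out as code points, which makes the ZWJ versus ZWNJ distinction easy to see. This is just an illustrative sketch: the variable and function names are made up here, the proposed INV character is left out because it has no code point, and nothing below says how a particular font will shape the sequences.

```python
KA, RA = "\u0915", "\u0930"       # DEVANAGARI LETTER KA, RA
VIRAMA = "\u094D"                 # DEVANAGARI SIGN VIRAMA (halant)
ZWJ, ZWNJ = "\u200D", "\u200C"    # ZERO WIDTH JOINER / NON-JOINER

def codepoints(text):
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

half_ka = KA + VIRAMA + ZWJ              # "KA virama ZWJ" from SUGGESTION-2
explicit_halant_ka = KA + VIRAMA + ZWNJ  # Halant+ZWNJ, the form argued for in SUGGESTION-1
eyelash_ra = RA + VIRAMA + ZWJ           # "RA virama ZWJ", eyelash-RA per SUGGESTION-2

for label, seq in [("half KA", half_ka),
                   ("explicit halant", explicit_halant_ka),
                   ("eyelash RA", eyelash_ra)]:
    print(f"{label:16} {codepoints(seq)}")
```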
Re: Suggestions for next print edition
> You can always search the big Unihan.txt file on the kJapaneseKun and kJapaneseOn fields, which provide whatever information we have on pronunciation of the characters in Japanese.
>
> If you are just stuck looking up stuff because it isn't marked up for Japanese, try getting Sanseido's Unicode Kanji Information Dictionary, which has the first 20,902 kanji in Unicode (the most useful set) all marked up with all the Japanese pronunciations (where they have any).

The first suggestion is useless. The file is too freaking big so maybe I'll go with the second. Thanks.
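File size need not be an obstacle if the file is read one line at a time and only the wanted field is kept. The sketch below assumes the tab-separated layout "U+XXXX<TAB>field<TAB>value" with "#" comment lines; the function name unihan_field and the two sample lines are illustrative only and are not copied from the real file.

```python
def unihan_field(lines, field):
    """Yield (code point, value) for every entry of one field.

    `lines` can be an open file object, so the file is never loaded whole."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        code, name, value = line.rstrip("\n").split("\t", 2)
        if name == field:
            yield code, value

# Illustrative sample in the assumed layout (not real file contents).
SAMPLE = [
    "# sample data\n",
    "U+4E71\tkJapaneseOn\tRAN\n",
    "U+4E71\tkJapaneseKun\tMIDARERU\n",
]

for code, value in unihan_field(SAMPLE, "kJapaneseOn"):
    print(code, value)
```

For the real file, something like `with open("Unihan.txt", encoding="utf-8") as f:` followed by the same loop over `unihan_field(f, "kJapaneseOn")` should work, assuming that file name and layout.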
Re: Suggestions for next print edition
11-digit boy suggested:

> 1. Unicode points are NUMBERS. Numbers can be written in ANY base. Knowing decimal values of codepoints is sometimes useful, so please print them in the next edition of the Unicode book.

The UTC has already decided not to do that, as it clutters up the charts. Hexadecimal is far more useful to most implementers of the standard. Hex/decimal conversion is only as far away as the little calculator accessories available on any OS these days.

> 2. There was a Shift-JIS index for kanji. I don't know much about kanji, but it seems to me that they are arranged in a-i-u-e-o order of on'yomi. Why not print little hiragana letters at the top to aid people searching for a kanji?

Again, this is not the function of the charts. The radical/stroke index is available for general lookup, but we cannot provide phonetic indices, too, for Japanese, Chinese, Korean, and Vietnamese lookup.

You can always search the big Unihan.txt file on the kJapaneseKun and kJapaneseOn fields, which provide whatever information we have on pronunciation of the characters in Japanese.

If you are just stuck looking up stuff because it isn't marked up for Japanese, try getting Sanseido's Unicode Kanji Information Dictionary, which has the first 20,902 kanji in Unicode (the most useful set) all marked up with all the Japanese pronunciations (where they have any). I suspect that Sanseido will soon be updating that dictionary to include Vertical Extension A, as well.

--Ken
Suggestions for next print edition
1. Unicode points are NUMBERS. Numbers can be written in ANY base. Knowing decimal values of codepoints is sometimes useful, so please print them in the next edition of the Unicode book.

2. There was a Shift-JIS index for kanji. I don't know much about kanji, but it seems to me that they are arranged in a-i-u-e-o order of on'yomi. Why not print little hiragana letters at the top to aid people searching for a kanji?

Remember how I could not find the "ran" of "randamu" before? Let's see this time... Aha! There it is! I knew it was somewhere between "mo(kuyoubi)" and "(fu)ro". Better than stroke / radical, I wonder?

* Disclaimer: From what I hear, the Japanese do NOT write "randamu" as U+4E71 U+3060 U+3080. They use U+30E9 U+30F3 U+30C0 U+30E0. But the first is cuter. ^_^
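For anyone who wants the decimal values without a printed column, the conversion is a one-liner in most languages. The following is a small sketch using the code points from the disclaimer above; the helper names are made up for illustration.

```python
def to_decimal(hex_code):
    """Convert a code point written in hex (e.g. '4E71') to an integer."""
    return int(hex_code, 16)

def to_text(hex_codes):
    """Turn a list of hex code points into the string they spell."""
    return "".join(chr(int(h, 16)) for h in hex_codes)

kanji_version = ["4E71", "3060", "3080"]
katakana_version = ["30E9", "30F3", "30C0", "30E0"]

for h in kanji_version + katakana_version:
    print(f"U+{h} = {to_decimal(h)} decimal")

print(to_text(kanji_version), to_text(katakana_version))
```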
Re: unicode + oracle query....... (suggestions needed...)
Sandeep,

Can you explain exactly what you are doing to get the data from ASP into the Oracle database? Perhaps post the ASP code?

Like most scripting languages, VBScript and JScript both support UCS-2, and it is usually the Oracle ODBC or OLE DB driver that has the job of converting the text from UCS-2 to UTF-8. I would wonder if what you are seeing is some type of "double conversion"?

So the things that would be interesting to know:

1) The data access method to Oracle
2) Version of the driver being used
3) A sample of the code/script being used

michka

a new book on internationalization in VB at http://www.i18nWithVB.com/

- Original Message -
From: "Sandeep Krishna" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, September 27, 2000 3:12 AM
Subject: Re: unicode + oracle query... (suggestions needed...)

> i mean all the entries at both Web server machine's registry and Oracle Database server machine's registry or either one.
> in our setup... my machine is the Web Server and the Oracle Server is a separate machine
> please clarify
>
> regards,
>
> Sandeep
>
> - Original Message -
> From: Kedar Moghe <[EMAIL PROTECTED]>
> To: Unicode List <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 27, 2000 3:21 PM
> Subject: RE: unicode + oracle query... (suggestions needed...)
>
> Sandeep,
>
> I think you need to change at following three places,
> HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG
> HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG
> HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG
>
> Best of luck
>
> Regards,
>
> Kedar
>
> -Original Message-
> From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 27, 2000 5:45 PM
> To: Carl W. Brown; Bob Verbrugge; Kedar Moghe
> Cc: [EMAIL PROTECTED]
> Subject: Re: unicode + oracle query... (suggestions needed...)
>
> hi...
>
> i m thoroughly confused.
> actually the registry entries for oracle shows 3 entries for NLS_LANG.
> and that too at the WEB SERVER end and at the DATABASE SERVER end. > so that makes tooo many combinations... > > can someone indicate which of these NLS_LANG entries have to be set as > "AMERICAN_AMERICA.UTF8" and if some of them doesnt need this...what exactly > should be there > > pls suggest necessary messures.. > > regards, > > Sandeep > > > > > - Original Message - > From: Bob Verbrugge <[EMAIL PROTECTED]> > To: Sandeep Krishna <[EMAIL PROTECTED]> > Sent: Wednesday, September 27, 2000 1:30 PM > Subject: Re: unicode + oracle query... (suggestions needed...) > > > Sandeep, > > You probably need to change the NLS_LANG Oracle setting in the registry. > Look under > HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE for this setting and change the > character set part to UTF8. > > Bob. > > > - Original Message - > From: "Sandeep Krishna" <[EMAIL PROTECTED]> > To: "Unicode List" <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Sent: Wednesday, September 27, 2000 9:16 AM > Subject: Re: unicode + oracle query... (suggestions needed...) > > > > hi, > > > > thankx for responding. > > > > but when u mention change in the registry.. > > could u elaborate about where exactly in reg and what changes are required > > > > my registry setting shows NLS = American_English.UTF8. > > > > is this the setting u indicated..or something to so with the charset entry > : > > autodetect and autodetect_all (in classid...Mime>database>charset..) > > > > pls do elaborate > > > > regards, > > > > Sandeep > > > > > > > > - Original Message - > > From: Kedar Moghe <[EMAIL PROTECTED]> > > To: 'Sandeep Krishna' <[EMAIL PROTECTED]> > > Sent: Wednesday, September 27, 2000 11:20 AM > > Subject: RE: unicode + oracle query... (suggestions needed...) > > > > > > Sandeep, > > > > I think you need to set the registry charset to UTF8 where database is > > installed. We were was getting the same problem when we use to send UTF-8 > > strings to oracle database after conversion from Shift-JIS to UTF8. 
That > > time also the byte sequence of the retrieved string is getting changed and > > some of the bytes are getting replaced with BF. > > > > Regards, > > > > Kedar > > > > -Original Message- > > From: Sandeep Krishna [mailto:[E
RE: unicode + oracle query....... (suggestions needed...)
Only registry entries on the database machine. Not any other entry. Regsrds, Kedar -Original Message- From: Sandeep Krishna [mailto:[EMAIL PROTECTED]] Sent: Wednesday, September 27, 2000 6:21 PM To: Kedar Moghe Cc: [EMAIL PROTECTED] Subject: Re: unicode + oracle query... (suggestions needed...) i mean all the entries at both Web server machine's registry and Oracle Database server machine's registry or either one. in our setup... my machine is the Web Server and the Oracle Server is a separate machine please clarify regards, Sandeep - Original Message - From: Kedar Moghe <[EMAIL PROTECTED]> To: Unicode List <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Wednesday, September 27, 2000 3:21 PM Subject: RE: unicode + oracle query... (suggestions needed...) Sandeep, I think you need to change at following three places, HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG Best of luck Regards, Kedar -Original Message- From: Sandeep Krishna [mailto:[EMAIL PROTECTED]] Sent: Wednesday, September 27, 2000 5:45 PM To: Carl W. Brown; Bob Verbrugge; Kedar Moghe Cc: [EMAIL PROTECTED] Subject: Re: unicode + oracle query... (suggestions needed...) hi... i m thoroughly confused. actually the registry entries for oracle shows 3 entries for NLS_LANG. and that too at the WEB SERVER end and at the DATABASE SERVER end. so that makes tooo many combinations... can someone indicate which of these NLS_LANG entries have to be set as "AMERICAN_AMERICA.UTF8" and if some of them doesnt need this...what exactly should be there pls suggest necessary messures.. regards, Sandeep - Original Message - From: Bob Verbrugge <[EMAIL PROTECTED]> To: Sandeep Krishna <[EMAIL PROTECTED]> Sent: Wednesday, September 27, 2000 1:30 PM Subject: Re: unicode + oracle query... (suggestions needed...) Sandeep, You probably need to change the NLS_LANG Oracle setting in the registry. 
Look under HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE for this setting and change the character set part to UTF8. Bob. - Original Message - From: "Sandeep Krishna" <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Wednesday, September 27, 2000 9:16 AM Subject: Re: unicode + oracle query... (suggestions needed...) > hi, > > thankx for responding. > > but when u mention change in the registry.. > could u elaborate about where exactly in reg and what changes are required > > my registry setting shows NLS = American_English.UTF8. > > is this the setting u indicated..or something to so with the charset entry : > autodetect and autodetect_all (in classid...Mime>database>charset..) > > pls do elaborate > > regards, > > Sandeep > > > > - Original Message - > From: Kedar Moghe <[EMAIL PROTECTED]> > To: 'Sandeep Krishna' <[EMAIL PROTECTED]> > Sent: Wednesday, September 27, 2000 11:20 AM > Subject: RE: unicode + oracle query... (suggestions needed...) > > > Sandeep, > > I think you need to set the registry charset to UTF8 where database is > installed. We were was getting the same problem when we use to send UTF-8 > strings to oracle database after conversion from Shift-JIS to UTF8. That > time also the byte sequence of the retrieved string is getting changed and > some of the bytes are getting replaced with BF. > > Regards, > > Kedar > > -Original Message- > From: Sandeep Krishna [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, September 27, 2000 11:36 AM > To: Unicode List > Subject: unicode + oracle query... (suggestions needed...) > > > hi > > actually i have been trying to use ASPs (UTF-8 encoding..) to write unicode > cahracters to an Oracle DB table (varchar2 field)... and then retrieve them > back.. > (i used UTF-8 encoding for both writing to the database and also for > retriving and displaying..) > > there were some amazing observations... > > * each unicode character was taking 7 bytes in the database. (instead of > expected 2 or 3...) 
> * some unicode characters(or rather code points.) like' F95F' when encoded > in UTF-8 was being encoded as EF A5 BF, when it should have been encoded as > EF A5 9F.. in fact many unicode charcters whose encoded form had to had a > byte in the range (80..9F) were being somehow changed to BF ... thus > resulting in incorrect retrieval > > I was unable to find the reasons for these strange occurrences > Pls suggest what could be the causes for these.. > > regards, > > Sandeep. > > > > > *** > SANDEEP KRISHNA > Member Technical Staff (Priceline.com) > H.C.L. Technologies Limited > A-1 C&D, Sector -16, NOIDA, UP, India. > Ph: 91-11-91-4516321 (extn. 1062) > Fax: 91-11-91-4510713, 4510226 > E-Mail : [EMAIL PROTECTED] > <mailto:[EMAIL PROTECTED]> > > >
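The byte values quoted in this report can be checked directly, and the reported damage can be reproduced with a deliberately lossy step. This is a sketch under a stated assumption, not a diagnosis: it supposes that somewhere on the path every byte in the range 0x80..0x9F is replaced by 0xBF, which is what the observation describes; the helper names are made up, and nothing here calls or models any Oracle or driver API.

```python
def utf8_hex(ch):
    """UTF-8 bytes of a character, as spaced uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

def damage_c1_range(data, replacement=0xBF):
    """Assumed lossy step: replace every byte in 0x80..0x9F with 0xBF."""
    return bytes(replacement if 0x80 <= b <= 0x9F else b for b in data)

good = "\uF95F".encode("utf-8")
bad = damage_c1_range(good)

print(utf8_hex("\uF95F"))                    # correct encoding of U+F95F
print(" ".join(f"{b:02X}" for b in bad))     # bytes after the lossy step
print(f"U+{ord(bad.decode('utf-8')):04X}")   # what those bytes decode to
```

The damaged bytes are still well-formed UTF-8, so they decode without error to a different character, which would explain incorrect retrieval with no visible failure.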
Re: unicode + oracle query....... (suggestions needed...)
I mean: do all the NLS_LANG entries have to be changed in the registry of both the web server machine and the Oracle database server machine, or on only one of them? In our setup, my machine is the web server and the Oracle server is a separate machine. Please clarify.

Regards,
Sandeep

----- Original Message -----
From: Kedar Moghe <[EMAIL PROTECTED]>
To: Unicode List <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, September 27, 2000 3:21 PM
Subject: RE: unicode + oracle query... (suggestions needed...)

Sandeep,

I think you need to change it in the following three places:

HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG
HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG
HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG

Best of luck.

Regards,
Kedar
RE: unicode + oracle query....... (suggestions needed...)
Sandeep,

I think you need to change it in the following three places:

HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG
HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG
HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG

Best of luck.

Regards,
Kedar

-----Original Message-----
From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 27, 2000 5:45 PM
To: Carl W. Brown; Bob Verbrugge; Kedar Moghe
Cc: [EMAIL PROTECTED]
Subject: Re: unicode + oracle query... (suggestions needed...)

Hi,

I am thoroughly confused. The registry shows three Oracle entries for NLS_LANG, and that is the case on both the web server and the database server, which makes for too many combinations. Can someone indicate which of these NLS_LANG entries have to be set to "AMERICAN_AMERICA.UTF8" and, if some of them do not need this, what exactly they should contain? Please suggest the necessary measures.

Regards,
Sandeep
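A minimal sketch of how the three entries named above could be checked once their values have been read out (for example with regedit). The paths are copied as written in the message; the values in `SNAPSHOT` are invented for illustration, and `charset_of` and `entries_not_utf8` are hypothetical helper names, not part of any Oracle tool. It assumes only the usual LANGUAGE_TERRITORY.CHARSET layout of NLS_LANG.

```python
# Hypothetical snapshot of the three registry values; the values are
# made up for this example and must be replaced with the real ones.
SNAPSHOT = {
    r"HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG": "AMERICAN_AMERICA.UTF8",
    r"HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG": "AMERICAN_AMERICA.WE8ISO8859P1",
    r"HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG": "AMERICAN_AMERICA.UTF8",
}


def charset_of(nls_lang):
    """Return the character set part of LANGUAGE_TERRITORY.CHARSET, or None."""
    _, dot, charset = nls_lang.partition(".")
    return charset.upper() if dot and charset else None


def entries_not_utf8(snapshot):
    """List the registry paths whose NLS_LANG character set is not UTF8."""
    return sorted(path for path, value in snapshot.items()
                  if charset_of(value) != "UTF8")


for path in entries_not_utf8(SNAPSHOT):
    print("needs attention:", path)
```

Running the same check against a snapshot from each machine gives a short list of entries that differ, which may be easier to reason about than comparing every combination by eye.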
Re: unicode + oracle query....... (suggestions needed...)
Hi,

I am thoroughly confused. The registry shows three Oracle entries for NLS_LANG, and that is the case on both the web server and the database server, which makes for too many combinations. Can someone indicate which of these NLS_LANG entries have to be set to "AMERICAN_AMERICA.UTF8" and, if some of them do not need this, what exactly they should contain? Please suggest the necessary measures.

Regards,
Sandeep

----- Original Message -----
From: Bob Verbrugge <[EMAIL PROTECTED]>
To: Sandeep Krishna <[EMAIL PROTECTED]>
Sent: Wednesday, September 27, 2000 1:30 PM
Subject: Re: unicode + oracle query... (suggestions needed...)

Sandeep,

You probably need to change the NLS_LANG Oracle setting in the registry. Look under HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE for this setting and change the character set part to UTF8.

Bob.

----- Original Message -----
From: "Sandeep Krishna" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, September 27, 2000 9:16 AM
Subject: Re: unicode + oracle query... (suggestions needed...)

> Hi,
>
> Thanks for responding.
>
> But when you mention a change in the registry, could you elaborate on where exactly in the registry it is and what changes are required?
>
> My registry setting shows NLS = American_English.UTF8.
>
> Is this the setting you indicated, or is it something to do with the charset entries autodetect and autodetect_all (in classid...Mime>database>charset..)?
>
> Please do elaborate.
>
> Regards,
> Sandeep
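Bob's advice to "change the character set part to UTF8" can be sketched as a small string operation. This is only an illustration of which part of the value changes; `with_charset` is a hypothetical helper, and it assumes the LANGUAGE_TERRITORY.CHARSET layout of NLS_LANG. It does not touch the registry.

```python
def with_charset(nls_lang, charset="UTF8"):
    """Return nls_lang with its character set part replaced.

    Everything before the first dot (language and territory) is kept;
    a value without a dot simply gets the character set appended.
    """
    base, _, _ = nls_lang.partition(".")
    return f"{base}.{charset}"


print(with_charset("AMERICAN_AMERICA.WE8ISO8859P1"))  # AMERICAN_AMERICA.UTF8
```

The language and territory parts are left alone because only the character set part is mentioned in the advice above.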
Re: unicode + oracle query....... (suggestions needed...)
Sandeep Krishna wrote:

> * Some Unicode characters (or rather code points), such as F95F, were being
> encoded in UTF-8 as EF A5 BF when they should have been encoded as EF A5 9F.
> In fact, many Unicode characters whose encoded form should have contained a
> byte in the range 80..9F had that byte somehow changed to BF, resulting in
> incorrect retrieval.

Oops, it seems that this particular version of Oracle is only 7.5-bit clean ...

I hope they fix it soon; otherwise you need UTF-7d5 (unofficial) as a workaround.

--J"org Knappen
RE: unicode + oracle query....... (suggestions needed...)
Sandeep,

What version of Oracle? What API?

Carl

-----Original Message-----
From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, September 26, 2000 8:36 PM
To: Unicode List
Subject: unicode + oracle query... (suggestions needed...)

Hi,

I have been trying to use ASP pages (UTF-8 encoding) to write Unicode characters to an Oracle DB table (VARCHAR2 field) and then retrieve them. (I used UTF-8 encoding both for writing to the database and for retrieving and displaying.)

There were some surprising observations:

* Each Unicode character was taking 7 bytes in the database (instead of the expected 2 or 3).
* Some Unicode characters (or rather code points), such as F95F, were being encoded in UTF-8 as EF A5 BF when they should have been encoded as EF A5 9F. In fact, many Unicode characters whose encoded form should have contained a byte in the range 80..9F had that byte somehow changed to BF, resulting in incorrect retrieval.

I was unable to find the reasons for these strange occurrences. Please suggest what could be causing them.

Regards,
Sandeep

***
SANDEEP KRISHNA
Member Technical Staff (Priceline.com)
H.C.L. Technologies Limited
A-1 C&D, Sector -16, NOIDA, UP, India.
Ph: 91-11-91-4516321 (extn. 1062)
Fax: 91-11-91-4510713, 4510226
E-Mail : [EMAIL PROTECTED]
Re: unicode + oracle query....... (suggestions needed...)
Hi,

Thanks for responding.

But when you mention a change in the registry, could you elaborate on where exactly in the registry it is and what changes are required?

My registry setting shows NLS = American_English.UTF8.

Is this the setting you indicated, or is it something to do with the charset entries autodetect and autodetect_all (in classid...Mime>database>charset..)?

Please do elaborate.

Regards,
Sandeep

----- Original Message -----
From: Kedar Moghe <[EMAIL PROTECTED]>
To: 'Sandeep Krishna' <[EMAIL PROTECTED]>
Sent: Wednesday, September 27, 2000 11:20 AM
Subject: RE: unicode + oracle query... (suggestions needed...)

Sandeep,

I think you need to set the registry charset to UTF8 on the machine where the database is installed. We were getting the same problem when we used to send UTF-8 strings to an Oracle database after conversion from Shift-JIS to UTF-8. At that time too, the byte sequence of the retrieved string was changed, and some of the bytes were replaced with BF.

Regards,
Kedar
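To make Kedar's Shift-JIS case concrete, here is a small sketch that converts Shift-JIS bytes to UTF-8 and flags output containing a byte in the 80..9F range, the range reported in this thread as being turned into BF. The function names and the sample text are assumptions chosen for illustration; the conversion itself uses Python's standard `shift_jis` and `utf-8` codecs, not anything Oracle-specific.

```python
def sjis_to_utf8(data):
    """Convert Shift-JIS encoded bytes to UTF-8 encoded bytes."""
    return data.decode("shift_jis").encode("utf-8")


def has_byte_in_80_9f(utf8_bytes):
    """True if any byte falls in the range reported as being changed to BF."""
    return any(0x80 <= b <= 0x9F for b in utf8_bytes)


# Sample text chosen for illustration: U+65E5 U+672C U+8A9E.
sample_sjis = "\u65e5\u672c\u8a9e".encode("shift_jis")
sample_utf8 = sjis_to_utf8(sample_sjis)

print(" ".join(f"{b:02X}" for b in sample_utf8))
print("at risk:", has_byte_in_80_9f(sample_utf8))
```

A check like this on the bytes sent, compared with the bytes read back, shows whether the damage happens before or after the data reaches the database.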
unicode + oracle query....... (suggestions needed...)
Hi,

I have been trying to use ASP pages (UTF-8 encoding) to write Unicode characters to an Oracle DB table (VARCHAR2 field) and then retrieve them. (I used UTF-8 encoding both for writing to the database and for retrieving and displaying.)

There were some surprising observations:

* Each Unicode character was taking 7 bytes in the database (instead of the expected 2 or 3).
* Some Unicode characters (or rather code points), such as F95F, were being encoded in UTF-8 as EF A5 BF when they should have been encoded as EF A5 9F. In fact, many Unicode characters whose encoded form should have contained a byte in the range 80..9F had that byte somehow changed to BF, resulting in incorrect retrieval.

I was unable to find the reasons for these strange occurrences. Please suggest what could be causing them.

Regards,
Sandeep

***
SANDEEP KRISHNA
Member Technical Staff (Priceline.com)
H.C.L. Technologies Limited
A-1 C&D, Sector -16, NOIDA, UP, India.
Ph: 91-11-91-4516321 (extn. 1062)
Fax: 91-11-91-4510713, 4510226
E-Mail : [EMAIL PROTECTED]
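The second observation can be reproduced outside the database. The sketch below encodes U+F95F to UTF-8 and then imitates the reported symptom by replacing every byte in 80..9F with BF. `simulate_reported_loss` is a hypothetical name and only models what was observed; it is not a description of what Oracle does internally. One possible reading, which the thread does not confirm, is that the bytes pass through a single-byte character set conversion: BF is the inverted question mark in ISO 8859-1, a character sometimes used as a substitute for unconvertible input.

```python
def hex_bytes(data):
    """Format bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)


def simulate_reported_loss(utf8_bytes):
    """Imitate the symptom: every byte in 80..9F comes back as BF."""
    return bytes(0xBF if 0x80 <= b <= 0x9F else b for b in utf8_bytes)


sent = chr(0xF95F).encode("utf-8")
received = simulate_reported_loss(sent)

print(hex_bytes(sent))        # EF A5 9F
print(hex_bytes(received))    # EF A5 BF
# The damaged sequence is still well-formed UTF-8, but for another code point.
print(f"U+{ord(received.decode('utf-8')):04X}")  # U+F97F
```

Because the damaged bytes still decode cleanly, the error shows up as a different character rather than as a decoding failure, which matches the "incorrect retrieval" described above.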