Re: Suggestions?

2018-02-22 Thread via Unicode

On 22.02.2018 05:01, David Starner via Unicode wrote:

On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode
<unicode@unicode.org [1]> wrote:


Where can I post suggestions and feedback for Unicode?


Here is as good as any place. There are specific places for a few
specific things, but likely if you do have something that's likely to
get changed, you'll need the help of someone here to get through the
process. It is a quarter-century old technical standard embedded in
most electronics, so I would temper any expectations for major
changes; it works the way it works because that's the way previous
versions worked, and nobody is interested in the trouble changing
them would involve.



Yes and no. This list is for informal discussion, so someone unsure 
about things may start here, but posting on this list does not count as 
feedback or suggestions to Unicode. So by all means post here some of 
your ideas and understand more.


Regards
John Knightley



Links:
--
[1] mailto:unicode@unicode.org




Re: Suggestions?

2018-02-21 Thread David Starner via Unicode
On Wed, Feb 21, 2018 at 7:55 AM Jeb Eldridge via Unicode <
unicode@unicode.org> wrote:

> Where can I post suggestions and feedback for Unicode?
>

Here is as good as any place. There are specific places for a few specific
things, but likely if you do have something that's likely to get changed,
you'll need the help of someone here to get through the process. It is a
quarter-century old technical standard embedded in most electronics, so I
would temper any expectations for major changes; it works the way it works
because that's the way previous versions worked, and nobody is interested
in the trouble changing them would involve.


Re: Suggestions?

2018-02-21 Thread James Kass via Unicode
http://www.unicode.org/faq/faq_on_faqs.html#34


Re: Suggestions?

2018-02-21 Thread Philippe Verdy via Unicode
The Unicode website has a section for feedback in its menu, but in separate
projects for TUS and for CLDR.
Feedback is also requested for every proposed amendment to the
standard, annexes, and data. First search for the relevant topic on the
website, then look at the side bar if there's no specific feedback link in
the main page content.
Feedback or proposals are submitted through an online form, and will then be
forwarded by email to the interested subcommittees and possible subscribers.
For data submission to CLDR, this is done by the survey tool, when it is
open.
For reference implementations, that have an opensourced repository,
feedback is submitted via the links given in the repository itself.

Basically, you need to look for the most relevant topic, and then use the
appropriate link so that this can be sorted and sent to the correct people.
There's also a feedback for questions related to Unicode memberships, or
for legal requests.

There's also a general feedback link, but don't expect an immediate
response: it may take time to reach the right people to get an answer, and
unsorted or unqualified feedback takes time to be classified and extracted
from the fog of incoming spam and irrelevant submissions.

If you don't know where to post, this mailing list can guide you, but this
is not the place to submit a formal request. Various people (including
me) may reply to you, and any reply you receive from this list is not
officially endorsed by Unicode. This is more a "community" list used to
connect interested people, to discuss how to improve proposals, to get
guidance before submitting a qualified formal request, or to ask for peer
review before submitting it.

2018-02-21 16:23 GMT+01:00 Jeb Eldridge via Unicode <unicode@unicode.org>:

> Where can I post suggestions and feedback for Unicode?


Re: Suggestions?

2018-02-21 Thread Asmus Freytag via Unicode

  
  
On 2/21/2018 7:23 AM, Jeb Eldridge via Unicode wrote:

> Where can I post suggestions and feedback for Unicode?

What kinds of suggestions / what kind of
feedback are we talking about?
A./

  



Re: Suggestions in Unicode Indic FAQ

2003-02-05 Thread Doug Ewell
Kent Karlsson kentk at md dot chalmers dot se wrote:

 Consider English.  If I write , that may well be a spell error.

Or even Ŋŋŋŋ!, as Michael Everson wrote in WG2 N2306.

-Doug Ewell
 Fullerton, California





RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson

 --- Kent Karlsson [EMAIL PROTECTED] wrote:
   
   No fallback rendering is coming into picture with your explanation. 
  
  Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
  is very unlikely to have a ligature, specially adapted (and fitting)
  adjustment points, or similar.  The rendering would in that sense
  need to use a fallback mechanism that renders an approximation
  for this rare combination.
 
 Do you mean to say that an application has to take care of combination of

s/has to/should, also in display,/

 all other Unicode characters with each combining marks in the fallback

Including multiple combining marks on one base character.

 mechanism for such approximation? Can you count the number of combinations
 which may result in millions!?

Many, many more.  Which is why you need a fallback mechanism (rather
than ligatures, adjustment points, etc. which cannot handle that many
combinations).
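To put a rough number on "many, many more", here is a minimal sketch that counts base/mark pairings with Python's standard unicodedata module. It assumes that any assigned, non-control, non-mark character can serve as a base, which is a simplification made for this illustration only. The exact totals depend on the Unicode version bundled with the interpreter, so no figures are quoted here.

```python
import sys
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")          # nonspacing, spacing, enclosing marks
NOT_A_BASE = ("Cn", "Cs", "Co", "Cc")         # unassigned, surrogate, private use, control

def count_marks_and_bases():
    """Count combining marks and (assumed) possible base characters."""
    marks = 0
    bases = 0
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat in MARK_CATEGORIES:
            marks += 1
        elif cat not in NOT_A_BASE:
            bases += 1
    return marks, bases

marks, bases = count_marks_and_bases()
one_mark = bases * marks            # one combining mark per base
two_marks = bases * marks * marks   # two stacked marks per base

print("combining marks:", marks)
print("possible bases: ", bases)
print("base + 1 mark:  ", one_mark)
print("base + 2 marks: ", two_marks)
```

Even with a single mark per base the product is far beyond what per-combination ligatures or adjustment points could cover, which is the argument for a generic fallback.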

In the case of Indic postfix and prefix matras, the general handling is
in principle simple: for the postfix ones, nothing special need be done,
for the prefix ones (i.e. the reordrant ones) do the reordering (before
the preceding base character at least, for certain Indic combinations,
move it even earlier).  Then you have the visual order.  I'm
ignoring ligature formation here, but that has to be done as well. For
the superscript, subscript, and split matras (and other combining
marks) the general  approach is a bit more complicated.  See
http://www.unicode.org/notes/tn2/ for hints.
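The reordering step described above can be sketched as follows. This is a toy illustration and not any rendering engine's algorithm: the set PREFIX_MATRAS, the helper is_consonant, and the function visual_order are hypothetical names, only one Devanagari prefix matra is listed, and nukta, ligatures, and split matras are ignored.

```python
# Assumption: a hand-picked set of reordrant (prefix) matras, Devanagari only.
PREFIX_MATRAS = {"\u093F"}   # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"            # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    """True for DEVANAGARI LETTER KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"

def visual_order(text):
    """Move each prefix matra before the preceding base or consonant cluster."""
    out = []
    for ch in text:
        if ch in PREFIX_MATRAS and out:
            i = len(out) - 1
            # Walk back over "consonant virama consonant" links in a cluster.
            while (i >= 2 and out[i - 1] == VIRAMA
                   and is_consonant(out[i - 2]) and is_consonant(out[i])):
                i -= 2
            out.insert(i, ch)
        else:
            # Postfix matras and everything else stay where they are.
            out.append(ch)
    return "".join(out)

ka_i = visual_order("\u0915\u093F")                # KA, I
kma_i = visual_order("\u0915\u094D\u092E\u093F")   # KA, VIRAMA, MA, I
equals_i = visual_order("=\u093F")                 # EQUALS SIGN, I
```

In this sketch the matra moves in front of whatever base precedes it, Indic or not, and moves further back only when the preceding characters form a consonant cluster.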

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson



  No, with proper reordering (and normal display mode), the e-matra at
  the beginning of the second word would appear to be last glyph of the
  first word.  Similarly, for the second case, the e-matra glyph would
  have come to the left of the pa.  The fluent reader (ok, not me...)
  would then see those errors anyway, just like I can find spelling
  errors in Swedish, most often without any kind of special marking. (I'm
  assuming through-out that reordrant combining characters 
 are reordered.)
 
 Illegal sequences

There are no illegal sequences.

 are not reordered as you indicated.

Then that is a problem with the display software you are using.

 Also, as far as I
 know there is no mention of reordering of illegal input sequence (or
 invalid combining mark) in Unicode standard.

Again, there are no illegal input sequences.

 Consider the last set of glyphs (left-to-right, top-to-bottom) in the
 attached image. It is the rendering effect of illegal input sequence

See above.

 Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka 
 [U+0915] and without any dotted circle.

Let's see if I understand you. 093F, 0915 is the input.  Since
093F is a combining character, one should (not must, but should)
treat this *as if* the input was 0020, 093F, 0915.  Since 093F
is also reordrant, one must reorder it before the preceding base
character (at least, more for consonant clusters), so the output
glyphs would be glyph for 093F, space, glyph for 0915.
(But your image does not show that.)
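The "as if preceded by a space" treatment can be sketched like this. It is a minimal illustration using Python's unicodedata module; the function names are hypothetical, and only a mark at the very start of the string is handled, not marks following line or paragraph breaks.

```python
import unicodedata

def is_combining_mark(ch):
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def with_implied_space(text):
    """Return the text a renderer could treat as its input:
    a leading combining mark gets a SPACE as its base."""
    if text and is_combining_mark(text[0]):
        return " " + text
    return text

defective = "\u093F\u0915"   # VOWEL SIGN I, KA
ordinary = "\u0915\u093F"    # KA, VOWEL SIGN I

as_if = with_implied_space(defective)
for ch in as_if:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The stored text is not changed; the space exists only in the renderer's view of the sequence, which is why the two inputs remain distinguishable on screen.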

 As you might be knowing the correct input
 sequence should be U+0915 followed by U+093F.

That would be a different input (whether that is correct or
not depends on the author's intent).

 In that case the result would
 have been similar to what appears right now. 

Similar ONLY if you disregard the space glyph that should
have been there.

 (Though some more
 sophisticated font/application may want to replace the 
 appearing glyph for
 U+093F to be substituted by some other glyph with proper 
 attachment point).

That may be.

 Now there is no way that user can identify this illegal input sequence
 without dotted circle.

Yes, there is.  Don't disregard the space glyph.

 In the worst case even this rendered glyph is
 attached to the character from a class (for example, 
 consonant cluster of
 Ka Virama Ma) for which the glyph has been designed to 
 render with.
 In such case even a fluent reader can not identify the error.
 
  
  There are spelling errors, yes.  But there are other ways 
 of indicating
  spelling errors, that are (by now) fairly conventional for 
 any language
  (as long as there is an appropriate dictionary installed), 
 and that also
  are more general (in catching more spelling errors) and 
 less obtrusive
  (the author really wants to write it that way, for some reason).
  
   Apparently, Michka used a non-OpenType Bengali Unicode font when
   he embedded the fonts into the page.  As long as you are looking
   at the page on-line, with the embedded fonts, these errors are
   invisible.  
   
   It may be typographically horrible.  It *should* be 
 typographically
   horrible in order to illustrate bad sequences clearly.
  
  I'd prefer little red wiggly lines under the word, or 
 yellow background
  or some such (just for screen display, not for printing; 
 screen grabs
  not counted).  And that for any spelling error.
 
 Spelling mistakes can be categorized into two different classes.

???

 One
 arising from illegal input sequence (e.g., Vowel Sign E as the first
 character in a word)

There are no illegal input sequences.

 and the other one is legal input sequence with no
 contextual meaning in the dictionary.

A simple spell checker just checks if the word is in the 
dictionary or not (without worrying about the context).
That would catch what you call illegal input sequences too.
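A minimal sketch of that point, assuming a toy one-word dictionary (the dictionary contents and the function name are hypothetical): a plain lookup flags a word that begins with a vowel sign in the same way as any other unknown word, without a separate notion of "illegal" sequences.

```python
def not_in_dictionary(words, dictionary):
    """Return the words that a simple dictionary lookup would flag."""
    return [w for w in words if w not in dictionary]

# Toy dictionary with a single entry: KA + VOWEL SIGN I.
toy_dictionary = {"\u0915\u093F"}

typed = [
    "\u0915\u093F",   # KA, VOWEL SIGN I
    "\u093F\u0915",   # VOWEL SIGN I, KA (word starts with the vowel sign)
]

flagged = not_in_dictionary(typed, toy_dictionary)
```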

 While indication of the  second type
 of mistake is generally used only in sophisticated 
 applications like word processor, 

Why?  There is nothing in principle hindering a spell checker
to be used in a plain text editor.

 everyone wants to know the first kind of mistake.

Without a spell checker, but with proper rendering, spelling
errors can be detected by a fluent reader, since they look
different also without any dotted circles. For some ambiguous
Indic cases, like a prefix matra, consonant, postfix matra, all
possible character sequences for them are misspellings (as far
as I know).

 With your
 explanation it seems that even plain text editor is not 
 useful at all to identify such common typing mistakes!

Consider English.  If I write , that may well be a spell error.
Do I deserve to get the rendering of that string to be littered by
dotted circles just because a sequence of four n's has to be
a spell error?

/Kent K

 - Keyur





RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:

  
  Without that dotted circle appearing, the e-matra would appear to
  have been properly encoded, 
 
 No, with proper reordering (and normal display mode), the e-matra at
 the beginning of the second word would appear to be last glyph of the
 first word.  Similarly, for the second case, the e-matra glyph would
 have come to the left of the pa.  The fluent reader (ok, not me...)
 would then see those errors anyway, just like I can find spelling
 errors in Swedish, most often without any kind of special marking. (I'm
 assuming through-out that reordrant combining characters are reordered.)

Illegal sequences are not reordered as you indicated. Also, as far as I
know there is no mention of reordering of illegal input sequence (or
invalid combining mark) in Unicode standard.

Consider the last set of glyphs (left-to-right, top-to-bottom) in the
attached image. It is the rendering effect of illegal input sequence
Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka [U+0915] and
without any dotted circle. As you might be knowing the correct input
sequence should be U+0915 followed by U+093F. In that case the result would
have been similar to what appears right now. (Though some more
sophisticated font/application may want to replace the appearing glyph for
U+093F to be substituted by some other glyph with proper attachment point).
Now there is no way that user can identify this illegal input sequence
without dotted circle. In the worst case even this rendered glyph is
attached to the character from a class (for example, consonant cluster of
Ka Virama Ma) for which the glyph has been designed to render with.
In such case even a fluent reader can not identify the error.

 
 There are spelling errors, yes.  But there are other ways of indicating
 spelling errors, that are (by now) fairly conventional for any language
 (as long as there is an appropriate dictionary installed), and that also
 are more general (in catching more spelling errors) and less obtrusive
 (the author really wants to write it that way, for some reason).
 
  Apparently, Michka used a non-OpenType Bengali Unicode font when
  he embedded the fonts into the page.  As long as you are looking
  at the page on-line, with the embedded fonts, these errors are
  invisible.  
  
  It may be typographically horrible.  It *should* be typographically
  horrible in order to illustrate bad sequences clearly.
 
 I'd prefer little red wiggly lines under the word, or yellow background
 or some such (just for screen display, not for printing; screen grabs
 not counted).  And that for any spelling error.

Spelling mistakes can be categorized into two different classes. One
arising from illegal input sequence (e.g., Vowel Sign E as the first
character in a word) and the other one is legal input sequence with no
contextual meaning in the dictionary. While indication of the second type
of mistake is generally used only in sophisticated applications like word
processor, everyone wants to know the first kind of mistake. With your
explanation it seems that even plain text editor is not useful at all to
identify such common typing mistakes!

- Keyur


__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com
inline: img1.jpg

RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
  
  No fallback rendering is coming into picture with your explanation. 
 
 Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
 is very unlikely to have a ligature, specially adapted (and fitting)
 adjustment points, or similar.  The rendering would in that sense
 need to use a fallback mechanism that renders an approximation
 for this rare combination.

Do you mean to say that an application has to take care of combination of
all other Unicode characters with each combining marks in the fallback
mechanism for such approximation? Can you count the number of combinations
which may result in millions!?

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-31 Thread Kent Karlsson

Keyur Shroff wrote:
...
 
 No fallback rendering is coming into picture with your explanation. 

Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
is very unlikely to have a ligature, specially adapted (and fitting)
adjustment points, or similar.  The rendering would in that sense
need to use a fallback mechanism that renders an approximation
for this rare combination.

...
 Here is the para you are talking about.
 
 [Quote]
[...]
 should be rendered as if they had a space as a base character.
 [/Quote]
 
 In the text there is no mention of explicitly inputting space character
 before any combining mark that is defective combining character.

The text says as if. Which I also emphasised before.

 Also, the word should be rendered implies that it is recommendation. 

Yes.  A rather good one.  

  By removing that particular fallback mechanism from implementations
[inserting dotted circle glyphs for allegedly invalid combinations]
  as well as the TUS text!  (I'm serious!) This particular fallback
  mechanism is NOT recommended as it stands.  
 
 Note that the text has been written in the section Implementation
 Guidelines. Can't it be considered as recommendation?

That particular one, no.  Just an example [that isn't very good,
outside of a general show invisibles mode].

  But since its mention is erroneously taken as a recommendation, I'd 
  suggest removing also its mention.
 
 This is disastrous! What will happen to the systems which already
 implemented this recommendations!?

It's not a recommendation.

 Will they be considered invalid
 implementation afterwards? What is about stability?

They are ugly implementations as they are.  And will stay ugly
implementations.  Stability is good ;-).

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson

  I don't know where you find support for that position in that text.
  Can you please quote?  There are no invalid base consonants for
  any dependent vowel (for Indic scripts; similarly for any 
  other script).
 
 Actually, there is a mention of displaying combining marks on dotted
 circles:

I know.  But there is no mention (that I have found) of invalid base
characters or any recommendation for using dotted circles especially
for Indic scripts.

 I add that this is a good way of displaying a combining mark that has no
 base character, i.e. one occurring at the begin of a line or paragraph.

No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson

 Let me give a proper example this time. Consider a Vowel Sign E [U+0947]
 appearing after any non-consonant character. This sign is generally
 attached to the consonants. It has zero advance width with negative left
 side bearing in the font. 

Ok.

 Clearly, since in this case the sign is not
 preceded by any consonant base, it has to be rendered using one of the
 mechanisms specified in fallback rendering of non-spacing marks.

If it is preceded by a SPACE (or is first in a string/paragraph/similar)
it should be rendered as a freestanding glyph (no dotted circle).  If it
is preceded, in the source string, by, say, FULL STOP, a typographically
acceptable rendering would be to have the vowel sign E glyph float on
top of the glyph for the FULL STOP (no dotted circle).  Similarly for a
vowel sign E that follows a LATIN CAPITAL LETTER A. (But I don't expect
good positioning, just readable.) Again similarly, a vowel sign I that
follows an EQUAL SIGN should be rendered as a vowel sign I glyph to the
left of an EQUAL SIGN glyph.  No dotted circle. (I know that the reordrant
vowel signs may reorder over more than the preceding base character IF it
is a (sub)string in an Indic script.) Again similarly, a KA, II, II
string should be rendered as a KA + II + II glyph sequence (invoking
any ligature for KA + II if there is one in the font; II + II is
unlikely to have any ligature, since it is not used by any orthography). 
No dotted circle(s). The fallback hinted at in TUS 3.0 that uses dotted
circles is 1) typographically horrible, and 2) cannot indicate that
there is any error in the given character sequence.

...
 the application. Now in order to render it with dotted circle if we
 introduce the circle in the text before this sign then also 
 the circle is invalid base for this Vowel Sign E.

No base character is invalid for any combining character.

...
  Languages or syllable boundaries have nothing to do with this. These
  special
  sequences should *never* be part of any syllable or word in any language:
  they are just a way of showing the shape of a glyph, to be used when,
  e.g., talking about typography or spelling.
 
 Then how can we take care of fallback mechanism?

By removing that particular fallback mechanism from implementations
as well as the TUS text!  (I'm serious!) This particular fallback
mechanism is NOT recommended as it stands.  But since its mention is
erroneously taken as a recommendation, I'd suggest removing also its
mention.  That mechanism is as bad as misplacing glyphs for combining
marks on the glyph(s) for the follow-on character, if not worse.
(Show invisibles (for all of the text or a user selected run
of the text) is an entirely different story.)

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Marco Cimarosti
Keyur Shroff wrote:
  However, I totally agree with Kent that this funny 
 rendering is *not* a
  requirement of the Unicode standard, as Keyur Shroff seems 
 to suggest. It
  is just an example of many several methods [that] are 
 available to deal
  with strange sequences.
 
 A sequence should not be treated as strange sequence if it has been
 written intentionally. It may have some contextual meaning.

I said strange in the sense of character sequences that are not part of
the ordinary spelling of any language. In fact, a thing like a matra
floating in the air or on a dotted circle is something that you'd only see
in a text (not necessarily *in* an Indian language) which talks about
spelling, character sets, and the like.

 Also, what is good or bad is also subjective. It may also 
 vary from one script to another.

Yes, but what is mandatory and what is not in Unicode should not be too
subjective, else we could not call it a standard.

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread John Hudson
At 01:20 AM 1/30/2003, Marco Cimarosti wrote:


However, I totally agree with Kent that this funny rendering is *not* a
requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is
just an example of many several methods [that] are available to deal with
strange sequences.


Perhaps there is some confusion here because the use of the dotted circle 
is an explicit recommendation of the MS Indic OpenType spec, which is what 
the majority of Indian font developers are now working with. So there may 
be some confusion between what is expected in the MS spec and what is 
required by Unicode.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread jameskass
.
Kent Karlsson wrote,

  I add that this is a good way of displaying a combining mark that has no
  base character, i.e. one occurring at the begin of a line or paragraph.
 
 No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

So it says.  But, the 'space method' could be interpreted as being under
the Simple Overlap fallback rendering method, since the paragraph from
which you are quoting appears immediately after the paragraph treating 
with Simple Overlap method.

In the first paragraph under Fallback Rendering (bottom of page 120), 
the Show Hidden method is described.  It would have been redundant to 
say that degenerate cases **under the Show Hidden method** need to
be displayed as if on a dotted circle, since the Show Hidden method
is **all about displaying on dotted circles** whenever there is an
inability to draw the sequence.  Like, in the event of an invalid
sequence.

Whether to use Show Hidden or Simple Overlap method to display 
invalid or degenerate sequences should be left up to the various
software developers.

It does seem to be very easy to spot bad input when the dotted circle
appears in the display.  Stands out like a sore thumb, which was
probably the intention.

Perhaps this section of T.U.S. could stand some clarification.

Best regards,

James Kass
.




Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff
Hello,

There are a few discrepancies in the Indic FAQ. Though they were reported
earlier by Andy White, I see they are still present in the FAQ. I also sent a
clarification, but by mistake I sent the mail to the Yahoo group where this
mailing list is archived, so it never reached this mailing list. You can
refer to the link http://groups.yahoo.com/group/unicode/message/16352


The following are the suggestions.

SUGGESTION-1:

In the FAQ
   http://www.unicode.org/faq/indic.html#2
it is mentioned that 

ISCII:             Unicode:
Halant + Halant    Halant + ZWJ

produce similar result. This is wrong. In ISCII, Halant+Halant is known as
explicit halant and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ
should be replaced by ZWNJ.
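For reference, the two Unicode sequences under discussion can be spelled out in code points. This is a minimal sketch using Python's unicodedata module; KA is chosen as the consonant only for illustration, and the variable names are hypothetical.

```python
import unicodedata

KA = "\u0915"
VIRAMA = "\u094D"   # called halant in ISCII
ZWNJ = "\u200C"
ZWJ = "\u200D"

# Explicit halant (ISCII: consonant + halant + halant).
explicit_halant = KA + VIRAMA + ZWNJ
# Half-form request (ISCII: consonant + halant + INV).
half_form = KA + VIRAMA + ZWJ

def describe(s):
    """List the code points of s with their character names."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s]

for label, seq in (("explicit halant", explicit_halant), ("half form", half_form)):
    print(label)
    for line in describe(seq):
        print("  ", line)
```

The two sequences differ only in the final joiner, which is why mixing up ZWJ and ZWNJ in the FAQ table changes the meaning.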


SUGGESTION-2:

In the FAQ
   http://www.unicode.org/faq/indic.html#16

It is mentioned that following are equivalent

ISCII            Unicode
KA halant INV    KA virama ZWJ
RA halant INV    RAsup (i.e., repha)

In fact there is no way in Unicode to produce RAsup directly, i.e., without
using a base consonant. The sequence RA virama ZWJ will actually produce
half-RA (or eyelash-RA), which is commonly used in Marathi. Eyelash-RA can
also be produced with the RA Halant Nukta sequence, both in ISCII
(known as soft halant) and in Unicode (just for conformance with ISCII).

Also, in the same answer the following sequence is recommended.

ISCII            Unicode
INV halant RA    SPACE virama RA (RAsub)



SUGGESTION-3:

Use of SPACE character as consonant may create problem for state machine
which finds language/syllable boundary. In fact we need a codepoint for one
invisible consonant (similar to INV in ISCII) in Unicode which can solve
this problem with Unicode.

After inclusion of INV character the following can be recommended.

ISCII            Unicode
KA halant INV    KA virama INV
RA halant INV    RA virama INV (i.e., repha)
INV halant RA    INV virama RA (RAsub)

The INV character in Unicode can also be used for displaying dependent
vowel matras without dotted circle.

Unicode
INV Vowel sign O
INV Vowel sign AI

etc. This can replace existing definition of SPACE as invisible consonant
depending on the context.

Any other pointers!!?

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
 In the FAQ
http://www.unicode.org/faq/indic.html#16
 
 It is mentioned that following are equivalent
 
 ISCII            Unicode
 KA halant INV    KA virama ZWJ
 RA halant INV    RAsup (i.e., repha)

The last line is really bizarre! I would agree that it is plain wrong...

What is supposed to appear in column Unicode is the Unicode *encoding*
equivalent to the RA halant INV in the ISCII column. But RAsup (i.e.,
repha) is the description of a *glyph*.

 In fact there is no way in Unicode to produce RAsup directly, 
 i.e., without using base consonant. [...]

I agree. This issue has been raised several times, and several viable
solutions have been proposed, but I don't remember Unicode officials
ever acknowledging the problem.

But probably this has been noted down and discussed. I hope to see an
official solution in TUS 4.0.

 SUGGESTION-3:
 
 Use of SPACE character as consonant may create problem for 
 state machine which finds language/syllable boundary.
 In fact we need a codepoint for one invisible consonant
 (similar to INV in ISCII) in Unicode which can solve
 this problem with Unicode.
 
 After inclusion of INV character the following can be recommended.
 
 ISCII            Unicode
 KA halant INV    KA virama INV
 RA halant INV    RA virama INV (i.e., repha)
 INV halant RA    INV virama RA (RAsub)

Why not represent INV with a double ZWJ? E.g.:

ISCII            Unicode
KA halant INV    KA virama ZWJ ZWJ
RA halant INV    RA virama ZWJ ZWJ (i.e., repha)
INV halant RA    ZWJ ZWJ virama RA (RAsub)

This has the advantage that the most common sequences will work OK also on
old display engines implemented *before* the double-ZWJ convention is
introduced.

E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for the
simple reason that the first ZWJ is enough to do the work, and  the second
ZWJ is invisible.

Of course, an old engine will still display a RA[eyelash] for RA virama
ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a
white box, which is what would happen with your new INV character.

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson

 The [new] INV character in Unicode can also be used for displaying dependent
 vowel matras without dotted circle.

A space followed by a dependent vowel sign should display just the
dependent vowel sign, no dotted circle.  Indeed, (except for a show
invisibles mode, or a character chart display mode) no (Indic or other)
text that does not contain the *character* DOTTED CIRCLE should ever
display a dotted circle as part of the displayed text. Systems that
do display a dotted circle (in normal display mode) where there is
no such *character* in the displayed text are buggy!

/Kent K

(B.t.w. the chart dotted circle glyph for combining characters
looks a bit different from the (normal) glyph for DOTTED CIRCLE.)




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:

 Why not represent INV with a double ZWJ? E.g.:
 
   ISCII            Unicode
   KA halant INV    KA virama ZWJ ZWJ
   RA halant INV    RA virama ZWJ ZWJ (i.e., repha)
   INV halant RA    ZWJ ZWJ virama RA (RAsub)
 
 This has the advantage that the most common sequences will work OK also
 on
 old display engines implemented *before* the double-ZWJ convention is
 introduced.
 
 E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for
 the
 simple reason that the first ZWJ is enough to do the work, and  the
 second ZWJ is invisible.
 
 Of course, an old engine will still display a RA[eyelash] for RA
 virama
 ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a
 white box, which is what would happen with your new INV character.

Certainly. This looks more promising because even RAsub has two alternate
forms. One form is used with consonants KA, KHA, GHA, etc and the other
form is used with consonants TTA, TTHA, DDA, DDHA, etc. With your ZWJ based
scheme we can insert as many ZWJ as we wish to produce all possible
alternate forms!

But sometimes a user may want visual representation of these symbols in two
different ways: with dotted circle and without dotted circle. Example of
this could be RAsup on top of dotted circle and RAsup on top of space
character. Current use of space character to eliminate dotted circle is
really painful and may create problems in determining language and syllable
boundaries. The main problem with space character is that unlike
ZWJ/ZWNJ/Dotted Circle, it falls within the range of other important script
Latin. Finally it may affect all important text processing which uses
Unicode characters to find language boundaries. Use of INV character in one
shot can solve all these problems. We can put it in consonant class which
can help text processing applications. Moreover, it will be difficult for
all possible to provide upward compatibility all the time even though it is
desirable. Implementation of Unicode will need to be upgraded with every
introduction of new glyphs or rules. Otherwise applications have to
explicitly declare the version of Unicode used in implementation.

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
 
 A space followed by a dependent vowel sign should display just the
 dependent vowel sign, no dotted circle.  Indeed, (except for a show
 invisibles mode, or a character chart display mode) no (Indic or
 other)
 text that does not contain the *character* DOTTED CIRCLE should ever
 display a dotted circle as part of the displayed text. Systems that
 do display a dotted circle (in normal display mode) where there is
 no such *character* in the displayed text are buggy!

In Indic scripts, any sign that appears in text without a valid consonant
base may be rendered with a dotted circle as a fallback
mechanism (Section 5.14, Rendering Nonspacing Marks,
http://www.unicode.org/uni2book/ch05.pdf). A system implementing this as
default behaviour should not be considered buggy. What the default rendering
behaviour should be (i.e., show hidden or not) may vary from one
script to another and also depends on implementation policy.

For scripts other than Indic scripts, it may be useful to render the
nonspacing mark without a dotted circle because, even after rendering it as an
overlapping glyph, the result is recognizable. However, for Indic scripts the
dotted circle is very useful as default behaviour since it gives
immediate feedback to the user that there may be a defective combining
character in the text. Most of the time such errors are unintentional
rather than intentional.

Unicode has a provision to remove this dotted circle. The space character is
used to indicate to the fallback mechanism that no dotted circle should be
used while rendering a stand-alone sign which is normally attached to
other characters. This is useful when the user wants to display the
sign without any circle. Also, with this scheme it is possible to show some
combining marks with a dotted circle and some without.
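
As a rough illustration of the fallback policy described in this message (and
not of what the standard requires, which is exactly what is disputed in this
thread), the decision could be sketched as below. The function name and the
simplified "valid base" test (any letter or preceding mark) are assumptions
made for the sketch; a real Indic shaping engine uses script-specific rules.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"
SPACE = " "

def with_fallback_circles(text):
    """Insert U+25CC before marks that have no base (hypothetical policy).

    A mark counts as having a base if the previous character is a letter,
    another mark, a SPACE, or an explicit DOTTED CIRCLE. This is a
    simplification for illustration only.
    """
    out = []
    prev = None
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            has_base = prev is not None and (
                prev in (SPACE, DOTTED_CIRCLE)
                or unicodedata.category(prev)[0] in "LM"
            )
            if not has_base:
                out.append(DOTTED_CIRCLE)
        out.append(ch)
        prev = ch
    return "".join(out)

# Isolated DEVANAGARI VOWEL SIGN E: the sketch adds a dotted circle.
print(ascii(with_fallback_circles("\u0947")))
# SPACE + vowel sign: left alone, no circle is added.
print(ascii(with_fallback_circles(" \u0947")))
```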

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Marco Cimarosti
Keyur Shroff wrote:
 But sometimes a user may want visual representation of these 
 symbols in two different ways: with dotted circle and
 without dotted circle.

Why not use a dotted circle character explicitly when you want to see one?

 Example of
 this could be RAsup on top of dotted circle and RAsup on top of space
 character. Current use of space character to eliminate dotted 
 circle is really painful and may create problems in determining 
 language and syllable boundaries.

Language or syllable boundaries have nothing to do with this. These special
sequences should *never* be part of any syllable or word in any language:
they are just a way of showing the shape of a glyph, to be used when, e.g.,
talking about typography or spelling.

 The main problem with space character is that unlike
 ZWJ/ZWNJ/Dotted Circle, it falls within the range of other 
 important script Latin. 

Plain wrong! White-space characters and punctuation do not belong to any
script: characters such as space, ! and ? are used for many scripts and
languages. Even the danda punctuation, which is in the Devanagari range,
does not belong to Devanagari: it is also used for other Indic scripts.
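
A small sketch of this point, using only Python's standard unicodedata module.
That module exposes the General Category but not the Script property, so the
sketch only shows that these characters are classified as separators or
punctuation rather than letters; the helper name is an assumption made here.

```python
import unicodedata

def classify(ch):
    """Return (code point label, character name, General Category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# SPACE, EXCLAMATION MARK, QUESTION MARK and the danda from the Devanagari block.
for ch in (" ", "!", "?", "\u0964"):
    print(classify(ch))
```

The block a character is encoded in (here, the Devanagari range for the danda)
is a matter of code chart layout; it is not the same thing as the script the
character is used with.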

 Use of INV character in one shot can solve all these
 problems. We can put it in consonant class which
 can help text processing applications. [...]

How can calling something a consonant, when it has nothing to do with
consonants, help anybody do anything?

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson

Keyur Shroff wrote
 Kent Karlsson [EMAIL PROTECTED] wrote:
  
  A space followed by a dependent vowel sign should display just the
  dependent vowel sign, no dotted circle.  Indeed, (except for a show
  invisibles mode, or a character chart display mode) no (Indic or
  other)
  text that does not contain the *character* DOTTED CIRCLE should ever
  display a dotted circle as part of the displayed text. Systems that
  do display a dotted circle (in normal display mode) where there is
  no such *character* in the displayed text are buggy!
 
 In Indic scripts any sign that appear in text not in 
 conjunction with a
 valid consonant base may be rendered with dotted circle as fallback
 mechanism (Section 5.14 Rendering Nonspacing Marks
 http://www.unicode.org/uni2book/ch05.pdf).

I don't know where you find support for that position in that text.
Can you please quote?  There are no invalid base consonants for
any dependent vowel (for Indic scripts; similarly for any other script).

 Any system implementing this as
 default behaviour should not be considered buggy.

Indeed they are.  And it should certainly not be default behaviour.

Any combining characters can be placed on any base characters without
there being any dotted circles displayed.  In particular, any combining
Devanagari characters (note: including, in principle, several dependent
vowels, even if that does not occur in any (existing) orthography) can
be placed on any Devanagari base character as well as SPACE (and other
punctuation). What should result is a reasonable composed glyph, no
dotted circle in sight (except in show invisibles mode, which I'm not
discussing here). Spelling errors should be indicated otherwise, since
they are of a very different nature.

 For scripts other than Indic scripts, it may be useful to render the
 nonspacing mark without dotted circle because even after 
 rendering it as an
 overlap glyph, the result is recognizable. However, for Indic 
 scripts use
 of dotted circle is very useful as default behaviour since it gives
 immediate feedback to the user that there may be some 
 defective combining
 character in the text. Most of the time such errors are unintentional
 rather than intentional.

No combination of base + combining characters is defective per se.
Even if the scripts are different within the combining sequence.
(Note also that the 0300 block of combining characters is script
independent.) Spelling errors are something else entirely.

 Unicode has provision to remove this dotted circle.

I'm not sure what you are talking about here.

 Space 
 character is used
 to give indication to fallback mechanism that no dotted 
 circle should be
 used while rendering this stand alone sign which is normally 
 attached to
 other characters. This is useful when sometimes user want to 
 display the
 sign without any circle. Also, with this scheme it is 
 possible to show some
 combining marks with dotted circle and some without dotted circle.

The fallback mechanisms talked about in section 5.14 of TUS 3.0 are
the use of less than ideal (typographically!) mechanisms to display
an *approximation* of the glyph(s) for the combining sequence.

An exceedingly bad approximation is displaying a dotted circle as a
fake base (again: disregarding show invisibles, or chart modes,
which, however, should be consistent and show a dotted circle fake
base for ALL combining characters occurring in the text).  The use
of this exceedingly bad approximation (in normal display mode) does
in no way indicate that the combining sequence is at all defective.
It may indicate that the display engine (or the font) is defective...

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:
 Keyur Shroff wrote:
  But sometimes a user may want visual representation of these 
  symbols in two different ways: with dotted circle and
  without dotted circle.
 
 Why not use a dotted circle character explicitly when you want to see
 one?

Note that whenever I mention the words "combining mark" I am really talking
about vowel signs (matras) and other modifiers in Indic scripts, which are
script dependent. I am sorry if I have confused you with the combining
diacritical marks in the block [U+0300-U+036F], which I really didn't mean.

Let me give a proper example this time. Consider a Vowel Sign E [U+0947]
appearing after any non-consonant character. This sign is generally
attached to consonants. It has zero advance width with a negative left
side bearing in the font. Clearly, since in this case the sign is not
preceded by any consonant base, it has to be rendered using one of the
mechanisms specified for fallback rendering of nonspacing marks. If we
render it with a space, as you said, then we have to insert a space character
at the time of fallback rendering (which can be taken care of in the rendering
pipeline) even though no space character is present in the backing store of
the application. Now, if in order to render it with a dotted circle we
introduce the circle in the text before this sign, then the circle is also
an invalid base for this Vowel Sign E. As a result, fallback rendering
will again take place, rendering the circle and the vowel sign positionally
separate. In this case the dotted circle will appear first, followed by
the vowel sign (matra) on top of a space character.

If you know any other way to solve this problem then please explain. Also
let me know if I have misinterpreted the text written in the Unicode standard.
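
For reference, the character properties involved in this example can be
inspected with Python's unicodedata module. This is only a sketch of the three
sequences under discussion (sign on SPACE, sign on DOTTED CIRCLE, sign on the
consonant KA); how each one is actually displayed depends on the font and the
rendering engine, which the code cannot show. The helper name is an assumption.

```python
import unicodedata

VOWEL_SIGN_E = "\u0947"
DOTTED_CIRCLE = "\u25cc"
LETTER_KA = "\u0915"

def describe(ch):
    """Return (code point label, character name, General Category)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

sequences = {
    "on space": " " + VOWEL_SIGN_E,
    "on dotted circle": DOTTED_CIRCLE + VOWEL_SIGN_E,
    "on consonant KA": LETTER_KA + VOWEL_SIGN_E,
}

for label, seq in sequences.items():
    print(label, [describe(ch) for ch in seq])
```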


 
  Example of
  this could be RAsup on top of dotted circle and RAsup on top of space
  character. Current use of space character to eliminate dotted 
  circle is really painful and may create problems in determining 
  language and syllable boundaries.
 
 Language or syllable boundaries have nothing to do with this. These
 special
 sequences should *never* be part of any syllable or word in any language:
 they are just a way of showing the shape of a glyph, to be used when,
 e.g., talking about typography or spelling.

Then how can we take care of the fallback mechanism?


Thanks for taking the trouble to answer my queries :-)

- Keyur







Re: Suggestions for next print edition

2001-12-03 Thread juuichiketajin


 You can always search the big Unihan.txt file on the kJapaneseKun
 and kJapaneseOn fields, which provide whatever information we have
 on pronunciation of the characters in Japanese.
 
 If you are just stuck looking up stuff because it isn't marked up
 for Japanese, try getting Sanseido's Unicode Kanji
 Information Dictionary, which has the first 20,902 kanji in Unicode
 (the most useful set) all marked up with all the Japanese pronunciations
 (where they have any). 

The first suggestion is useless. The file is too freaking big so maybe I'll go with 
the second. Thanks.
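
For what it is worth, the size of the file need not be an obstacle if it is
read one line at a time instead of being opened in an editor. The sketch below
assumes the tab-separated layout "U+XXXX<TAB>field<TAB>value" with "#" comment
lines; the function name and the sample lines are made up for illustration, so
check them against the actual file before relying on this.

```python
def japanese_readings(lines, wanted=("kJapaneseOn", "kJapaneseKun")):
    """Yield (character, field, value) for the wanted fields, line by line."""
    for line in lines:
        if not line.startswith("U+"):
            continue  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything that is not code/field/value
        code, field, value = parts
        if field in wanted:
            yield chr(int(code[2:], 16)), field, value

# Illustrative sample in the assumed format (not copied from the real file).
sample = [
    "# comment line\n",
    "U+4E71\tkJapaneseOn\tRAN\n",
    "U+4E71\tkDefinition\tsome definition\n",
]
for char, field, value in japanese_readings(sample):
    print(f"U+{ord(char):04X}", field, value)

# With the real file, pass the open file object instead of the sample list:
#     with open("Unihan.txt", encoding="utf-8") as f:
#         for char, field, value in japanese_readings(f):
#             ...
```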





Suggestions for next print edition

2001-12-02 Thread juuichiketajin

1. Unicode points are NUMBERS. Numbers can be written in ANY base. Knowing decimal 
values of codepoints is sometimes useful, so please print them in the next edition of 
the Unicode book.
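
Until then, the conversion is a one-liner in most languages. A minimal sketch
in Python, using code points mentioned in this message; the function name is
just a label chosen for the example.

```python
def code_point_to_decimal(label):
    """Convert a 'U+XXXX' label to its decimal value."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    return int(label[2:], 16)

for label in ("U+4E71", "U+30E9", "U+30F3", "U+30C0", "U+30E0"):
    print(label, "=", code_point_to_decimal(label))
```

The reverse direction is just as short: format(20081, "04X") gives the
hexadecimal digits back.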

2. There was a Shift-JIS index for kanji. I don't know much about kanji, but it seems 
to me that they are arranged in a-i-u-e-o order of on'yomi. Why not print little 
hiragana letters at the top to aid people searching for a kanji?

Remember how I could not find the ran of randamu before? Let's see this time... 
Aha! There it is!
I knew it was somewhere between mo(kuyoubi) and (fu)ro. Better than stroke / 
radical, I wonder?
* Disclaimer: From what I hear, the Japanese do NOT write randamu as U+4E71 U+3060 
U+3080. They use U+30E9 U+30F3 U+30C0 U+30E0. But the first is cuter. ^_^




Re: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Sandeep Krishna

hi,

thankx for responding.

but when u mention change in the registry..
could u elaborate about where exactly in reg and what changes are required

my registry setting shows NLS = American_English.UTF8.

is this the setting u indicated..or something to do with the charset entry :
autodetect and autodetect_all (in classid...Mimedatabasecharset..)

pls do elaborate

regards,

Sandeep



- Original Message -
From: Kedar Moghe [EMAIL PROTECTED]
To: 'Sandeep Krishna' [EMAIL PROTECTED]
Sent: Wednesday, September 27, 2000 11:20 AM
Subject: RE: unicode + oracle query... (suggestions needed...)


Sandeep,

I think you need to set the registry charset to UTF8 on the machine where the
database is installed. We were getting the same problem when we used to send
UTF-8 strings to the Oracle database after conversion from Shift-JIS to UTF-8.
At that time too, the byte sequence of the retrieved string was being changed
and some of the bytes were being replaced with BF.

Regards,

Kedar

-Original Message-
From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 27, 2000 11:36 AM
To: Unicode List
Subject: unicode + oracle query... (suggestions needed...)


hi

actually i have been trying to use ASPs (UTF-8 encoding..) to write unicode
characters to an Oracle DB table (varchar2 field)... and then retrieve them
back..
(i used UTF-8 encoding for both writing to the database and also for
retrieving and displaying..)

there were some amazing observations...

* each unicode character was taking 7 bytes in the database (instead of the
expected 2 or 3...)
* some unicode characters (or rather code points) like 'F95F', when encoded
in UTF-8, were being stored as EF A5 BF when they should have been encoded as
EF A5 9F.. in fact many unicode characters whose encoded form had to have a
byte in the range (80..9F) were somehow getting that byte changed to BF ... thus
resulting in incorrect retrieval

I was unable to find the reasons for these strange occurrences
Pls suggest what could be the causes for these..

regards,

Sandeep.




***
SANDEEP KRISHNA
Member Technical Staff (Priceline.com)
H.C.L. Technologies Limited
A-1 CD, Sector -16, NOIDA, UP, India.
Ph:  91-11-91-4516321 (extn. 1062)
Fax: 91-11-91-4510713, 4510226
E-Mail : [EMAIL PROTECTED]
mailto:[EMAIL PROTECTED]






Re: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Sandeep Krishna

hi...

i m thoroughly confused.
actually the registry entries for oracle shows 3 entries for NLS_LANG.
and that too at the WEB SERVER end and at the DATABASE SERVER end.
so that makes tooo many combinations...

can someone indicate which of these NLS_LANG entries have to be set as
"AMERICAN_AMERICA.UTF8" and if some of them don't need this...what exactly
should be there

pls suggest necessary measures..

regards,

Sandeep




- Original Message -
From: Bob Verbrugge [EMAIL PROTECTED]
To: Sandeep Krishna [EMAIL PROTECTED]
Sent: Wednesday, September 27, 2000 1:30 PM
Subject: Re: unicode + oracle query... (suggestions needed...)


Sandeep,

You probably need to change the NLS_LANG Oracle setting in the registry.
Look under
HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE for this setting and change the
character set part to UTF8.
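
An NLS_LANG value has the form LANGUAGE_TERRITORY.CHARSET, so "the character
set part" is whatever follows the dot. A small sketch of that edit on the
string itself follows; the function name and the WE8ISO8859P1 starting value
are assumptions for the example, and the code does not touch the registry.

```python
def set_nls_charset(nls_lang, charset="UTF8"):
    """Return nls_lang with its character set part replaced by charset."""
    locale_part, dot, _old_charset = nls_lang.rpartition(".")
    if not dot:
        # No character set part present: keep the value and append one.
        locale_part = nls_lang
    return f"{locale_part}.{charset}"

print(set_nls_charset("AMERICAN_AMERICA.WE8ISO8859P1"))
```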

Bob.









RE: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Kedar Moghe

Sandeep,

I think you need to change it in the following three places:
HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\NLS_LANG
HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\ALL_HOMES\ID0\NLS_LANG
HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\HOME0\NLS_LANG

Best of luck

Regards,

Kedar







Re: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Sandeep Krishna

i mean, do all the entries have to be changed in both the Web server
machine's registry and the Oracle Database server machine's registry, or
only in one of them?
in our setup... my machine is the Web Server and the Oracle Server is a
separate machine
please clarify

regards,

Sandeep







RE: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Kedar Moghe

Only registry entries on the database machine. Not any other entry.

Regards,

Kedar







Re: unicode + oracle query....... (suggestions needed...)

2000-09-27 Thread Michael \(michka\) Kaplan

Sandeep,

Can you explain exactly what you are doing to get the data from ASP into the
Oracle database? Perhaps post the ASP code? Like most scripting languages,
VBScript and JScript both support UCS-2, and it is usually the Oracle
ODBC or OLE DB driver that has the job of converting the text from UCS-2 to
UTF-8. I wonder if what you are seeing is some type of "double
conversion"?

So the things that would be interesting to know:

1) The data access method to Oracle
2) Version of the driver being used
3) A sample of the code/script being used
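
To make the "double conversion" idea concrete, here is a minimal sketch of one
way it can happen: text that is already UTF-8 gets treated as single-byte
characters and is converted to UTF-8 a second time. Latin-1 is used as the
intermediate interpretation purely as an assumption for the example. This
gives 6 bytes for U+F95F rather than the 7 reported, so it illustrates the
kind of inflation involved and does not claim to reproduce the exact setup.

```python
def double_convert(text):
    """Encode to UTF-8, misread the bytes as Latin-1, then encode again."""
    once = text.encode("utf-8")
    misread = once.decode("latin-1")  # each byte becomes one character
    return misread.encode("utf-8")

once = "\uf95f".encode("utf-8")
twice = double_convert("\uf95f")

print(len(once), " ".join(f"{b:02X}" for b in once))
print(len(twice), " ".join(f"{b:02X}" for b in twice))
```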

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Sandeep Krishna" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, September 27, 2000 3:12 AM
Subject: Re: unicode + oracle query... (suggestions needed...)


 i mean all the entries at both Web server machine's registry and
Oracle
 Database server machine's registry or either one.
 in our setup... my machine is the Web Server and the Oracle Server is a
 separate machine
 please clarify

 regards,

 Sandeep
 - Original Message -
 From: Kedar Moghe [EMAIL PROTECTED]
 To: Unicode List [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Sent: Wednesday, September 27, 2000 3:21 PM
 Subject: RE: unicode + oracle query... (suggestions needed...)


 Sandeep,

 I think you need to change at following three places,
 HKEY_LOCAL_MACHINE\ORACLE\NLS_LANG
 HKEY_LOCAL_MACHINE\ORACLE\ALL_HOMES\ID0\NLS_LANG
 HKEY_LOCAL_MACHINE\ORACLE\HOME0\NLS_LANG

 Best of luck

 Regards,

 Kedar

 -Original Message-
 From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, September 27, 2000 5:45 PM
 To: Carl W. Brown; Bob Verbrugge; Kedar Moghe
 Cc: [EMAIL PROTECTED]
 Subject: Re: unicode + oracle query... (suggestions needed...)


 hi...

 I am thoroughly confused.
 The registry actually shows three NLS_LANG entries for Oracle,
 on both the Web server and the database server,
 so that makes too many combinations.

 Can someone indicate which of these NLS_LANG entries have to be set to
 "AMERICAN_AMERICA.UTF8" and, if some of them do not need this, what exactly
 should be there?

 Please suggest the necessary measures.

 regards,

 Sandeep




 - Original Message -
 From: Bob Verbrugge [EMAIL PROTECTED]
 To: Sandeep Krishna [EMAIL PROTECTED]
 Sent: Wednesday, September 27, 2000 1:30 PM
 Subject: Re: unicode + oracle query... (suggestions needed...)


 Sandeep,

 You probably need to change the NLS_LANG Oracle setting in the registry.
 Look under
 HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE for this setting and change the
 character set part to UTF8.
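
An NLS_LANG value has the form LANGUAGE_TERRITORY.CHARSET, so "the character
set part" is whatever follows the dot. The Python sketch below only
manipulates the string; it does not read or write the registry. The helper
names and the starting value AMERICAN_AMERICA.WE8ISO8859P1 are illustrative
assumptions, not values taken from anyone's machine in this thread.

```python
def split_nls_lang(nls_lang: str) -> tuple[str, str, str]:
    """Split LANGUAGE_TERRITORY.CHARSET into its three parts."""
    locale_part, _, charset = nls_lang.partition(".")
    language, _, territory = locale_part.partition("_")
    return language, territory, charset


def set_charset(nls_lang: str, charset: str) -> str:
    """Return nls_lang with only the character set part replaced."""
    locale_part, _, _old_charset = nls_lang.partition(".")
    return f"{locale_part}.{charset}"


current = "AMERICAN_AMERICA.WE8ISO8859P1"   # hypothetical existing value
print(split_nls_lang(current))
print(set_charset(current, "UTF8"))         # AMERICAN_AMERICA.UTF8
```

The language and territory parts are left untouched, which matches the advice
to change only the character set.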

 Bob.


 - Original Message -
 From: "Sandeep Krishna" [EMAIL PROTECTED]
 To: "Unicode List" [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Sent: Wednesday, September 27, 2000 9:16 AM
 Subject: Re: unicode + oracle query... (suggestions needed...)


  hi,
 
  Thanks for responding.

  When you mention a change in the registry, could you elaborate on where
  exactly in the registry it is and what changes are required?

  My registry setting shows NLS = American_English.UTF8.

  Is this the setting you indicated, or is it something to do with the charset
  entries autodetect and autodetect_all (in classid...Mimedatabasecharset..)?

  Please do elaborate.
 
  regards,
 
  Sandeep
 
 
 
  - Original Message -
  From: Kedar Moghe [EMAIL PROTECTED]
  To: 'Sandeep Krishna' [EMAIL PROTECTED]
  Sent: Wednesday, September 27, 2000 11:20 AM
  Subject: RE: unicode + oracle query... (suggestions needed...)
 
 
  Sandeep,
 
  I think you need to set the registry charset to UTF8 on the machine where the
  database is installed. We were getting the same problem when we used to send
  UTF-8 strings to an Oracle database after conversion from Shift-JIS to UTF8.
  In that case, too, the byte sequence of the retrieved string was changed, and
  some of the bytes were replaced with BF.
 
  Regards,
 
  Kedar
 
  -Original Message-
  From: Sandeep Krishna [mailto:[EMAIL PROTECTED]]
  Sent: Wednesday, September 27, 2000 11:36 AM
  To: Unicode List
  Subject: unicode + oracle query... (suggestions needed...)
 
 
  hi
 
  I have been trying to use ASP pages (UTF-8 encoding) to write Unicode
  characters to an Oracle DB table (varchar2 field) and then retrieve them
  again.
  (I used UTF-8 encoding both for writing to the database and for
  retrieving and displaying.)

  There were some surprising observations:

  * Each Unicode character was taking 7 bytes in the database (instead of
  the expected 2 or 3).
  * Some Unicode characters (or rather code points), such as F95F, were being
  encoded in UTF-8 as EF A5 BF when they should have been encoded as
  EF A5 9F. In fact, many Unicode characters whose encoded form 

unicode + oracle query....... (suggestions needed...)

2000-09-26 Thread Sandeep Krishna



hi

I have been trying to use ASP pages (UTF-8 encoding) to write Unicode
characters to an Oracle DB table (varchar2 field) and then retrieve them
again.
(I used UTF-8 encoding both for writing to the database and for retrieving
and displaying.)

There were some surprising observations:

* Each Unicode character was taking 7 bytes in the database (instead of the
  expected 2 or 3).
* Some Unicode characters (or rather code points), such as F95F, were being
  encoded in UTF-8 as EF A5 BF when they should have been encoded as
  EF A5 9F. In fact, many Unicode characters whose encoded form should have
  had a byte in the range 80..9F had that byte changed to BF, resulting in
  incorrect retrieval.
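
For reference, the expected bytes can be checked with a short Python sketch.
The second function only reproduces the reported symptom: it assumes,
hypothetically, that some conversion step cannot map bytes in the range
80..9F and substitutes BF for them. Both function names are made up for
illustration, and the real cause in this setup is not established here.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single code point."""
    return chr(code_point).encode("utf-8")


def simulate_bf_substitution(data: bytes) -> bytes:
    """Hypothetical lossy step: every byte in 0x80..0x9F becomes 0xBF."""
    return bytes(0xBF if 0x80 <= b <= 0x9F else b for b in data)


expected = utf8_bytes(0xF95F)
damaged = simulate_bf_substitution(expected)

print(expected.hex(" ").upper())  # EF A5 9F
print(damaged.hex(" ").upper())   # EF A5 BF
```

Note that EF A5 BF is itself valid UTF-8 (it decodes to U+F97F), so the
damage shows up as a wrong character on retrieval rather than as a decoding
error.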

I was unable to find the reasons for these strange occurrences.
Please suggest what the causes could be.

regards,

Sandeep.


***
SANDEEP KRISHNA
Member Technical Staff (Priceline.com)
H.C.L. Technologies Limited
A-1 CD, Sector -16, NOIDA, UP, India.
Ph: 91-11-91-4516321 (extn. 1062)
Fax: 91-11-91-4510713, 4510226
E-Mail : [EMAIL PROTECTED]