Re: International (Russian) support: continue

Sergei Kozello Mon, 08 Jul 2002 09:19:31 -0700

Nice to meet you!

> > I looked at the autodetection, and as far as I can see it works properly
> > (the part where the String is checked for symbols with code > 255),
> > but the problem with Russian is that Russian encodings use codes > 127,
> > so this autocheck does not detect them.
> > In CP1251 Russian characters have codes in the range 192 - 255; in CP866
> > the range is not contiguous and starts at code 128.
> >
>
> ouch.  I think this means that autodetection may in fact be useless as
> well as inefficient.  Perhaps it's time to kill it.  Thoughts, Glen?
> (Since we can't very well do autodetection with 127 or we'll cut off
> 8-bit unicode characters as well).


It is a good idea, and it makes sense: the user can choose where to use
Unicode himself.
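
To make the threshold question concrete, here is a minimal sketch. It is not
POI's SSTRecord code; the class and method names are made up for illustration.
It assumes the check runs on Java chars. Cyrillic that was decoded properly
sits at U+0410 and above, so a "> 255" check catches it. The 128 - 255 values
only show up when CP1251 bytes are widened one byte per char, and then only a
"> 127" check notices them.

```java
// Hypothetical helper, not part of POI: reports whether any character
// in the string has a code above the given threshold.
class EncodingCheck {
    static boolean hasCharAbove(String s, int threshold) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The Russian word "Privet" as proper Unicode (U+0410..U+044F range).
        String unicode = "\u041F\u0440\u0438\u0432\u0435\u0442";
        // The same word as CP1251 byte values widened one byte per char.
        String widened = "\u00CF\u00F0\u00E8\u00E2\u00E5\u00F2";

        System.out.println(hasCharAbove(unicode, 255)); // detected
        System.out.println(hasCharAbove(widened, 255)); // missed
        System.out.println(hasCharAbove(widened, 127)); // detected
    }
}
```

With an explicit per-cell encoding setting none of this guessing is needed,
which is the argument for dropping the autodetection.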


> > As for cell encoding, that was my misreading of the manual and the code.
> > Sorry.
> > I have found that setting the encoding for the cells works pretty well:
> >                 hssfCell.setEncoding( HSSFCell.ENCODING_UTF_16 );
> >                 hssfCell.setCellValue( cellValue );
> > And there it is!
> > Thank you for the help. :)
> >
>
> Great!  Well I forgot entirely about this.  For a quick history, IIRC I
> wanted this API as above, Marc thought it should/could be automatic and
> that it would be less prone to error if autodetected, so he incorporated
> autodetection into SSTRecord.  I said that was fine, provided the old
> API as I wanted it could still work even if it was kinda redundant.  He
> gave me that one.  I forgot about it entirely and so it lay undocumented
> until now.  (oops).

> Except I (or whoever beats me to it) need to document this better.  The
> Javadoc is way inadequate.  This has replaced logging as the most
> frequent question.  (next to maybe "when will feature x be done" to
> which I always reply "when you do it" ;-) )

One idea: put it in the second HOWTO example. I think that would be
convenient.

> > > I believe so.  We don't yet support 16-bit unicode strings for sheet
> > > names.  If someone supplies a patch for this
> > > I will gladly apply it.
> >
> > I tried to do it, but with no results yet. :( I got confused trying to
> > understand how the strings are serialized and deserialized. Could you
> > tell me, or point me to, the place (or chunk of code) where it is
> > described and coded?
> >
>
> Sure.  The sheet name is in org.apache.poi.hssf.model.Workbook under
> setSheetName, then org.apache.poi.hssf.record.BoundSheetRecord.
>
> As for how strings are serialized and deserialized!  HA there are
> SEVERAL convoluted and peculiar ways that Excel uses to do this.
> (Consistency is NOT the Microsoft way).  So I'll answer your question
> only as it applies to BoundSheetRecord and leave the explanation for all
> the other places for another day ;-).
> If you look in org.apache.poi.hssf.record.BoundSheetRecord you see this
> function:
>
> (http://jakarta.apache.org/poi/javadocs/javasrc/org/apache/poi/hssf/record/BoundSheetRecord_java.html#BoundSheetRecord)
>
>     public int serialize(int offset, byte [] data)
>     {
>         LittleEndian.putShort(data, 0 + offset, sid);
>         LittleEndian.putShort(data, 2 + offset,
>                               ( short ) (0x08 + getSheetnameLength()));
>         LittleEndian.putInt(data, 4 + offset, getPositionOfBof());
>         LittleEndian.putShort(data, 8 + offset, getOptionFlags());
>         data[ 10 + offset ] = getSheetnameLength();
>         data[ 11 + offset ] = getCompressedUnicodeFlag();
>
>         // we assume compressed unicode (bein the dern americans we are ;-p)
>         StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
>         return getRecordSize();
>     }
>
> Notice the comment - har har.  Actually, at the time (before POI 1.0)
> Marc and I made the design decision to address Unicode later because it
> was so inconsistent throughout Excel, so that's what that means (blush).
>
> According to page 291 of the Excel 97 Developer's Kit, at offset 10
> (including the 4 byte header) there is a 2 byte string "length" integer
> for the following (offset 12) Sheet name.  However I read somewhere that
> this is an error.  Therefore we have the
> "field_4_compressed_unicode_flag" which should be set to 0 for 8-bit or
> non-unicode (which is the default) or 1 for 16-bit unicode.  I assume
> from there (but I'm not positive) that the sheet name could just be
> stored via the string util function.
>
> I could be wrong though.
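
For reference, here is a rough sketch of that byte layout with both string
flavours. It is not POI's BoundSheetRecord; the class and method names are
invented. It assumes that the record id is 0x85, that "compressed" means one
byte per character (the low byte, the same as ISO-8859-1 for codes up to 255),
that the 16-bit form is little-endian UTF-16, and that the size field counts
bytes, so a 16-bit name takes twice its character count. Those are my
assumptions and are not confirmed anywhere in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: the layout discussed above, written with java.nio.
class BoundSheetSketch {
    static final short SID = 0x85; // assumed BOUNDSHEET record id

    // True when every character can be stored in a single byte.
    static boolean fitsInOneByte(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    static byte[] serialize(int positionOfBof, short optionFlags, String name) {
        boolean compressed = fitsInOneByte(name);
        byte[] chars = name.getBytes(compressed
                ? StandardCharsets.ISO_8859_1  // one byte per character
                : StandardCharsets.UTF_16LE);  // two bytes per character
        ByteBuffer buf = ByteBuffer.allocate(12 + chars.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort(SID);                           // offset 0: record id
        buf.putShort((short) (0x08 + chars.length)); // offset 2: size of what follows
        buf.putInt(positionOfBof);                   // offset 4: position of BOF
        buf.putShort(optionFlags);                   // offset 8: option flags
        buf.put((byte) name.length());               // offset 10: length in characters
        buf.put((byte) (compressed ? 0 : 1));        // offset 11: unicode flag
        buf.put(chars);                              // offset 12: the sheet name
        return buf.array();
    }
}
```

Read this way, the 0x08 is just the 4 + 2 + 1 + 1 bytes that sit between the
4-byte header and the name.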

Thank you, that makes it clearer. I wrote down some of my thoughts in the
previous letter.
Now I have some more to say: first of all, it took me quite a lot of code
reading and study to get closer to understanding it. %)
It is more understandable now, but I still do not have a feel for how the
different Records interact, that is, for the workflow.

>         LittleEndian.putShort(data, 0 + offset, sid);
>         LittleEndian.putShort(data, 2 + offset,
>                               ( short ) (0x08 + getSheetnameLength()));
>         LittleEndian.putInt(data, 4 + offset, getPositionOfBof());
>         LittleEndian.putShort(data, 8 + offset, getOptionFlags());
>         data[ 10 + offset ] = getSheetnameLength();
>         data[ 11 + offset ] = getCompressedUnicodeFlag();
>
>         // we assume compressed unicode (bein the dern americans we are ;-p)
>         StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);

And about consistency, for example:
As far as I understand, the first and second words are internal ones. Am I
right? (According to the Excel specs the record starts with the BOF
position.) (And what is 0x08? :)
Then the dword and the word belong to the Bound Sheet itself. That is OK.
The last two bytes define the properties of a UnicodeString in BIFF8 (or of
a simple String in BIFF7), and the string content comes after them.
I find it inconsistent that these last two property bytes are described in
the UnicodeString class as well as here (BoundSheetRecord). Quite confusing.

The next point that held me up for a while is
StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
in this class
and
StringUtil.putCompressedUnicode(unicodeString, data, 0x3 + offset);
in the UnicodeString class. (0x3: please explain.)
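
My current guess about the 0x3, written as a small sketch: the standalone
string seems to carry its own 2-byte character count plus a 1-byte flag, so
the characters start at offset 3. This is an assumption on my side, not POI's
UnicodeString code, and the class and method names below are invented.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: a string with a 2-byte length prefix and a flag byte.
class UnicodeStringSketch {
    // The caller must pass sixteenBit = true for any string with codes
    // above 255; ISO-8859-1 cannot represent those characters.
    static byte[] serialize(String s, boolean sixteenBit) {
        byte[] chars = s.getBytes(sixteenBit
                ? StandardCharsets.UTF_16LE
                : StandardCharsets.ISO_8859_1);
        ByteBuffer buf = ByteBuffer.allocate(3 + chars.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) s.length());     // offsets 0-1: character count
        buf.put((byte) (sixteenBit ? 1 : 0)); // offset 2: option flags
        buf.put(chars);                       // offset 3 (0x3): character data
        return buf.array();
    }
}
```

If that is right, the 12 in BoundSheetRecord plays the same role: it is the
first byte after that record's own fixed fields.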

Also, I do not understand the workflow here:
        try {
            String unicodeString = new String(getString().getBytes("Unicode"),"Unicode");
            if (getOptionFlags() == 0)
            {
                StringUtil.putCompressedUnicode(unicodeString, data, 0x3 + offset);
            }
            else
            {
                StringUtil.putUncompressedUnicode(unicodeString, data, 0x3 + offset);
            }
        }
        catch (Exception e) {
            if (getOptionFlags() == 0)
            {
                StringUtil.putCompressedUnicode(getString(), data, 0x3 + offset);
            }
            else
            {
                StringUtil.putUncompressedUnicode(getString(), data, 0x3 + offset);
            }
        }
What is the encoding step for? I thought that an 'A' encoded to Unicode and
decoded back from Unicode is still the same 'A'. What is it for?
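
Here is a small check of what I mean. It is a sketch with invented names, and
it assumes that the "Unicode" charset name behaves like the JDK's UTF-16
charset, which I use here through StandardCharsets to avoid the checked
exception.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encode a string to UTF-16 bytes and decode it back.
class RoundTripSketch {
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16);
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String latin = "A";
        String russian = "\u041B\u0438\u0441\u0442";
        // Both comparisons hold for well-formed strings like these.
        System.out.println(roundTrip(latin).equals(latin));
        System.out.println(roundTrip(russian).equals(russian));
    }
}
```

For ordinary strings like these the round trip gives back an equal string, so
I cannot see what the extra step changes.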


> > Sure, I found it while looking at the jUnit reports, here it is from the
> > jUnit testcases (building info):
> > Running org.apache.poi.hssf.record.TestSSTRecord
> > Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 5,879 sec
> > TEST org.apache.poi.hssf.record.TestSSTRecord FAILED
> >
>
> could you perhaps do a
>
> ./build.sh site
>
> and then browse at build/docs
>
> and find the junit test results pages?
>
> You should be able to drill down into which test failed.  My problem is
> that the tests are succeeding for me.  This could be a
> Global/localization issue.  (meaning running on a system with Russian
> language settings might have an error that running on a Linux box with
> English settings does not due to language defaults/etc).  If we can
> narrow it down to which TestSSTRecord test failed (there are 7 of them,
> I suspect it's the rich text one:
> http://jakarta.apache.org/poi/tests/junit/org/apache/poi/hssf/record/TestSSTRecord.html)
>

I will attach the XML file with the test results in a direct letter. I will
also ZIP it.
(The failing one is testcase name="testProcessContinueRecord".)
As for my platform: it is MS Win2K Pro, JDK 1.4.0, Excel XP. :)


Bye!




