(Glen & Marc - String serialization....help please) Re:International (Russian) support: continue

Andrew C. Oliver Mon, 08 Jul 2002 17:49:02 -0700

On Mon, 2002-07-08 at 12:36, Sergei Kozello wrote:
> Nice to meet you!
>


Thanks.  Right back at you.

Let's move this to the poi-dev list.  We're probably blowing (confusing)
the minds of a great many people at this point.

> > > I looked at the autodetection, and as far as I can see it works properly
> > > (the part where the String is checked for symbols with code > 255),
> > > but the problem with Russian is that Russian encodings use codes
> > > above 127, so this autocheck does not detect them.
> > > In CP1251 the Russian letters occupy the code range 192 - 255; in CP866
> > > the code range is not contiguous and starts at code 128.
> > >
> >
> > ouch.  I think this means that autodetection may in fact be useless as
> > well as inefficient.  Perhaps it's time to kill it.  Thoughts, Glen?
> > (Since we can't very well do autodetection with 127 or we'll cut off
> > 8-bit unicode characters as well).
> 
> It is a good idea, and it makes the point: the user can choose where to
> use Unicode himself.
> 

great.  Submit a patch, I'll apply it.
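
For the archive, here is a rough sketch of why a "code > 255" check can miss
Russian text. It is not the POI code. It assumes one particular failure mode:
the String was built from CP1251 bytes without decoding, so each byte was
simply widened to a char. The class and method names are made up for
illustration.

```java
class AutodetectSketch {
    // The kind of check discussed above: 16-bit storage is chosen only
    // when some char has a code above 255.
    static boolean needs16Bit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 255) {
                return true;
            }
        }
        return false;
    }

    // Assumed failure mode: every CP1251 byte becomes a char with the
    // same numeric value (0..255), so nothing ever exceeds 255.
    static String widenBytes(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes only the CP1251 letter range 0xC0-0xFF, which maps linearly
    // onto U+0410..U+044F. All other bytes are left widened as-is.
    static String decodeCp1251Letters(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF;
            sb.append((char) (v >= 0xC0 ? v + 0x350 : v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Privet" (hello) as CP1251 bytes
        byte[] privet = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                         (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        System.out.println(needs16Bit(widenBytes(privet)));          // false
        System.out.println(needs16Bit(decodeCp1251Letters(privet))); // true
    }
}
```

Once the text is properly decoded, the Cyrillic letters sit well above 255 and
the check fires. With raw widened bytes it never does, and lowering the
threshold to 127 would catch ordinary 8-bit characters too.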

> 
> > Except I (or whomever beats me to it) need to document this better.  The
> > Javadoc is way inadequate.  This has replaced logging as the most
> > frequent question.  (next to maybe "when will feature x be done" to
> > which I always reply "when you do it" ;-) )
> 
> One idea is to put it in the second HOWTO example. I think that would be
> convenient.
> 

great.  Submit a patch.  I'll probably let glen apply it.

> 
> Thank you, that makes it clear. I wrote down some of my thoughts in the
> previous letter.
> Now I have some more to tell: first of all, it took me quite a lot of code
> reading and study to advance toward understanding. %)
> Now it is more understandable, but I don't yet have a feel for the
> interactions between the different Records, the workflow.
> 
> >          LittleEndian.putShort(data, 0 + offset, sid);
> >          LittleEndian.putShort(data, 2 + offset,
> >                                ( short ) (0x08 + getSheetnameLength()));
> >          LittleEndian.putInt(data, 4 + offset, getPositionOfBof());
> >          LittleEndian.putShort(data, 8 + offset, getOptionFlags());
> >          data[ 10 + offset ] = getSheetnameLength();
> >          data[ 11 + offset ] = getCompressedUnicodeFlag();
> >
> >          // we assume compressed unicode (bein the dern americans we are ;-p)
> >          StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset );
> 

org.apache.poi.hssf.usermodel = primarily a high level wrapper (though
it contains some mild relationships between objects)

org.apache.poi.hssf.model = primarily the "grammar" for the file format.

org.apache.poi.hssf.records.* = the "words" for the file format.

run org.apache.poi.hssf.dev.BiffViewer on a few live files and look at
the output in your favorite editor (mine is "vi").  This will give you a
lot more understanding.

Glen?  Anything to add?


> And about consistency, for example:
> As far as I have understood, the first and second words are the inner ones.
> Am I right? (According to the Excel specs it starts from the BOF.) (And what
> is 0x008? :)

I didn't understand the above... ?

> Then dword and word are for the Bundle Sheet. It is OK.

I think I understand.  Yes.  (You're spelling out the bytes: at 0x0 is the
"sid", the Java short value for the bundlesheet record id.)  At 0x2 is the
record size.  The record size is equal to the string length + 8 (which seems
incorrect -- shouldn't it be 0xA?)
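
The arithmetic can be checked from the offsets in the quoted code alone. This
is a minimal sketch, not POI code. It assumes a compressed (one byte per
character) sheet name, and that the length field counts only the bytes that
follow the 4-byte sid/length header, which is how BIFF record lengths are
normally defined. The class and method names are made up for illustration.

```java
class BoundSheetSizeSketch {
    // sid (2 bytes at offset 0) + length field (2 bytes at offset 2)
    static final int HEADER_BYTES = 4;
    // offset at which the quoted code writes the sheet name
    static final int NAME_OFFSET = 12;

    // Value of the length field, assuming it excludes the header.
    static int lengthField(int nameLength) {
        // BOF position (4) + option flags (2) + name length (1)
        // + compressed-unicode flag (1) = 8 fixed bytes before the name
        int fixedBody = NAME_OFFSET - HEADER_BYTES;
        return fixedBody + nameLength;
    }

    // Total bytes the record occupies in the stream, header included.
    static int totalBytes(int nameLength) {
        return HEADER_BYTES + lengthField(nameLength);
    }

    public static void main(String[] args) {
        int n = "Sheet1".length();
        System.out.println(lengthField(n)); // 14
        System.out.println(totalBytes(n));  // 18
    }
}
```

Under that assumption 0x08 matches the layout: the fixed part of the body is
8 bytes, and counting the header as well would give 0x0C rather than 0xA.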

> The last two bytes define the UnicodeString (BIFF8) or simple String (BIFF7)
> properties.
> After them comes the String content.
> I feel an inconsistency, as the last two property bytes are described in the
> UnicodeString class as well as here (BoundSheetRecord). Quite confusing.
> 

I don't follow you.  UnicodeString doesn't appear to be used here.  I
don't remember exactly, but the inconsistency may or may not be ours.

But if you're hinting that the *header* in the UnicodeString record
equals the stringsize + unicode flag that is not set correctly, that's
probably correct.  I think you need to get rid of the "sheetnameLength"
field (at least from serialization) and get rid of the compressed
unicode flag, and use a UnicodeString instead.  Is that your meaning?
If so I think you're correct.

> The next point that stopped me for a period of time is
> StringUtil.putCompressedUnicode(getSheetname(), data, 12 + offset);
> in this class
> and
> StringUtil.putCompressedUnicode(unicodeString, data, 0x3 + offset);
> in the UnicodeString class. (0x3 - please explain.)
> 
> Also I do not understand the workflow:
>         try {
>             String unicodeString = new String(getString().getBytes("Unicode"),"Unicode");
>             if (getOptionFlags() == 0)
>             {
>                 StringUtil.putCompressedUnicode(unicodeString, data, 0x3 + offset);
>             }
>             else
>             {
>                 StringUtil.putUncompressedUnicode(unicodeString, data, 0x3 + offset);
>             }
>         }
>         catch (Exception e) {
>             if (getOptionFlags() == 0)
>             {
>                 StringUtil.putCompressedUnicode(getString(), data, 0x3 + offset);
>             }
>             else
>             {
>                 StringUtil.putUncompressedUnicode(getString(), data, 0x3 + offset);
>             }
>         }
> What is the encoding for? I thought that an 'A' encoded to Unicode and
> decoded from Unicode is still the same 'A'. What is it for?
> 


I don't understand what the catch statement is for, but if the unicode
flag is == 0 then write compressed unicode (8 bit) skipping 3 bytes
(probably for a header which is likely written elsewhere).  If it's not
zero then write it in "UncompressedUnicode" (16 bit), again skipping the 3
byte header.
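
A rough sketch of that branch, for illustration only. The two put methods are
hypothetical stand-ins for the StringUtil calls, written on the assumption
that "compressed" keeps the low byte of each char and "uncompressed" writes
each char as 16 bits, little-endian. The round trip uses UTF-16 on the
assumption that the "Unicode" charset name behaves the same way.

```java
import java.nio.charset.StandardCharsets;

class UnicodeWriteSketch {
    static final int HEADER_BYTES = 3; // the skipped 0x3 bytes, written elsewhere

    // Stand-in for putCompressedUnicode: one byte per char, high byte dropped.
    static void putCompressed(String s, byte[] data, int offset) {
        for (int i = 0; i < s.length(); i++) {
            data[offset + i] = (byte) s.charAt(i);
        }
    }

    // Stand-in for putUncompressedUnicode: two bytes per char, low byte first.
    static void putUncompressed(String s, byte[] data, int offset) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            data[offset + 2 * i] = (byte) (c & 0xFF);
            data[offset + 2 * i + 1] = (byte) (c >>> 8);
        }
    }

    // Flag == 0 selects the 8-bit form, anything else the 16-bit form.
    static byte[] write(String s, int optionFlags) {
        boolean compressed = (optionFlags == 0);
        byte[] data = new byte[HEADER_BYTES + s.length() * (compressed ? 1 : 2)];
        if (compressed) {
            putCompressed(s, data, HEADER_BYTES);
        } else {
            putUncompressed(s, data, HEADER_BYTES);
        }
        return data;
    }

    // The encode/decode step from the try block, using UTF-16.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(write("A", 0).length);            // 4
        System.out.println(write("\u042F", 1).length);       // 5
        System.out.println(roundTrip("A").equals("A"));      // true
    }
}
```

With these stand-ins a Cyrillic letter such as U+042F written with flag 0
loses its high byte and comes out as 0x2F, so the flag has to be non-zero for
Russian text. The round trip returns an equal String for ordinary input, which
fits the question about what that step is for.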

> 
> > > Sure, I found it while looking at the jUnit reports, here it is from the
> > > jUnit testcases (building info):
> > > Running org.apache.poi.hssf.record.TestSSTRecord
> > > Tests run: 7, Failures: 1, Errors: 0, Time elapsed: 5,879 sec
> > > TEST org.apache.poi.hssf.record.TestSSTRecord FAILED
> > >
> >
> > could you perhaps do a
> >
> > ./build.sh site
> >
> > and then browse at build/docs
> >
> > and find the junit test results pages?
> >
> > You should be able to drill down into which test failed.  My problem is
> > that the tests are succeeding for me.  This could be a
> > Global/localization issue.  (meaning running on a system with Russian
> > language settings might have an error that running on Linux box with
> > English settings does not due to language defaults/etc).  If we can
> > narrow it down to which TestSSTRecord test failed (there are 7 of them,
> > I suspect it's the rich text one:
> >
> > http://jakarta.apache.org/poi/tests/junit/org/apache/poi/hssf/record/TestSSTRecord.html)
> >
> 
> I will attach the XML file with the test results in a direct letter, and I
> will ZIP it too.
> (It is testcase name="testProcessContinueRecord".)
> As for my platform: it is MS Win2K Pro, JDK 1.4.0, Excel XP. :)
> 

Cool.  I'll take a look at it.

> 
> Bye!
> 
> 
> 
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 
-- 
http://www.superlinksoftware.com - software solutions for business
http://jakarta.apache.org/poi - Excel/Word/OLE 2 Compound Document in Java
http://krysalis.sourceforge.net/centipede - the best build/project structure
                    a guy/gal could have! - Make Ant simple on complex Projects!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

