On 28 October 2010 08:42, LeMoyne <j...@mail2lee.com> wrote:
>
> Using the following sample from a git patch one can see one way in which the
> current counting method comes up with fewer words than other methods do.
> +1747,9
> 1.7.0.4
> 14 characters on two lines: either 2, 3 or 6 words depending on how you
> count
>
> Gedit says: 2 lines 6 words 15 chars 14 chars (no spaces)
> LibOdev says: 2 words 14 chars 14 chars excl. spaces (no stat line for
> lines, though it has para counts)
>
> Gedit takes each number as a word, breaking the words on punctuation
> Gedit also counts the newline as whitespace
> LibOdev counts any block of contiguous characters as a word
> LibOdev's in-node word counter never sees the newline
>
> Over the diff part (from qgit) of Mattias' part 1 - sw patch file, showing
> gedit / LibOdev
> Words: 2418 / 2414
> Chars: 24241 / 24241
> Chars (excl. spaces): 16830 / 16830
> Now a near match in words and a perfect match on chars excl. spaces.
>
> Testing with a different entire patch file, the major difference is in words:
> 1338 to 1533, or ~200 out of 1400 words, but the total char and char excl.
> spaces counts agree completely: 13 459 and 10 157
> Taking into account the different word handling (top) and the way they match
> then don't match, I suspect a second difference in the counting method between
> gedit and LibOdev, and differences in the line breaks in the files after cut
> and paste.
>
> So far gedit and LibOdev agree completely ONLY on the non-space counts.
>
> I didn't check results on your reference odt because gedit won't open odt and
> cut and paste just dumps the XML into the text...
> Words: 3997 / 18
> Chars: 33429 / 125
> Chars (excl. spaces): 28469 / 107
> Where the second, smaller numbers are a page footer's counts. AFAIR
> LibOdev doesn't count the footer content and that might be the difference.
> There are 20+ pages, so that's 360+ words and ~2500 chars in the footers.
>
> I also saw how the LibOdev count is zero at load of the odt. Perhaps the
> count is made somewhere else and saved on the doc without this code, or it is
> stored in the doc and loaded. Either way the word count is marked clean, so
> it is not re-counted when the dialog box calls updateStats and the excl.
> spaces count remains zero. Just clicking in the document causes a full
> recount though, and that seems too busy somehow.. <-- more than enough guessing
> there....
>
> All these tests are with the aScanner.GetLen() > 1 check in place. With
> that Len >= 2 check, the new counting routine has no problem with single
> letter words like A, a, 1, -, or just ,
> It is puzzling that Mattias removed the check to handle single char words on
> his machine but a build out of master/LibOdev works (at least for me) with
> that same check in …
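
For reference, the two word-counting conventions compared in the quoted message can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and neither function is gedit's or LibreOffice's actual implementation. They simply reproduce the numbers reported for the two-line sample. Using `\w+` for the punctuation-breaking variant is an assumption (it also treats underscores as word characters).

```python
import re

# The two-line sample from the quoted message.
SAMPLE = "+1747,9\n1.7.0.4"


def count_whitespace_words(text: str) -> int:
    """Count maximal runs of non-whitespace characters as words."""
    return len(text.split())


def count_punctuation_split_words(text: str) -> int:
    """Count maximal runs of word characters, so punctuation breaks words."""
    return len(re.findall(r"\w+", text))


def count_chars(text: str) -> int:
    """Count every character, including the newline."""
    return len(text)


def count_chars_no_spaces(text: str) -> int:
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    print("whitespace-delimited words:", count_whitespace_words(SAMPLE))
    print("punctuation-split words:   ", count_punctuation_split_words(SAMPLE))
    print("chars:                     ", count_chars(SAMPLE))
    print("chars (no spaces):         ", count_chars_no_spaces(SAMPLE))
```

On the sample this gives 2 words for the whitespace-delimited count and 6 for the punctuation-split count, with 15 characters including the newline and 14 without it, which matches the gedit and LibOdev figures quoted above.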

Hmm, I originally left that check in because it was in Norbert's sketch code,
and I figured he knew what was going on. But I definitely didn't get the right
word count with it in place, and I did when I removed it. I was quite puzzled
as to its purpose. Your explanation about the leading spaces and the SwScanner
makes sense, though, and I guess that's the reason it was there.

> I will test changing back to Mattias' simpler submission (building now).
> I must note that the block immediately after this count area word counts the
> outline numbers (and counts the bullets as words!?!) - it does not have any
> such length check at all... I think all the len=1 strings that the scanner
> might give back are just CH_TXTATR_BREAKWORD = 0x01. And they are probably
> Scanner's zero length string. Scanner's GetEnd points one slot past the end
> of the string, i.e. for SwScanner GetEnd() = GetBegin() + GetLen() (no
> -1 there). And that end spot likely has a break marker.
>
> Again gedit and LibOdev agree completely ONLY on the non-space counts.

Nice analysis! I'm at work now, but with your explanations I'll look into
things again when I get home, unless you've solved all the problems by then.

I did notice the problem LO has with counting things like isolated punctuation
as a word (and its deliberate choice to count bullets as words), but decided
not to try to change it, since I figured step 1 was to add the feature without
breaking the current behaviour :-P I also couldn't see a way to make it robust
for all languages, especially those with non-Latin alphabets and weird
punctuation markers.
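
On the GetEnd() = GetBegin() + GetLen() point and the len=1 strings: here is a toy Python sketch of a scanner with half-open token ranges, comparing a blanket length check against skipping only the break marker. Everything in it is hypothetical (`Token`, `scan`, `BREAK_MARKER` and both counting functions are invented for illustration) and it does not reproduce SwScanner. In particular, in this toy a `length > 1` filter drops single-character words, which is what I saw on my machine but not what LeMoyne reports, so which behaviour the real scanner shows is still open.

```python
from dataclasses import dataclass
from typing import Iterator

# Stand-in for the CH_TXTATR_BREAKWORD = 0x01 value mentioned in the thread.
BREAK_MARKER = "\x01"


@dataclass
class Token:
    begin: int   # index of the first character
    length: int  # number of characters

    @property
    def end(self) -> int:
        # Half-open range: end is one slot past the last character (no -1).
        return self.begin + self.length


def scan(text: str) -> Iterator[Token]:
    """Toy scanner: yields runs of non-space characters, and yields each
    break marker as its own length-1 token."""
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
        elif text[i] == BREAK_MARKER:
            yield Token(i, 1)
            i += 1
        else:
            start = i
            while i < n and not text[i].isspace() and text[i] != BREAK_MARKER:
                i += 1
            yield Token(start, i - start)


def count_with_length_check(text: str) -> int:
    """Count tokens, ignoring anything of length 1 or less."""
    return sum(1 for t in scan(text) if t.length > 1)


def count_skipping_markers(text: str) -> int:
    """Count tokens, ignoring only tokens that are exactly the break marker."""
    return sum(1 for t in scan(text) if text[t.begin:t.end] != BREAK_MARKER)


if __name__ == "__main__":
    sample = "a cat\x01 is - here"
    print(count_with_length_check(sample))
    print(count_skipping_markers(sample))
```

With the sample string the length check counts 3 tokens ("cat", "is", "here"), while skipping only the marker counts 5 ("a", "cat", "is", "-", "here").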