https://bugs.freedesktop.org/show_bug.cgi?id=57776

          Priority: high
            Bug ID: 57776
          Assignee: libreoffice-bugs@lists.freedesktop.org
           Summary: Bad strings cause word count to fail in Japanese
          Severity: major
    Classification: Unclassified
                OS: Mac OS X (All)
          Reporter: ma...@telebody.net
          Hardware: All
            Status: UNCONFIRMED
           Version: 3.5.2 release
         Component: Writer
           Product: LibreOffice

Created attachment 70876
  --> https://bugs.freedesktop.org/attachment.cgi?id=70876&action=edit
Large document showing disparity of 10 between NOC and NOCES

This problem may or may not be related to bugs 56975, 54918, 54483, 55359, and
55707.

Bad strings cause word count to fail in Japanese.

LibreOffice’s Tools>Word Count function displays number of characters (NOC) and
number of characters excluding spaces (NOCES).

I discovered cases where NOCES > NOC, which should be impossible. They are
caused by certain character sequences in a Japanese document.
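To make the expected relationship concrete, here is a minimal Python sketch of the two counts. It is not LibreOffice's algorithm (which I have not seen); the function name count_chars is hypothetical, and it assumes that "characters" means Unicode code points and that "spaces" means anything Python's str.isspace() treats as whitespace, including the ideographic space U+3000.

```python
def count_chars(text):
    """Return (NOC, NOCES) for text, counting Unicode code points.

    Assumption: "space" means any character for which str.isspace() is True.
    This is an illustration only, not how LibreOffice counts.
    """
    noc = len(text)
    # NOCES drops whitespace, so it is a count over a subset of the characters.
    noces = sum(1 for ch in text if not ch.isspace())
    return noc, noces


noc, noces = count_chars("日本語 の\u3000テキスト")
# Whatever the text is, removing spaces can never add characters.
assert noces <= noc
```

With any definition along these lines NOCES can never exceed NOC, which is why a result like 980 NOC and 990 NOCES points to a counting bug.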

For example, the word count for the attached Word document (file 1) was 980 NOC
and 990 NOCES in LibreOffice (screenshot of file 2).

NOC was found to be correct, and was the same as that calculated by Microsoft
Word (screenshot of file 3).

By taking a short 100-character portion I was able to isolate a bad character
sequence (Word document of file 4), which can be used to arbitrarily increase
the difference NOCES minus NOC (screenshot of file 5).

By deleting a character from that string in each instance I was able to reduce
the difference NOCES-NOC from 6 to 1 (screenshot of file 6). The 100-character
portion also contains 1 instance, so the difference did not go to 0.

Note that the original 980 character document (file 1) has NOCES-NOC = 10. This
offset means there are 10 bad sequences in 1 page of text, a very high error
rate.

I picked out a second bad string for your comparison (file 7).

In addition, I think the function should be checked to make sure NOC is
correct. I thought I found a case where it was wrong, but it seems okay now.

There may be similar problems with English / Unicode. This text was produced on
Mac OS X and is probably Japanese Unicode... not sure about that.

I would also like to mention an enhancement request:

In general it would be useful if the user could specify characters to be
ignored when counting. In particular, some customers set a project's monetary
value based on the number of characters, not counting any English letters,
numerals, or Japanese punctuation.
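As a rough illustration of the requested behaviour, here is a Python sketch. The function name count_billable and the parameter ignore_extra are hypothetical, and the rules are my own assumptions: ASCII letters, any decimal digit (Unicode category Nd, which covers full-width digits too), and any punctuation (Unicode categories starting with P, which covers 、 。 「 」) are skipped, along with whitespace and any characters the user lists.

```python
import unicodedata


def count_billable(text, ignore_extra=""):
    """Count characters, skipping those a customer might not pay for.

    Assumed rules (illustrative only): skip whitespace, ASCII letters,
    decimal digits, punctuation, and any character in ignore_extra.
    """
    ignored = set(ignore_extra)
    count = 0
    for ch in text:
        if ch.isspace() or ch in ignored:
            continue
        if ch.isascii() and ch.isalpha():
            continue  # English letters
        category = unicodedata.category(ch)
        if category == "Nd" or category.startswith("P"):
            continue  # numerals and punctuation, ASCII or Japanese
        count += 1
    return count


sample = "LibreOffice 3.5は、日本語の文字数を数える。"
print(count_billable(sample))
print(count_billable(sample, ignore_extra="のを"))
```

Note that under these assumptions full-width Latin letters and kanji numerals are still counted unless they are passed in ignore_extra, which is one reason a user-editable list would be useful.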

In the following I would also like to mention a few points about how this
function is used in the real world (I am also a professional translator). This
is provided for completeness. Incidentally, the same Word Count function in
Microsoft Word has been a source of mystery to MS Word users for decades, so it
pays to think it through. It would be easy to make a function for Japanese that
is superior to the one in Word.

The word count function also provides a count of the “number of words” (NOW).
It is very hard to count words in Japanese, although code does exist (academic
morphological analyzers such as, IIRC, Kakashi) which gives a set of English
character strings given Japanese text. It would be useful to tell the user how
NOW is calculated in LO for Japanese text, as this can be used as a basis for
communication with a customer.

In general, people do not count Japanese words, although I have seen a client
count them, and it was impossible to refute the number without counting them
myself.

An easier way is to divide the number of characters by an average number of
characters per Japanese “word” (where counting works the same way as in
English: a single-character particle counts as one word, and a compound that
translates to two English words counts as two words). LO gave a number close to
but different from the number I got, so one wonders what the algorithm is.
Anyway, this is not a critical matter, but it is part of the mystery of this
dialog.
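A minimal sketch of that estimate in Python, assuming it amounts to dividing the character count by an average number of characters per word. The function name estimate_word_count is hypothetical, and the figure of 2.0 characters per word in the example is a placeholder of mine, not a measured value and not anything LO is known to use.

```python
def estimate_word_count(num_chars, chars_per_word):
    """Estimate a Japanese word count from a character count.

    chars_per_word is an assumed average that the user has to supply.
    """
    if chars_per_word <= 0:
        raise ValueError("chars_per_word must be positive")
    return round(num_chars / chars_per_word)


# 980 characters at an assumed 2.0 characters per word.
print(estimate_word_count(980, 2.0))
```

If LO documented the average (or whatever other rule) it applies, a translator could reproduce its number this simply and explain it to a customer.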

The uses of Word Count that I have myself seen are:
- For billing purposes. The customer sometimes bills based on number of target
English words, and sometimes based on number of source Japanese characters.
- For estimating amount of time it will take to do a job, or one’s efficiency.
- For calculating the Japanese characters per English word ratio which differs
according to the subject matter. A normal ratio is 1.7 whereas it can go up
near 3 for biochemistry, so this is important in order to ensure the billing
rate reflects the amount of effort involved.
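The ratio in the last item is simple arithmetic, sketched below in Python. The function names and all of the numbers in the example (1700 characters, 1000 words, a rate of 0.10 per word) are hypothetical and chosen only to reproduce the 1.7 ratio mentioned above.

```python
def chars_per_word_ratio(source_chars, target_words):
    """Japanese source characters per English target word."""
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return source_chars / target_words


def rate_per_source_char(rate_per_target_word, ratio):
    """Convert a per-English-word rate into an equivalent per-character rate."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return rate_per_target_word / ratio


ratio = chars_per_word_ratio(1700, 1000)  # hypothetical job: ratio of 1.7
print(ratio)
# The same per-word rate is worth less per character as the ratio rises.
print(rate_per_source_char(0.10, ratio))
print(rate_per_source_char(0.10, 3.0))
```

This is why both counts matter: a per-character rate tuned for a ratio of 1.7 over-pays relative to a per-word rate when the ratio is closer to 3, and vice versa.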

-- 
You are receiving this mail because:
You are the assignee for the bug.