This is a guess, and a highly speculative one at this point as I have not
looked at the source code for HWPF, but it might be that there is confusion
surrounding the paragraph mark character.

Each paragraph is terminated with a 'special' control character that Word
refers to as a paragraph mark. It could be - and I stress could, bearing in
mind that HWPF is very immature at this point - that once the document
contains Chinese text, there are issues detecting the paragraph mark
correctly.
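
To show the sort of thing I mean, here is a minimal sketch that uses only the
standard Java library. It is not taken from HWPF and I am not claiming HWPF
works this way; the class and method names are made up for illustration. It
assumes the text is held as 16-bit little-endian Unicode and that the
paragraph mark is character code 13. Under those assumptions, a scan of the
raw bytes for 0x0D will also "find" a paragraph mark inside a Chinese
character such as U+4E0D, whose low byte happens to be 0x0D.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of POI or HWPF.
final class ParagraphMarkDemo {
    // Assumption: the paragraph mark is character code 13 (carriage return).
    static final byte PARAGRAPH_MARK = 0x0D;

    // Naive approach: count every 0x0D byte in the raw data.
    static int countByByteScan(byte[] raw) {
        int count = 0;
        for (byte b : raw) {
            if (b == PARAGRAPH_MARK) {
                count++;
            }
        }
        return count;
    }

    // Decode the bytes as UTF-16LE first, then count carriage return characters.
    static int countByCharScan(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_16LE);
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\r') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E0D, U+597D) followed by one paragraph mark.
        byte[] raw = "\u4E0D\u597D\r".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("byte scan: " + countByByteScan(raw)); // 2
        System.out.println("char scan: " + countByCharScan(raw)); // 1
    }
}
```

The byte scan reports one paragraph too many because the first byte of U+4E0D
is 0x0D. English text would never trigger this, which would at least be
consistent with the behaviour described in the bug report.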

One easy way to check would be to see where HWPF is failing to detect the
end of a paragraph. Does it always have problems if the paragraph ends with
the same character, for example? Bearing in mind HWPF's immaturity, there
could also be problems associated with character encoding and the way the
application converts the raw bytes read from the file into Unicode
characters. Aside from that, I am sorry to say that I do not have
anything concrete to contribute to the discussion.
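
For the check on paragraph endings, something along these lines might help.
Again it is only a sketch using the standard Java library; the class name,
the method name and the sample strings are all made up. The idea is to pass
in the array returned by extractor.getParagraphText() from the bug report in
place of the stand-in data and look at which characters sit at the end of
each paragraph that HWPF returns.

```java
// Hypothetical helper, not part of POI or HWPF.
final class ParagraphEndings {
    // Returns the last n characters of a paragraph as U+XXXX code units,
    // so that control characters and Chinese characters are both visible.
    static String describeTail(String para, int n) {
        int start = Math.max(0, para.length() - n);
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < para.length(); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) para.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; replace with the String[] from extractor.getParagraphText().
        String[] paras = { "Hello\r", "\u4E0D\u597D\r" };
        for (int i = 0; i < paras.length; i++) {
            System.out.println(i + ": " + describeTail(paras[i], 3));
        }
    }
}
```

If the paragraphs that come out wrong always end with, or break in the middle
of, the same character, that would point towards the paragraph mark detection
rather than anything else.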

Yours

Mark B


Bugzilla from [email protected] wrote:
> 
> https://issues.apache.org/bugzilla/show_bug.cgi?id=47875
> 
>            Summary: reading word written in Chinese, paragraph nums is not
>                     correct.
>            Product: POI
>            Version: 3.2-FINAL
>           Platform: PC
>         OS/Version: Windows XP
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: HWPF
>         AssignedTo: [email protected]
>         ReportedBy: [email protected]
> 
> 
> FileInputStream fileIn = new FileInputStream("D:\\111.doc"); 
> 
> WordExtractor extractor = new WordExtractor(fileIn); 
> 
> String[] paras =extractor.getParagraphText(); 
> System.out.println(paras.length); 
> 
> 
> why the paragraph nums is not correct? Reading in English looks like no
> problem. But my word is written in Chinese.
> 
> thanks!
> 
> -- 
> Configure bugmail:
> https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/DO-NOT-REPLY--Bug-47875--New%3A-reading-word-written-in-Chinese%2C-paragraph-nums-is-not-correct.-tp25519112p25530585.html
Sent from the POI - Dev mailing list archive at Nabble.com.


