Not directly Lucene related, but I'm out of ideas and I'm not a Russian
speaker...
I'm extracting text from RTF to pump into Lucene. I'm using the original
RTFEditorKit() code shown in LIA, p252 (actually, it's Nutch's RTFParser).
I have an RTF document, which starts with
---
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\froman\fprq2\fcharset204{\*\fname
Times New Roman;}Times New Roman CYR;}{\f1\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green0\blue128;\red0\green0\blue0;}
\viewkind4\uc1\pard\tx360\cf1\f0\fs20\'c1\'ee\'eb\'fc\'f8\'e8\'ed\'f1\'f2\'e2\'ee
---
which should be 'Большинство', but the RTFReader translationTable always
maps the RTF bytes to chars using latin1, and the correct translationTable
is never set. The "fcharset204" is Russian, apparently CP1251, but there's
a lovely line in the RTFReader class
/* TODO: per-font font encodings ( \fcharset control word ) ? */
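
For what it's worth, decoding the escaped bytes by hand does support the
CP1251 reading. Below is a minimal sketch, assuming the \'xx escapes can be
pulled out with a regex and decoded in one go. The class and method names
(RtfHexDecode, decodeHexEscapes) are made up for illustration and are not
part of RTFEditorKit or Nutch, and this is not a real RTF parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: it ignores all RTF structure
// (groups, font switches, control words) and looks only at \'xx escapes.
public class RtfHexDecode {

    // Matches an RTF hex escape: backslash, apostrophe, two hex digits.
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    /** Collects the bytes of every \'xx escape and decodes them with the given charset. */
    public static String decodeHexEscapes(String rtfFragment, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(rtfFragment);
        while (m.find()) {
            bytes.write(Integer.parseInt(m.group(1), 16));
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // The escaped run from the document above.
        String fragment = "\\'c1\\'ee\\'eb\\'fc\\'f8\\'e8\\'ed\\'f1\\'f2\\'e2\\'ee";

        // What the parser effectively does today: bytes read as latin1.
        System.out.println(decodeHexEscapes(fragment, StandardCharsets.ISO_8859_1));

        // What \fcharset204 asks for: bytes read as CP1251.
        System.out.println(decodeHexEscapes(fragment, Charset.forName("windows-1251")));
    }
}
```

With windows-1251 the eleven bytes come out as the expected Russian word,
while ISO-8859-1 gives accented Latin letters, which matches what I see
coming out of the parser.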
Does anyone know if the RTF above is correct? The only place the translation
table is set during the parse is when the 'ansi' keyword is encountered.
Other than that, anyone have any ideas about getting the text out of the RTF
properly?
Thanks
Antony