We also fall back to the encoding set in the parser.character.encoding.default property (windows-1252) when we can't detect anything else.
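
To make the fallback concrete, here is a rough sketch of the idea. This is not the actual Nutch code: the class and method names are made up for illustration, the 2000-byte limit is an arbitrary choice, and the default is hard-coded rather than read from the configuration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the real HTMLParser implementation.
class EncodingSniffSketch {

    // Assumed stand-in for the value of parser.character.encoding.default.
    static final String DEFAULT_ENCODING = "windows-1252";

    // Only the start of the document is inspected; the limit is arbitrary here.
    static final int SNIFF_LIMIT = 2000;

    // Looks for charset=<name>, e.g. inside a <meta> Content-Type declaration.
    static final Pattern CHARSET_PATTERN = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-zA-Z0-9][a-zA-Z0-9_.:-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared encoding in lower case, or null if none is found.
    static String sniff(byte[] content) {
        int length = Math.min(content.length, SNIFF_LIMIT);
        // Decode as ASCII so the leading bytes can be searched with a regex.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = CHARSET_PATTERN.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1).toLowerCase(Locale.ROOT);
        }
        return null;
    }

    // Uses the sniffed encoding when there is one, otherwise the default.
    static String resolve(byte[] content) {
        String sniffed = sniff(content);
        return sniffed != null ? sniffed : DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        byte[] declared = "<meta content=\"text/html; charset=UTF-8\">"
            .getBytes(StandardCharsets.US_ASCII);
        byte[] undeclared = "<html><body>no hint</body></html>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve(declared));
        System.out.println(resolve(undeclared));
    }
}
```

A test along these lines (one document that declares a charset, one that doesn't) is roughly what a JUnit case for this behaviour would need to cover.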

On Tue, Feb 14, 2012 at 10:34 PM, Lewis John Mcgibbney <[email protected]> wrote:

> It's in HTMLParser#private static String sniffCharacterEncoding
>
> I'm still wondering where TikaParser gets the character encoding from
> though? Additionally, this doesn't look like something we check for in our
> JUnit classes? If we don't then I would like to write some tests to test
> for this.
>
> I am working on Any23 tests first, so this provides the justification
> behind my question.
>
> Thanks
>
> Lewis
>
>
> On Tue, Feb 14, 2012 at 10:00 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi,
>>
>> I can't see anywhere within our parser plugins where we detect encoding
>> of documents. I've also begun looking through the o.a.n.p package but again
>> I can't see anything.
>>
>> Can anyone provide some detail on this please?
>>
>> Thank you
>>
>> Lewis
>>
>> --
>> *Lewis*
>
> --
> *Lewis*

--
*Lewis*

