The config files are XML, and I changed them to be handled by the XML parser (as InputStreams), so the XML parser reads the encoding from the XML header.
But JSON is defined to be UTF-8, so we must supply the encoding (IOUtils.UTF8_CHARSET).

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Dawid Weiss [mailto:[email protected]]
> Sent: Thursday, July 05, 2012 5:00 PM
> To: [email protected]
> Subject: Question about solr config files encoding.
>
> Guys should the encoding of config files really be platform-dependent?
> Currently Solr tests fail massively on setup because of things like
> this:
>
> public OpenExchangeRates(InputStream ratesStream) throws IOException {
>   parser = new JSONParser(new InputStreamReader(ratesStream));
>
> this reader, when confronted with UTF-16 as file.encoding results in funky
> exceptions like:
>
> > Caused by: org.apache.noggit.JSONParser$ParseException: JSON Parse
> Error: char=笊,position=0 BEFORE='笊'
> AFTER='†≤楳捬慩浥爢㨠≔桩猠摡瑡猠捯汬散瑥搠晲潭⁶慲楯畳⁰牯癩摥牳
> 湤⁰牯癩摥搠晲'
> >     at org.apache.noggit.JSONParser.err(JSONParser.java:221)
> >     at org.apache.noggit.JSONParser.next(JSONParser.java:620)
> >     at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661)
> >     at org.apache.solr.schema.OpenExchangeRatesOrgProvider$OpenExchangeRates.<init>(OpenExchangeRatesOrgProvider.java:189)
> >     at org.apache.solr.schema.OpenExchangeRatesOrgProvider.reload(OpenExchangeRatesOrgProvider.java:129)
>
> Can we fix the encoding of these input files to UTF-8 or something?
> According to JSON RFC:
>
> http://tools.ietf.org/html/rfc4627#section-3
>
> JSON text SHALL be encoded in Unicode. The default encoding is
> UTF-8.
>
> Since the first two characters of a JSON text will always be ASCII
> characters [RFC0020], it is possible to determine whether an octet
> stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
> at the pattern of nulls in the first four octets.
>
>   00 00 00 xx  UTF-32BE
>   00 xx 00 xx  UTF-16BE
>   xx 00 00 00  UTF-32LE
>   xx 00 xx 00  UTF-16LE
>   xx xx xx xx  UTF-8
>
> We could just enforce/require UTF-8? Alternatively, auto-detect this from a
> binary stream as a custom Reader class.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
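
For reference, here is a rough sketch of the two options discussed in this thread: always decoding as UTF-8, or auto-detecting the encoding from the null pattern in the first four octets as RFC 4627 section 3 describes. The class name JsonCharsetSniffer and its methods are made up for illustration and are not part of Solr or noggit. It uses the JDK's StandardCharsets rather than IOUtils.UTF8_CHARSET so that it stands alone, and it assumes the JVM ships the UTF-32BE/UTF-32LE charsets (they are not among the charsets every Java platform is required to support).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Solr or noggit class.
final class JsonCharsetSniffer {

    /** Simplest option: always decode as UTF-8, independent of file.encoding. */
    static Reader openUtf8(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    /** Guesses the charset from the null pattern of the first four octets (RFC 4627, section 3). */
    static Charset detect(byte[] head, int len) {
        if (len >= 4) {
            boolean z0 = head[0] == 0, z1 = head[1] == 0, z2 = head[2] == 0, z3 = head[3] == 0;
            // Assumption: the UTF-32 charsets are available on this JVM.
            if (z0 && z1 && z2 && !z3) return Charset.forName("UTF-32BE");   // 00 00 00 xx
            if (!z0 && z1 && z2 && z3) return Charset.forName("UTF-32LE");   // xx 00 00 00
            if (z0 && !z1 && z2 && !z3) return StandardCharsets.UTF_16BE;    // 00 xx 00 xx
            if (!z0 && z1 && !z2 && z3) return StandardCharsets.UTF_16LE;    // xx 00 xx 00
        }
        return StandardCharsets.UTF_8;                                       // xx xx xx xx, or too short to tell
    }

    /** Alternative: peek at up to four octets, push them back, then decode with the guessed charset. */
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int len = 0;
        while (len < 4) {
            int n = pb.read(head, len, 4 - len);
            if (n < 0) break;          // stream ended before four octets were read
            len += n;
        }
        if (len > 0) pb.unread(head, 0, len);
        return new InputStreamReader(pb, detect(head, len));
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "{\"rates\":{}}".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detect(bytes, bytes.length));
    }
}
```

In the OpenExchangeRates constructor, either openUtf8(ratesStream) or open(ratesStream) would take the place of new InputStreamReader(ratesStream). Requiring UTF-8 is the smaller change; the sniffing variant only matters if UTF-16 or UTF-32 rate files have to be accepted.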
