[ 
https://issues.apache.org/jira/browse/CAMEL-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245555#comment-16245555
 ] 

Robert Half edited comment on CAMEL-11846 at 11/9/17 12:23 PM:
---------------------------------------------------------------

Hi Viral,

As a first workaround, I wrap the InputStream in a BufferedInputStream so that
I can reset it later (no need to open the file twice). I pass the stream to an
XMLStreamReader, which reports the encoding after reading the prolog of the XML
file. Then I set that encoding for Camel in the Exchange.CHARSET_NAME header:


{code:java}
EncodingUtil.DetectedEncodingStream detectedEncodingStream =
        EncodingUtil.detectEncoding(inputStream, new StaxConverter().getInputFactory());
inputStream = detectedEncodingStream.inputStream;
exchange.getIn().setHeader(Exchange.CHARSET_NAME, detectedEncodingStream.encoding);
{code}


{code:java}
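import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class EncodingUtil {

    /** A rewound stream together with the encoding detected for it. */
    public static class DetectedEncodingStream {
        public InputStream inputStream;
        public String encoding;

        public DetectedEncodingStream(InputStream inputStream, String encoding) {
            this.inputStream = inputStream;
            this.encoding = encoding;
        }
    }

    // Read limit for mark(): reset() is only guaranteed while the parser has
    // read no more than this many bytes.
    private static final int MAX_REWINDABLE_STREAM_BUFFER = 2 * 4196;

    public static final Logger LOGGER = LoggerFactory.getLogger(EncodingUtil.class);

    public static DetectedEncodingStream detectEncoding(InputStream inputStream,
                                                        XMLInputFactory xmlInputFactory) {
        final BufferedInputStream bufferedInputStream =
                new BufferedInputStream(inputStream, MAX_REWINDABLE_STREAM_BUFFER);
        bufferedInputStream.mark(MAX_REWINDABLE_STREAM_BUFFER);
        String encoding;
        XMLStreamReader xmlStreamReader = null;
        try {
            xmlStreamReader = xmlInputFactory.createXMLStreamReader(bufferedInputStream);
            // Read the encoding while the reader is still open.
            encoding = xmlStreamReader.getCharacterEncodingScheme();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        } finally {
            try {
                bufferedInputStream.reset();
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                // The reader is still null if createXMLStreamReader failed.
                if (xmlStreamReader != null) {
                    try {
                        xmlStreamReader.close();
                    } catch (XMLStreamException e) {
                        throw new RuntimeException("Failed to close XMLStreamReader", e);
                    }
                }
            }
        }

        // No encoding declared in the prolog: fall back to UTF-8.
        if (encoding == null) {
            encoding = StandardCharsets.UTF_8.name();
        }
        return new DetectedEncodingStream(bufferedInputStream, encoding);
    }
}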
{code}
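
For anyone who wants to try the mark/reset idea outside of Camel, here is a minimal standalone sketch. It assumes the JDK's default StAX implementation instead of Camel's StaxConverter, and the class name DetectAndRewindDemo, the sample document and the 8192-byte read limit are made up for illustration. It detects the declared encoding of a UTF-16BE document, rewinds the stream, and then decodes the whole stream with the detected charset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DetectAndRewindDemo {

    // Hypothetical sample document; the prolog declares UTF-16BE.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-16BE\"?><root>\u00e4</root>";

    /** Returns the encoding declared in the prolog and rewinds the stream. */
    static String detectAndRewind(BufferedInputStream in, int readLimit)
            throws XMLStreamException, IOException {
        in.mark(readLimit);
        XMLStreamReader reader = null;
        try {
            // Passing an InputStream lets the parser detect the encoding itself.
            reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            return reader.getCharacterEncodingScheme();
        } finally {
            if (reader != null) {
                reader.close(); // per the StAX API, this does not close the stream
            }
            in.reset(); // back to the first byte of the document
        }
    }

    /** Returns {detected encoding, document decoded from the rewound stream}. */
    static String[] run() {
        try {
            byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_16BE);
            BufferedInputStream in =
                    new BufferedInputStream(new ByteArrayInputStream(bytes), 8192);
            String encoding = detectAndRewind(in, 8192);

            // Read the rewound stream completely and decode it.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
                out.write(chunk, 0, n);
            }
            String decoded = new String(out.toByteArray(), Charset.forName(encoding));
            return new String[] {encoding, decoded};
        } catch (XMLStreamException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = run();
        System.out.println("encoding = " + result[0]);
        System.out.println("intact   = " + SAMPLE.equals(result[1]));
    }
}
```

Note that reset() only works as long as the parser has not read past the mark limit, so the limit has to cover whatever the parser buffers while reading the prolog.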




> xtokenize and apply xslt to a string does not work  with UTF-16BE
> -----------------------------------------------------------------
>
>                 Key: CAMEL-11846
>                 URL: https://issues.apache.org/jira/browse/CAMEL-11846
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-core
>    Affects Versions: 2.17.5
>            Reporter: Robert Half
>
> In XML, the encoding is often declared inside the <?xml ..?> declaration. In
> general, you cannot read the declaration if you don't know the encoding, but
> XML parsers can detect several encodings, which allows them to read it. With
> that information they can read the whole file without knowing the "charset"
> in the first place.
> xtokenize and xslt use XMLInputFactory#createXMLStreamReader(Reader). But by
> providing a Reader, Camel tells the parser that it already knows the
> encoding, so the XML parser won't detect it.
> Camel also sets the charset to UTF-8 if it is not provided in a header.
> This makes the underlying reader fail when reading UTF-16.
> Using XMLInputFactory#createXMLStreamReader(InputStream) inside
> XMLTokenExpressionIterator works (tried in a patch). But the next xslt step
> fails again because it again uses a Reader.
> See the Stack Overflow question for reference:
> [https://stackoverflow.com/questions/46322376/apache-camel-to-handle-encoding-declared-in-xml-file]
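
The difference described in the issue can be reproduced without Camel. The following is only a sketch under assumptions: it uses the JDK's default StAX implementation, and the class name ReaderVersusStreamDemo and the sample document are hypothetical. It parses the same UTF-16BE bytes once through an InputStream and once through a Reader that assumes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ReaderVersusStreamDemo {

    // Hypothetical sample document, encoded as UTF-16BE without a BOM.
    static final byte[] UTF16BE_DOC =
            "<?xml version=\"1.0\" encoding=\"UTF-16BE\"?><root>text</root>"
                    .getBytes(StandardCharsets.UTF_16BE);

    /** InputStream variant: the parser detects the encoding from the bytes. */
    static boolean parsesFromStream() {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(UTF16BE_DOC));
            return drain(reader);
        } catch (XMLStreamException e) {
            return false;
        }
    }

    /** Reader variant: the bytes are decoded as UTF-8 before the parser sees them. */
    static boolean parsesFromUtf8Reader() {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new InputStreamReader(
                            new ByteArrayInputStream(UTF16BE_DOC), StandardCharsets.UTF_8));
            return drain(reader);
        } catch (XMLStreamException e) {
            return false;
        }
    }

    /** Reads all events; returns true if the document parsed without error. */
    static boolean drain(XMLStreamReader reader) throws XMLStreamException {
        try {
            while (reader.hasNext()) {
                reader.next();
            }
            return true;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) {
        System.out.println("InputStream parses: " + parsesFromStream());
        System.out.println("UTF-8 Reader parses: " + parsesFromUtf8Reader());
    }
}
```

With the Reader variant the parser receives characters that were already decoded with the wrong charset, so it has no chance to apply its own detection.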



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
