[
https://issues.apache.org/jira/browse/ABDERA-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689256#action_12689256
]
jv ning commented on ABDERA-222:
--------------------------------
This appears to trigger when the socket read boundaries fall such that the
first byte of a multi-byte character is the first byte returned by a read from
the network socket.
In our failing case, there are 3 reads issued against the input stream returned
by the httpmethod.
1 for 4 bytes
1 for 196 bytes
1 for 3800 bytes
and then for 4 k bytes.
In our failing case, the read for 196 bytes returns fewer than 196 bytes,
and the first byte returned by the next read is the start byte of our
multi-byte character.
The multi-byte character is returned in the 3rd READ_ARRAY call and written to
position 200 in the input buffer.
When the multi-byte character is not the first byte sequence returned by read,
there is no exception.
Columns: "TIME" "method" "read byte count" "read byte count after mark resets" "where read data is written into the buffer passed to read" "read request size" "count read"
1238017735367  " AVAILABLE"  0     0     0    4     4
1238017735367  "READ_ARRAY"  0     0
1238017735367  " AVAILABLE"  4     4
1238017735367  "READ_ARRAY"  4     4     4    196   158
1238017735367  " AVAILABLE"  162   162
1238017735367  "READ_ARRAY"  162   162   200  3800  2890
1238017735370  "     CLOSE"  3052  3052
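
A read pattern like the one in this trace could probably be replayed offline with two small helpers. This is only a sketch: ChunkedInputStream and TracingInputStream are hypothetical names, not part of Abdera, http-client, or Woodstox, and the event format only approximates the log above (the instrumentation that produced that log is not shown in this report). The first class caps each array read at a chosen size so a multi-byte character can be forced to start a read; the second records each available, read, and close call.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: replays a byte array, capping each array read at a chosen size. */
class ChunkedInputStream extends InputStream {
    private final byte[] data;
    private final int[] chunkSizes;
    private int pos = 0;    // next byte of data to hand out
    private int chunk = 0;  // index of the next cap in chunkSizes

    ChunkedInputStream(byte[] data, int... chunkSizes) {
        for (int size : chunkSizes) {
            if (size <= 0) throw new IllegalArgumentException("chunk sizes must be positive");
        }
        this.data = data.clone();
        this.chunkSizes = chunkSizes.clone();
    }

    @Override
    public int available() {
        return data.length - pos;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (len == 0) return 0;
        if (pos >= data.length) return -1;
        // Simulate a short read: return at most the next configured chunk size.
        int cap = chunk < chunkSizes.length ? chunkSizes[chunk++] : len;
        int n = Math.min(Math.min(cap, len), data.length - pos);
        System.arraycopy(data, pos, b, off, n);
        pos += n;
        return n;
    }
}

/** Hypothetical helper: records available/read/close calls made on the wrapped stream. */
class TracingInputStream extends FilterInputStream {
    private final List<String> events = new ArrayList<>();
    private long total = 0; // bytes returned so far

    TracingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int available() throws IOException {
        events.add("AVAILABLE " + total);
        return super.available();
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) total++;
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        long before = total;
        int n = super.read(b, off, len);
        if (n > 0) total += n;
        // Fields: bytes read before this call, buffer offset, request size, count read.
        events.add("READ_ARRAY " + before + " " + off + " " + len + " " + n);
        return n;
    }

    @Override
    public void close() throws IOException {
        events.add("CLOSE " + total);
        super.close();
    }

    List<String> events() {
        return events;
    }
}
```

Wrapping a captured response body as new TracingInputStream(new ChunkedInputStream(bytes, 4, 158, 2890)) and passing it to the parser should reproduce the read sizes from the log, provided the parser issues the same read requests.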
> Parse failures reading utf-8 xml files that have attribute values that
> contain non US-ASCII valid utf-8 characters
> ------------------------------------------------------------------------------------------------------------------
>
> Key: ABDERA-222
> URL: https://issues.apache.org/jira/browse/ABDERA-222
> Project: Abdera
> Issue Type: Bug
> Affects Versions: 0.4.0
> Environment: Solaris x86_64, Mac OS X Leopard x86_64, Linux x86_64
> Reporter: jv ning
>
> Parsing fails for XML items fetched by http-client 3.1 when the response
> stream is passed directly to parse.
> The same items parse correctly if they are first written to a byte array and
> a ByteArrayInputStream over that byte array is passed to parse.
> parser.parse(response.getResponseBodyAsStream());
> Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
> at [row,col {unknown-source}]: [3,56]
> at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:615)
> at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:644)
> at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4554)
> at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2886)
> at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
> at org.apache.abdera.parser.stax.FOMBuilder.getNextElementToParse(FOMBuilder.java:163)
> at org.apache.abdera.parser.stax.FOMBuilder.next(FOMBuilder.java:187)
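
The byte-array workaround mentioned in the description could look roughly like the sketch below. BufferedBody and bufferFully are hypothetical names introduced here for illustration; the approach only sidesteps the short reads by handing the parser a fully buffered stream, it does not address the underlying parse failure.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: drains a stream into memory before it is parsed. */
class BufferedBody {
    static ByteArrayInputStream bufferFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        // Keep reading until end of stream; short reads are simply appended.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new ByteArrayInputStream(out.toByteArray());
    }
}
```

With this helper, the failing call would become parser.parse(BufferedBody.bufferFully(response.getResponseBodyAsStream())), at the cost of holding the whole response body in memory.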