[
https://issues.apache.org/jira/browse/THRIFT-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132703#comment-17132703
]
Jens Geyer edited comment on THRIFT-5231 at 6/10/20, 8:39 PM:
--------------------------------------------------------------
https://github.com/apache/thrift/blob/master/lib/java/src/org/apache/thrift/protocol/TType.java
The C++ headers seem to need some cleanup. I could track it back to commit
d42a2c2bf9630cfb4d9d49cbee1fc812e5e5777d, when the various string type
constants were added.
These numerical constants have never been used AFAIK:
* T_UTF8 = 16,
* T_UTF16 = 17
This is the right type for strings:
* T_STRING = 11, ...
And this seems plain wrong. As per the whitepaper, all strings in Thrift are
transmitted as UTF-8 across the wire, not UTF-7. So 11 should be T_UTF8, but
IMHO all of these except T_STRING should be thrown out.
* T_UTF7 = 11,
Maybe [~mcslee] wants to add more insights?
> Improve Haskell parsing performance
> -----------------------------------
>
> Key: THRIFT-5231
> URL: https://issues.apache.org/jira/browse/THRIFT-5231
> Project: Thrift
> Issue Type: Improvement
> Components: Haskell - Library
> Affects Versions: 0.13.0
> Reporter: Philipp Hausmann
> Priority: Major
> Attachments: Main.hs, parse_benchmark.html
>
>
> We are using Thrift for (de-)serializing some Kafka messages and noticed that
> already at low throughput (1000 messages / second) a lot of CPU is used.
>
> I did a small benchmark just parsing a single T_BINARY value. If I use
> `readVal` for that, it takes ~3ms per iteration. If I instead run the
> attoparsec parser directly, it only takes ~300ns. That is a difference of 4
> orders of magnitude! Some difference is reasonable, since `readVal` involves
> some IO and shuffling around of bytestrings, but the gap looks huge.
>
> I strongly suspect the implementation of `runParser` is not optimal.
> Basically it runs the parser with 1 byte and, until it succeeds, appends 1
> more byte and retries. This means that for a value of size 1024 bytes, we
> try to parse it 1023 times. This seems rather inefficient.
>
> I am not really sure how to best fix this. In principle, it makes sense to
> feed bigger chunks to attoparsec and store the left-overs somewhere for the
> next parse. However, if we store it in the transport or protocol we have to
> implement it for each transport/protocol. Maybe an API change is necessary?