[
https://issues.apache.org/jira/browse/THRIFT-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128239#comment-17128239
]
Philipp Hausmann edited comment on THRIFT-5231 at 6/8/20, 12:10 PM:
--------------------------------------------------------------------
So I have been looking through the full Haskell API and thinking about how to
improve both performance and the API. I believe the code can be made more
idiomatic Haskell and clearer at the same time. I think the main complication
here is the header transport/protocol, because it adds state.
A rough sketch would be to build Parser instances in the code generator instead
of stringing together readVal calls. This should make it easier to parse input
in bigger chunks. The generated top-level process function would then need to
call runParser, but there it should be quite straightforward to handle
left-over bytes from the last parse. For the THeader aspect, it may be easiest
to wrap that in a StateT transformer and use it to propagate the protocol id.
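To make the sketch a bit more concrete, something along these lines is what I
have in mind (names like `Ping`, `parsePing` and `runParserLoop` are invented
for illustration; this assumes the attoparsec and transformers packages, and
error handling is elided):

```haskell
import qualified Data.Attoparsec.ByteString as A
import           Data.Attoparsec.ByteString (Parser)
import qualified Data.ByteString as BS
import           Data.Word (Word8)
import           Control.Monad.Trans.Class (lift)
import           Control.Monad.Trans.State.Strict (StateT, get, put)

-- What the code generator could emit for a struct instead of a chain
-- of readVal calls: a single attoparsec Parser.
data Ping = Ping { pingSeq :: Word8 } deriving Show

parsePing :: Parser Ping
parsePing = Ping <$> A.anyWord8

-- The generated top-level loop feeds whole chunks to the parser and
-- keeps the leftover bytes (plus the negotiated protocol id for the
-- THeader case) in a StateT layer.
type ProtocolId = Int

runParserLoop :: Monad m
              => m BS.ByteString   -- read the next chunk from the transport
              -> StateT (BS.ByteString, ProtocolId) m Ping
runParserLoop readChunk = do
  (leftover, pid) <- get
  step pid (A.parse parsePing leftover)
  where
    step pid (A.Done rest val) = put (rest, pid) >> pure val
    step pid (A.Partial cont)  = do
      chunk <- lift readChunk   -- an empty chunk would signal end of input
      step pid (cont chunk)
    step _   (A.Fail _ _ err)  = error ("parse error: " ++ err)
```

The point is that attoparsec's `Partial` continuation already does the
chunk-boundary bookkeeping for us, so the leftover bytes from `Done` can simply
be carried into the next call of the loop.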
I am also leaning towards removing the Transport instance for HeaderTransport,
because as far as I can see HeaderTransport and HeaderProtocol only make sense
if used together.
The API exposed by the generated code can probably stay quite similar and keep
API compatibility. If changes are necessary, they would probably be restricted
to the top-level setup of the main loop.
As this is quite an invasive change, I would love to get some preliminary
feedback to get a feeling for whether such a PR has a chance of being accepted.
PS: Is there some more documentation for the THeader protocol? I found
[https://github.com/apache/thrift/blob/master/doc/specs/HeaderFormat.md], but
e.g. the magic numbers are not specified, nor does it explicitly state whether
the header is sent in front of each message.
PS2: I am not sure whether it would make sense to introduce a `FromThrift`
class, store the Parsers there, and generate code accordingly.
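For PS2, the kind of class I have in mind would be roughly the following
(name and shape are just a strawman, with `SeqId` as a made-up example type):

```haskell
import qualified Data.Attoparsec.ByteString as A
import           Data.Attoparsec.ByteString (Parser)
import           Data.Word (Word8)

-- Strawman: one parser per generated type, so the top-level loop can
-- stay generic over the thing it is parsing.
class FromThrift a where
  thriftParser :: Parser a

-- The code generator would emit one instance per struct, e.g.:
newtype SeqId = SeqId Word8

instance FromThrift SeqId where
  thriftParser = SeqId <$> A.anyWord8
```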
> Improve Haskell parsing performance
> -----------------------------------
>
> Key: THRIFT-5231
> URL: https://issues.apache.org/jira/browse/THRIFT-5231
> Project: Thrift
> Issue Type: Improvement
> Components: Haskell - Library
> Affects Versions: 0.13.0
> Reporter: Philipp Hausmann
> Priority: Major
> Attachments: Main.hs, parse_benchmark.html
>
>
> We are using Thrift for (de-)serializing some Kafka messages and noticed that
> already at low throughput (1000 messages / second) a lot of CPU is used.
>
> I did a small benchmark just parsing a single T_BINARY value, and if I use
> `readVal` it takes ~3ms per iteration. If instead I run the attoparsec parser
> directly, it only takes ~300ns. This is a difference of four orders of
> magnitude! Some difference is reasonable, as `readVal` involves some IO and
> shuffling bytestrings around, but the gap looks huge.
>
> I strongly suspect the implementation of `runParser` is not optimal.
> Basically it runs the parser on 1 byte and, until the parse succeeds, appends
> 1 more byte and retries. This means that for a value of 1024 bytes we try to
> parse it roughly 1023 times, which seems rather inefficient.
>
> I am not really sure how to best fix this. In principle, it makes sense to
> feed bigger chunks to attoparsec and store the left-overs somewhere for the
> next parse. However, if we store it in the transport or protocol we have to
> implement it for each transport/protocol. Maybe an API change is necessary?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)