[ https://issues.apache.org/jira/browse/KAFKA-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265886#comment-16265886 ]

Ewen Cheslack-Postava commented on KAFKA-6168:
----------------------------------------------

[~tedyu] I think the best next step before pursuing any more optimizations would 
be to add at least a basic JMH benchmark, and probably better is to drive most 
of these optimizations off of profiling of real workloads. We're getting into 
micro-optimizations that should really be data-driven as much as possible. This 
even ties back to the previous discussion re: precomputing vs caching the 
hashcode -- I would feel much more comfortable making a decision one way or the 
other there if we could have said, "regardless of whether a schema is always 
reused (e.g. predefined by the connector), is generated once per record (e.g. 
connector that doesn't attempt to cache & reuse varying schemas), or N are 
reused (e.g. connector that does some caching), simply doing X performs as well 
or near enough that it is the best choice". This could be via microbenchmarks 
(probably the easiest) or better end-to-end benchmarks.
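To make the shape of such a measurement concrete, here is a rough, hand-rolled timing sketch. {{HashCodeTiming}} and {{SchemaLike}} are hypothetical stand-ins for {{ConnectSchema}}, not real Kafka classes, and a proper JMH benchmark (with warmup iterations and blackhole consumption) would control for JIT and dead-code elimination far better than this does:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for ConnectSchema; only meant to illustrate
// cached vs. recomputed hashCode. A real benchmark should use JMH.
public class HashCodeTiming {
    static final class SchemaLike {
        final String name;
        final List<String> fields;
        final int cachedHash; // precomputed once; safe because the object is immutable

        SchemaLike(String name, List<String> fields) {
            this.name = name;
            this.fields = fields;
            this.cachedHash = Objects.hash(name, fields);
        }

        int recomputedHash() {
            // What recomputing on every call looks like.
            return Objects.hash(name, fields);
        }
    }

    public static void main(String[] args) {
        SchemaLike s = new SchemaLike("big.schema",
                Arrays.asList("f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"));
        int sink = 0; // accumulate results so the JIT can't discard the loops
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) sink += s.recomputedHash();
        long recomputeNs = System.nanoTime() - t0;
        t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) sink += s.cachedHash;
        long cachedNs = System.nanoTime() - t0;
        System.out.println("recompute ns=" + recomputeNs
                + " cached ns=" + cachedNs + " sink=" + sink);
    }
}
```

Even this crude harness would need to be run across the reuse patterns described above (always reused, generated per record, N reused) before drawing conclusions.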

The impact of stuff like precomputing this string/byte[] representation is 
really hard to guess at. There are a lot of factors, including (off the top 
of my head), size and complexity of the schemas, reuse, size of the "flattened" 
representation and impact on allocations & GC, how long some schemas (and this 
flattened representation) are held on to, frequency of equality checks (which 
is probably tied heavily to hash collisions given the way many equality checks 
occur), and probably plenty more I'm not thinking of.
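For reference, the combination being debated (precomputed hash, cheapest checks first, a cached flattened representation) could look something like the sketch below. {{SimpleSchema}} is hypothetical, not the real {{ConnectSchema}}, and the {{fingerprint}} field is exactly the kind of memory-for-speed trade-off that should be measured rather than assumed:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical immutable schema: hash computed once in the constructor,
// equals() rejects via the cached hash before any deep comparison.
public final class SimpleSchema {
    private final String type;
    private final String name;
    private final List<String> fields;   // stand-in for real Field objects
    private final int hash;              // precomputed; valid because immutable
    private final byte[] fingerprint;    // cached "flattened" representation

    public SimpleSchema(String type, String name, List<String> fields) {
        this.type = type;
        this.name = name;
        this.fields = List.copyOf(fields);
        this.hash = Objects.hash(type, name, this.fields);
        // One possible flattened form; a naive separator like "|" can collide,
        // so a real implementation would need an unambiguous encoding.
        this.fingerprint = (type + "|" + name + "|" + String.join(",", fields))
                .getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleSchema)) return false;
        SimpleSchema that = (SimpleSchema) o;
        if (this.hash != that.hash) return false;  // cheapest rejection first
        // Single linear scan instead of a recursive deep comparison.
        return Arrays.equals(this.fingerprint, that.fingerprint);
    }
}
```

Whether the fingerprint actually pays for its allocation and retention cost depends on the factors listed above, which is the point about needing real data.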

> Connect Schema comparison is slow for large schemas
> ---------------------------------------------------
>
>                 Key: KAFKA-6168
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6168
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 1.0.0
>            Reporter: Randall Hauch
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 1.1.0
>
>         Attachments: 6168.v1.txt
>
>
> The {{ConnectSchema}} implementation computes the hash code every time it's 
> needed, and {{equals(Object)}} is a deep equality check. This extra work can 
> be expensive for large schemas, especially in code like the {{AvroConverter}} 
> (or rather {{AvroData}} in the converter) that uses instances as keys in a 
> hash map that then requires significant use of {{hashCode}} and {{equals}}.
> The {{ConnectSchema}} is an immutable object and should at a minimum 
> precompute the hash code. Also, the order that the fields are compared in 
> {{equals(...)}} should use the cheapest comparisons first (e.g., the {{name}} 
> field is one of the _last_ fields to be checked). Finally, it might be worth 
> considering having each instance precompute and cache a string or byte[] 
> representation of all fields that can be used for faster equality checking.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
