Hi all,
We've run into a scenario with strict performance requirements. After some code
reading of the query service and engine, here are a few questions:
1. The cache doesn’t work.
Assuming a query with 11,000 result rows, the first execution takes 5+
seconds, and executing it again immediately still takes 5 seconds. Looking into
the query log, both executions show 'Hit Cache: false'. The duration threshold
has been set to 3000, and the result count threshold to 10000. With additional
logging, the two queries have the same hash code and both were put into the
cache. But there is a warning log as follows, and I'm not sure whether it's the reason:
[WARN][net.sf.ehcache.pool.sizeof.ObjectGraphWalker.checkMaxDepth(ObjectGraphWalker.java:209)]
- The configured limit of 1,000 object references was reached while attempting
to calculate the size of the object graph. Severe performance degradation could
occur if the sizing operation continues. This can be avoided by setting the
CacheManger or Cache <sizeOfPolicy> elements maxDepthExceededBehavior to
"abort" or adding stop points with @IgnoreSizeOf annotations. If performance
degradation is NOT an issue at the configured limit, raise the limit value
using the CacheManager or Cache <sizeOfPolicy> elements maxDepth attribute. For
more information, see the Ehcache configuration documentation.
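As an experiment, the limit mentioned in the warning could be adjusted through
the <sizeOfPolicy> element the message refers to. A minimal sketch, assuming the
ehcache.xml that Kylin's query server loads (the maxDepth value here is only a
placeholder, not a recommendation):

    <ehcache>
        <!-- Raise the object-graph sizing limit named in the warning,
             or set maxDepthExceededBehavior="abort" to stop sizing early
             instead of risking the degradation it describes. -->
        <sizeOfPolicy maxDepth="100000" maxDepthExceededBehavior="continue"/>
    </ehcache>

That said, I haven't confirmed whether this sizing warning is actually what
prevents the entries from being served from the cache.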
2. How does offset (paged query) work?
    The user wants to query Kylin with pagination, like MySQL. I've found that
a query written as 'limit 100000 offset 20 row fetch next 10 row only' works,
meaning it fetches the next 10 rows starting from the 21st row, with an upper
limit of 100000. In fact, the limit is not necessary for the user; the only
reason to add it is to avoid the automatic 'limit 50000' being appended. Is it
possible to disable that automatic limit? (A cleaned-up sketch of the paging
syntax follows below.)
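For reference, a minimal sketch of the OFFSET ... FETCH paging syntax as I
understand Calcite accepts it; the table and column names (kylin_sales,
part_dt, price) are just the Kylin sample schema used for illustration, not
our real query:

    -- skip the first 20 rows, return the next 10 (rows 21~30)
    SELECT part_dt, SUM(price) AS total_sold
    FROM kylin_sales
    GROUP BY part_dt
    ORDER BY part_dt
    OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;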
    The second question is how a query with offset is processed. One possible
way is to scan the first 30 (or 100000) rows and let the Calcite engine filter
out only rows 21~30. Another way is to jump to the 21st row directly and scan
out only the next 10 rows. I guess we process it in the first way for now,
because there's no way to jump directly to a specific row in HBase. If my guess
is correct, what about the next query, 'limit 100000 offset 30 row fetch next
10 row only'? Do we need to scan out all the results again? If so, a cache here
should accelerate the subsequent offset queries.
3. How does 'order by desc' work?
    The 'order by' clause is also used, but some results don't look right with
'order by ... desc'. Considering a query with 10000 result rows, adding
'limit 5000' with 'order by ... desc', the result should be rows 10000~5001,
but is rows 5000~1 in fact. I guess the reason is that the ordering is
processed in the Calcite engine after the scan, and the HBase scan is always
ascending, which is why we get rows 1~5000. To resolve this, is it possible to
push the 'order by' down into the HBase scan and turn it into a reverse scan?
(A sketch of the query shape follows below.)
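To make the query shape concrete, here is a hedged sketch using the same
hypothetical sample-schema names as above; the intent is to get the descending
head of the result (rows 10000~5001), not the ascending head sorted descending:

    -- expected: the 5000 largest part_dt groups (rows 10000~5001 of the full result)
    -- observed: the 5000 smallest ones (rows 5000~1), sorted in descending order
    SELECT part_dt, SUM(price) AS total_sold
    FROM kylin_sales
    GROUP BY part_dt
    ORDER BY part_dt DESC
    LIMIT 5000;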
I'm not very familiar with the query service and engine yet, but it feels like
there is still some room for performance improvement.
Looking forward to any comments.
Best Regards,
Yerui Sun
[email protected]