kishoreg commented on issue #4293: Add support for an aggregation function 
returning serialized hyperlog…
URL: https://github.com/apache/incubator-pinot/pull/4293#issuecomment-502188864
 
 
   The code looks good to me. I think it's better to return the byte[] as a 
hex string instead of the string form. HllUtil has a toBytes method, and 
HllUtil.toString has some additional overhead that can be avoided. This will 
also let you use the HyperLogLog library directly to parse the byte[] into a 
HyperLogLog.
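
As a rough illustration of the hex-string suggestion, here is a minimal sketch of the round trip using only the JDK. The class name `HexCodec` and its methods are hypothetical and are not part of Pinot or of any HyperLogLog library. The sketch assumes the server hex-encodes the serialized HLL bytes and the client decodes them before handing the byte[] to whatever deserializer its HyperLogLog library provides.

```java
// Hypothetical helper, not a Pinot class: hex-encodes serialized sketch bytes
// so they can travel in a text response, and decodes them on the client.
final class HexCodec {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Encode each byte as two lowercase hex characters.
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;            // treat the byte as unsigned
            out[2 * i] = DIGITS[v >>> 4];       // high nibble
            out[2 * i + 1] = DIGITS[v & 0x0F];  // low nibble
        }
        return new String(out);
    }

    // Decode a hex string back into the original bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string has odd length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

The decoded byte[] is exactly what was serialized on the server, so no intermediate string parsing is needed on the client side.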
   
   Coming back to the original problem:
   ```
   SELECT func(m1), func(m2), ... FROM T WHERE pageId in <1000+ values> 
   GROUP BY job_title (high cardinality)
   ```
   
   There are 4 possible options. 
   
   1. No batching, just get all the results in one shot. Works if the 
cardinality of job_title is below 100k.
   2. Batch by pageId.
   3. Batch by job_title.
   4. Batch by pageId and job_title (nested loop).
   
   If possible, always pick option 1. After that, batching by job_title is the 
right solution, since the responses are mutually exclusive and the client can 
simply stitch them together without additional processing.
   
   But what you are suggesting is option 2, batching by pageId. I am not sure 
why this would be better unless there is some relationship between pageId and 
job_title such that restricting pageId automatically limits job_title.
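
To make the difference between the two batching strategies concrete, here is a minimal client-side sketch. The class `BatchStitch` and its methods are hypothetical, and each batch response is assumed to arrive as a map from job_title to an aggregate value. With batches split by job_title, every group appears in exactly one response, so the client only takes a union. With batches split by pageId, the same job_title can appear in several responses, so the client needs a merge function for every aggregate (for a distinct count that would be a sketch union, not a sum).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical client-side helper, not part of Pinot.
final class BatchStitch {

    // Batches split by job_title: groups are disjoint, so a plain union suffices.
    static <V> Map<String, V> stitchDisjoint(List<Map<String, V>> batches) {
        Map<String, V> out = new HashMap<>();
        for (Map<String, V> batch : batches) {
            for (Map.Entry<String, V> e : batch.entrySet()) {
                // A repeated key would mean the batches were not actually disjoint.
                if (out.putIfAbsent(e.getKey(), e.getValue()) != null) {
                    throw new IllegalStateException("group in two batches: " + e.getKey());
                }
            }
        }
        return out;
    }

    // Batches split by pageId: the same group can recur, so partial
    // aggregates must be combined with an aggregate-specific merge function.
    static <V> Map<String, V> mergeOverlapping(List<Map<String, V>> batches,
                                               BinaryOperator<V> merge) {
        Map<String, V> out = new HashMap<>();
        for (Map<String, V> batch : batches) {
            batch.forEach((group, partial) -> out.merge(group, partial, merge));
        }
        return out;
    }
}
```

For a SUM-style aggregate the merge function could be `Long::sum`; for a distinct count the partial results would have to be mergeable sketches, which is the extra processing that batching by job_title avoids.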
   
   Does this line of reasoning make sense? 
   
