siddharthteotia opened a new pull request #5061: Improvements to data 
anonymizer tool
URL: https://github.com/apache/incubator-pinot/pull/5061
 
 
   The following improvements are based on recent usage of the tool to enhance 
an internal test framework:
   
   - Building the global dictionary was taking too long because the initial 
implementation was array based and used linear search while building the 
dictionary. We had earlier avoided a map/set to minimize heap overhead rather 
than to optimize for performance. It turns out that for higher cardinalities, 
linear search slows this tool down significantly. This PR implements a 
map-based global dictionary to keep the per-value bound at O(log N).
   
   - Fixed a bug in multi-value support. The Avro API needs an Object[] to be 
wrapped as a List via Arrays.asList.
   
   - Users of this tool might use a partial production dataset (as we are 
planning to do) to work around the tool's memory requirements. In this case, 
the global dictionary will not contain every value from the dictionaries of 
the original segments. The query generator therefore needs to be aware of this 
and skip any query whose predicate it cannot rewrite because the original 
value never made it into the global dictionary, as it was not part of the 
chosen subset of segments.
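
   To illustrate the last point, here is a minimal sketch of the skip 
behaviour. It is not the tool's actual query generator: the class name 
PredicateRewriteSketch, its methods, and the use of a plain Map to stand in 
for the global dictionary are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the PR.
final class PredicateRewriteSketch {

  // Rewrites one equality predicate, or returns empty when the original
  // literal is missing from the (possibly partial) global dictionary.
  static Optional<String> rewriteEquals(String column, String originalLiteral,
      Map<String, String> globalDictionary) {
    String derived = globalDictionary.get(originalLiteral);
    if (derived == null) {
      // The value was not in the chosen subset of segments: skip the query.
      return Optional.empty();
    }
    return Optional.of(column + " = '" + derived + "'");
  }

  // Keeps only the predicates that could be rewritten; the rest are dropped.
  static List<String> rewriteAll(String column, List<String> originalLiterals,
      Map<String, String> globalDictionary) {
    List<String> rewritten = new ArrayList<>();
    for (String literal : originalLiterals) {
      rewriteEquals(column, literal, globalDictionary).ifPresent(rewritten::add);
    }
    return rewritten;
  }
}
```

   Returning an empty Optional rather than throwing lets the caller simply 
drop the affected queries and carry on with the rest.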
   
   A new global dictionary interface is introduced. The existing array-based 
implementation is kept intact, and a new map-based concrete implementation is 
added.
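
   The shape of this change can be sketched roughly as follows. This is an 
illustration under assumptions, not the code in the PR: the interface name 
GlobalDictionarySketch, its method names, and the placeholder derived values 
("v" plus the sorted rank) are invented for the example. A TreeMap is used 
here because its insert and lookup are O(log N), matching the bound mentioned 
above; the PR's actual map type may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface; names are illustrative, not the PR's API.
interface GlobalDictionarySketch {
  void addOriginalValue(String value);
  void seal();                               // call once after all inserts
  String getDerivedValue(String original);   // null if the value is absent
}

// Array-style variant: every insert scans the list, so it costs O(N).
final class ArrayBackedDictionarySketch implements GlobalDictionarySketch {
  private final List<String> values = new ArrayList<>();

  public void addOriginalValue(String value) {
    if (!values.contains(value)) {  // linear search for duplicates
      values.add(value);
    }
  }

  public void seal() {
    Collections.sort(values);
  }

  public String getDerivedValue(String original) {
    // Only valid after seal(), because binarySearch needs sorted input.
    int rank = Collections.binarySearch(values, original);
    return rank < 0 ? null : "v" + rank;
  }
}

// Map-based variant: TreeMap insert and lookup are O(log N).
final class MapBackedDictionarySketch implements GlobalDictionarySketch {
  private final TreeMap<String, String> valueMap = new TreeMap<>();

  public void addOriginalValue(String value) {
    valueMap.putIfAbsent(value, "");
  }

  public void seal() {
    int rank = 0;
    // TreeMap iterates in ascending key order, so rank is the sorted index.
    for (Map.Entry<String, String> entry : valueMap.entrySet()) {
      entry.setValue("v" + rank++);
    }
  }

  public String getDerivedValue(String original) {
    return valueMap.get(original);
  }
}
```

   Both variants return the same derived values for the same input; the 
difference is the cost per insert, which is what matters at high cardinality, 
at the price of extra heap for the map entries.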

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
