siddharthteotia opened a new pull request #5061: Improvements to data anonymizer tool
URL: https://github.com/apache/incubator-pinot/pull/5061

The following improvements are based on the latest usage of the tool to enhance an internal test framework:

- Building the global dictionary was taking too long, because the initial implementation was array based and used linear search during the build. We had earlier avoided map/set to minimize heap overhead rather than optimize for performance, but for higher cardinalities linear search significantly slows down the tool. This PR implements a Map based global dictionary to keep the search bound at O(logN).
- Fixed a bug in multi-value support: the Avro API needs an Object[] to be wrapped as a List via Arrays.asList.
- Users of this tool might run it on a partial production dataset (as we are planning to do) to work around its memory requirements. In that case the global dictionary will not contain every value from the dictionaries of the original segments. The query generator needs to be aware of this and should skip any query whose predicate it cannot rewrite because the original value never made it into the global dictionary, as it was not part of the chosen subset of segments.

A new global dictionary interface is implemented. The existing array based implementation is kept intact, and a new map based concrete implementation is added.
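
For readers of this thread, the three changes can be illustrated with a minimal sketch. This is not the code in the PR: the class, interface, and method names below are hypothetical, a TreeMap is assumed because the description mentions an O(logN) bound, and the "v0", "v1", ... derived values are placeholders rather than the tool's actual anonymized output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from the Pinot code base.
class GlobalDictionarySketch {

  // Assumed shape of a global dictionary: collect originals, then map each to a derived value.
  interface GlobalDictionary {
    void addOriginalValue(String original);
    void seal();
    String getDerivedValue(String original); // null when the value was never added
  }

  // Map based implementation: inserts and lookups are O(log N) with a TreeMap,
  // instead of a linear scan over an array for every value.
  static class MapBasedDictionary implements GlobalDictionary {
    private final TreeMap<String, String> _map = new TreeMap<>();

    @Override
    public void addOriginalValue(String original) {
      _map.putIfAbsent(original, ""); // derived value is filled in by seal()
    }

    @Override
    public void seal() {
      int ordinal = 0;
      // TreeMap iterates in sorted key order, so placeholders follow that order.
      for (Map.Entry<String, String> entry : _map.entrySet()) {
        entry.setValue("v" + ordinal++);
      }
    }

    @Override
    public String getDerivedValue(String original) {
      return _map.get(original);
    }
  }

  // Query generator side: returns null when the predicate cannot be rewritten,
  // which happens when only a subset of segments fed the dictionary.
  static String rewriteEqPredicate(String column, String originalValue, GlobalDictionary dictionary) {
    String derived = dictionary.getDerivedValue(originalValue);
    if (derived == null) {
      return null; // caller is expected to skip this query
    }
    return column + " = '" + derived + "'";
  }

  // Multi-value columns: wrap the Object[] as a List before handing it to an Avro record.
  static List<Object> toAvroArrayValue(Object[] multiValue) {
    return Arrays.asList(multiValue);
  }

  public static void main(String[] args) {
    GlobalDictionary dictionary = new MapBasedDictionary();
    for (String value : new String[] {"cherry", "apple", "banana", "apple"}) {
      dictionary.addOriginalValue(value);
    }
    dictionary.seal();

    System.out.println(rewriteEqPredicate("fruit", "banana", dictionary)); // fruit = 'v1'
    System.out.println(rewriteEqPredicate("fruit", "durian", dictionary)); // null, query is skipped
    System.out.println(toAvroArrayValue(new Object[] {"a", "b"}));         // [a, b]
  }
}
```

The trade-off is the one described above: the map costs more heap per entry than a plain array, but avoids a linear search per value at high cardinality.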
