reductionista opened a new pull request #467: DL: Improve performance of mini-batch preprocessor
URL: https://github.com/apache/madlib/pull/467

This commit adds the following optimizations to the minibatch preprocessor:

- Skip normalization if the normalizing constant is 1.0.
- Split the batching query so that `buffer_id`s are generated (based on `row_id`) without moving any data around. Previously, calling `ROW_NUMBER() OVER()` to add `row_id`s to the table caused the data to be gathered on the master node before the rows were numbered, which took most of the run time for large datasets.
- Separate the JOIN (used for even distribution) and the conversion to BYTEA from the batching query. This avoids any VMEM limit issues.
- Round `num_buffers` up to the nearest multiple of `num_segments` for even distribution of buffers across segments.
- Add a new C function `array_to_bytea()` to convert an array to `bytea`, along with some tests for it. This is much faster than the Python version we were using and speeds up the query significantly.

Additionally, this commit adds a new function `plpy_execute()` that prints `EXPLAIN` plans and execution time for debugging.
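
For readers who want the arithmetic behind two of the points above, here is a minimal Python sketch of rounding `num_buffers` up to a multiple of `num_segments` and of deriving a `buffer_id` from a `row_id`. It is not the code or SQL from the PR: the function names `round_up_num_buffers` and `assign_buffer_ids`, the 0-based `row_id`, and the fixed-size chunking rule are all assumptions made for illustration.

```python
def round_up_num_buffers(num_buffers, num_segments):
    """Round num_buffers up to the nearest multiple of num_segments.

    Hypothetical helper; the PR only states that the rounding happens,
    not how it is implemented.
    """
    if num_buffers <= 0 or num_segments <= 0:
        raise ValueError("num_buffers and num_segments must be positive")
    # Integer ceiling division, then scale back up to a multiple.
    return ((num_buffers + num_segments - 1) // num_segments) * num_segments


def assign_buffer_ids(num_rows, num_buffers):
    """Map each 0-based row_id to a buffer_id using fixed-size chunks.

    Assumption: consecutive row_ids fill one buffer before the next
    buffer starts. The PR does this in SQL; this only shows the math.
    """
    if num_rows <= 0 or num_buffers <= 0:
        raise ValueError("num_rows and num_buffers must be positive")
    # Rows per buffer, rounded up so every row lands in some buffer.
    buffer_size = (num_rows + num_buffers - 1) // num_buffers
    return [row_id // buffer_size for row_id in range(num_rows)]


if __name__ == "__main__":
    # 10 requested buffers on a 4-segment cluster become 12 buffers.
    print(round_up_num_buffers(10, 4))
    # 10 rows split into 4 buffers of up to 3 rows each.
    print(assign_buffer_ids(10, 4))
```

Because each `buffer_id` depends only on that row's `row_id`, every segment could in principle compute it locally, which is the property that avoids gathering rows on the master node.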
