reductionista opened a new pull request #467: DL: Improve performance of 
mini-batch preprocessor
URL: https://github.com/apache/madlib/pull/467
 
 
   This commit adds the following optimizations to the minibatch preprocessor:
   
   - Skip normalization if the normalizing constant is 1.0
   
   - Split the batching query to generate buffer_ids (based on row_id)
   without moving any data around. Previously, calling `ROW_NUMBER()
   OVER()` to add row_ids to the table caused the data to be gathered
   on the master node before the rows were numbered, which for large
   datasets took most of the time.
   
   - Separate the JOIN (used for even distribution) and the conversion
   to BYTEA from the batching query. This avoids VMEM limit issues.
   
   - `num_buffers` is rounded up to the nearest multiple of `num_segments`
   so that buffers are distributed evenly across segments.
   
   - Add a new C function `array_to_bytea()` to convert an array to
   `bytea`, along with some tests for it. This is much faster than the
   Python version we were using, and speeds up the query significantly.
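
To make two of the items above concrete, here is a minimal sketch of the arithmetic involved. It is not the code from this pull request: the function names `round_up_to_multiple` and `normalize` are hypothetical, and the plain Python list stands in for whatever array type the preprocessor actually handles.

```python
def round_up_to_multiple(num_buffers, num_segments):
    """Round num_buffers up to the nearest multiple of num_segments.

    Hypothetical helper; illustrates the rounding described above so
    that every segment can hold the same number of buffers.
    """
    if num_buffers <= 0 or num_segments <= 0:
        raise ValueError("num_buffers and num_segments must be positive")
    # Integer ceiling division, then scale back up to a multiple.
    return ((num_buffers + num_segments - 1) // num_segments) * num_segments


def normalize(values, normalizing_const):
    """Divide each value by normalizing_const, skipping the work for 1.0.

    Hypothetical helper; dividing by 1.0 changes nothing, so the input
    is returned untouched instead of being copied element by element.
    """
    if normalizing_const == 1.0:
        return values
    return [v / normalizing_const for v in values]
```

For example, 10 buffers on 4 segments would become 12 buffers (3 per segment), while 12 buffers on 4 segments would stay at 12.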
   
   Additionally, this commit adds a new function `plpy_execute()` that
   prints `EXPLAIN` plans and execution time for debugging.
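
A debugging wrapper of this kind might look roughly like the sketch below. This is an assumption-laden illustration, not the `plpy_execute()` added by this commit: the name `timed_execute`, its parameters, and the output format are all hypothetical. Because `plpy.execute` only exists inside a PL/Python function, the sketch takes the executor as an argument so it can be exercised outside the database.

```python
import time


def timed_execute(execute, query, show_plan=True):
    """Print the EXPLAIN plan and wall-clock time for a query.

    Hypothetical helper. `execute` stands in for plpy.execute: any
    callable that takes a SQL string and returns a sequence of
    dict-like rows will work.
    """
    if show_plan:
        # EXPLAIN returns one row per plan line in a column
        # named "QUERY PLAN".
        for row in execute("EXPLAIN " + query):
            print(row["QUERY PLAN"])
    start = time.time()
    result = execute(query)
    elapsed = time.time() - start
    print("Execution time: %.3f s" % elapsed)
    return result
```

Inside a PL/Python function, one would presumably pass `plpy.execute` as the first argument.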
