aucahuasi commented on PR #14226: URL: https://github.com/apache/arrow/pull/14226#issuecomment-1278714058
I also experimented with the Python script provided in this related/duplicated Jira ticket: https://issues.apache.org/jira/browse/ARROW-17590

I ran the Python script three times using this PR, and I noticed that we now use less memory when reading prebuffered files. Here are the details:

================================
Arrow version: master, pre_buffer=True
```python
0 rss: 88.390625 MB
1 rss: 1374.640625 MB
pa.total_allocated_bytes 43.61480712890625 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 1294.765625 MB
3 rss: 1294.765625 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013
pandas 1.5.0
numpy 1.23.3
```

================================
Arrow version: this PR, pre_buffer=True, 1st run
```python
0 rss: 87.5 MB
1 rss: 728.921875 MB
pa.total_allocated_bytes 9.8636474609375 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 731.375 MB
3 rss: 731.375 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013
pandas 1.5.0
numpy 1.23.3
```

================================
Arrow version: this PR, pre_buffer=True, 2nd run
```python
0 rss: 87.703125 MB
1 rss: 729.5 MB
pa.total_allocated_bytes 9.8636474609375 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 610.328125 MB
3 rss: 610.328125 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013
pandas 1.5.0
numpy 1.23.3
```

================================
Arrow version: this PR, pre_buffer=True, 3rd run
```python
0 rss: 87.484375 MB
1 rss: 729.859375 MB
pa.total_allocated_bytes 9.8636474609375 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 732.34375 MB
3 rss: 732.34375 MB
pyarrow 10.0.0.dev4072+gc32f988f5.d20221014
pandas 1.5.0
numpy 1.23.3
```

================================
Arrow version: master, pre_buffer=False
```python
0 rss: 87.828125 MB
1 rss: 1385.734375 MB
pa.total_allocated_bytes 9.7957763671875 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 1538.9375 MB
3 rss: 1546.4375 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013
pandas 1.5.0
numpy 1.23.3
```

================================
Arrow version: this PR, pre_buffer=False
```python
0 rss: 87.8125 MB
1 rss: 1431.546875 MB
pa.total_allocated_bytes 9.7957763671875 MB
dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 1570.390625 MB
3 rss: 1573.8125 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013
pandas 1.5.0
numpy 1.23.3
```
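
To put the pre_buffer=True numbers in perspective, here is a minimal sketch that compares the RSS reported right after the read (the `1 rss` line) on master against the mean of the three runs with this PR. The values are copied from the logs in this comment; the variable names are only illustrative, and this is not part of the script from the Jira ticket.

```python
from statistics import mean

# "1 rss" values in MB, copied from the pre_buffer=True logs in this comment.
master_rss_mb = 1374.640625
pr_rss_mb = [728.921875, 729.5, 729.859375]  # 1st, 2nd and 3rd run with this PR

# pa.total_allocated_bytes in MB for the same runs (identical across the PR runs).
master_alloc_mb = 43.61480712890625
pr_alloc_mb = 9.8636474609375

pr_mean_mb = mean(pr_rss_mb)
reduction_pct = 100.0 * (1.0 - pr_mean_mb / master_rss_mb)
alloc_ratio = master_alloc_mb / pr_alloc_mb

print(f"mean RSS after read with this PR: {pr_mean_mb:.2f} MB")
print(f"RSS reduction vs. master: {reduction_pct:.1f}%")
print(f"total_allocated_bytes, master / PR: {alloc_ratio:.2f}x")
```

This only summarizes a single master run against three PR runs on one machine, so the percentage should be read as a rough indication rather than a benchmark result.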
