[GitHub] [arrow] aucahuasi commented on pull request #14226: ARROW-17599: [C++] Change the way how arrow reads parquet buffered files

GitBox Fri, 14 Oct 2022 02:07:17 -0700


aucahuasi commented on PR #14226:
URL: https://github.com/apache/arrow/pull/14226#issuecomment-1278714058


   I also experimented with the Python script provided in this 
related/duplicate Jira ticket: 
https://issues.apache.org/jira/browse/ARROW-17590
   I ran the Python script three times with this PR and noticed that we now use 
less memory when reading pre-buffered files. Here are the details:
   
   ================================
   Arrow version: master
   pre_buffer=True
   ```python
   0 rss:  88.390625 MB
   1 rss:  1374.640625 MB
   pa.total_allocated_bytes 43.61480712890625 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  1294.765625 MB
   3 rss:  1294.765625 MB
   pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
   ```
   ================================
   Arrow version: this PR
   pre_buffer=True
   1st run
   ```python
   0 rss:  87.5 MB
   1 rss:  728.921875 MB
   pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  731.375 MB
   3 rss:  731.375 MB
   pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
   ```
   ================================
   Arrow version: this PR
   pre_buffer=True
   2nd run
   ```python
   0 rss:  87.703125 MB
   1 rss:  729.5 MB
   pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  610.328125 MB
   3 rss:  610.328125 MB
   pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
   ```
   ================================
   Arrow version: this PR
   pre_buffer=True
   3rd run
   ```python
   0 rss:  87.484375 MB
   1 rss:  729.859375 MB
   pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  732.34375 MB
   3 rss:  732.34375 MB
   pyarrow 10.0.0.dev4072+gc32f988f5.d20221014 pandas 1.5.0 numpy 1.23.3
   ```
   ================================
   Arrow version: master
   pre_buffer=False
   ```python
   0 rss:  87.828125 MB
   1 rss:  1385.734375 MB
   pa.total_allocated_bytes 9.7957763671875 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  1538.9375 MB
   3 rss:  1546.4375 MB
   pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
   ```
   ================================
   Arrow version: this PR
   pre_buffer=False
   ```python
   0 rss:  87.8125 MB
   1 rss:  1431.546875 MB
   pa.total_allocated_bytes 9.7957763671875 MB dt.nbytes 0.0014410018920898438 MB
          c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 
c20 c21 c22 c23 c24 c25 c26 c27  ...  c572  c573  c574  c575  c576  c577  c578  
c579  c580  c581  c582  c583  c584  c585  c586  c587  c588  c589  c590  c591  
c592  c593  c594  c595  c596  c597  c598  c599
   0  125000                                                                    
                                 ...  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None  None  None  None  None  None  
None  None  None  None  None  None  None  None
   
   [1 rows x 600 columns]
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 1 entries, 0 to 0
   Columns: 600 entries, c0 to c599
   dtypes: object(600)
   memory usage: 23.9 KB
   2 rss:  1570.390625 MB
   3 rss:  1573.8125 MB
   pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
   ```
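
To make the comparison easier to read, here is a small sketch that computes the relative change from the numbers in the logs above. It assumes the `1 rss` line is the measurement taken right after the read; the variable and function names are only illustrative and are not part of the original script.

```python
# Values in MB, copied from the "1 rss" and pa.total_allocated_bytes
# lines of the logs above. Names here are illustrative only.
master_pre_buffer_rss = 1374.640625
pr_pre_buffer_rss_runs = [728.921875, 729.5, 729.859375]
master_no_pre_buffer_rss = 1385.734375
pr_no_pre_buffer_rss = 1431.546875
master_pre_buffer_allocated = 43.61480712890625
pr_pre_buffer_allocated = 9.8636474609375


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Average the three PR runs so one noisy run does not dominate.
pr_pre_buffer_rss_mean = sum(pr_pre_buffer_rss_runs) / len(pr_pre_buffer_rss_runs)

rss_change_pre_buffer = percent_change(master_pre_buffer_rss, pr_pre_buffer_rss_mean)
rss_change_no_pre_buffer = percent_change(master_no_pre_buffer_rss, pr_no_pre_buffer_rss)
allocated_change_pre_buffer = percent_change(
    master_pre_buffer_allocated, pr_pre_buffer_allocated
)

print(f"pre_buffer=True  rss after read:  {rss_change_pre_buffer:+.1f}%")
print(f"pre_buffer=False rss after read:  {rss_change_no_pre_buffer:+.1f}%")
print(f"pre_buffer=True  allocated bytes: {allocated_change_pre_buffer:+.1f}%")
```

With these inputs, RSS after the read is roughly 47% lower with `pre_buffer=True`, while the `pre_buffer=False` case changes by only a few percent. Each `pre_buffer=False` configuration was run only once, so that small difference may well be run-to-run noise.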


