lwhite1 commented on issue #12618:
URL: https://github.com/apache/arrow/issues/12618#issuecomment-1183432653

   I think this map-based approach is wonderfully friendly for Java devs, as 
is the JDBC-like syntax.  I will just mention here a method that may be more 
GC-friendly. Apologies if this takes the conversation in an unhelpful direction.
   
   In the Tablesaw dataframe, row-oriented access is provided by an object 
called a “Row” (although it should probably have been called a Cursor).  The 
intent was to minimize the memory overhead of row-based access, since 
instantiating real rows would cause memory to grow by some multiple of the 
original dataframe size.  The read operations look like this: 
   
   ```
     Table t = ....;
     for (Row row: t) {
        int age = row.getInt("age");          // no boxing
        String name = row.getString("name");  // retrieve from dictionary encoding
        // do whatever else you want to do.
     } 
   
   ```
   You can also access a row by index:
   
   ```
      Row r = t.row(43); 
   ```
   and move the index programmatically, if needed. 
   
   The advantages are that 
   - the row object is created only once; it just gets its index updated as it 
moves,
   - rows don't carry a column-name attribute for each column,
   - primitive values are accessed without boxing,
   - primitive-encoded values like LocalDate can be retrieved either as 
LocalDate objects or as encoded primitives, if you don't need the whole object, 
   - you can combine iteration with filtering and postpone or avoid 
instantiating some objects until they're needed.  
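To make the first few points concrete, here is a minimal sketch of the cursor idea over column-wise storage. It is not Tablesaw's implementation: `ColumnarPeople` and `RowCursor` are hypothetical names, and the two hard-coded columns ("age" and "name") are an assumption made only to keep the example short.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical column-wise table with two fixed columns ("age", "name").
final class ColumnarPeople implements Iterable<RowCursor> {
    private final int[] ages;       // primitive column, no boxing
    private final String[] names;

    ColumnarPeople(int[] ages, String[] names) {
        if (ages.length != names.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        this.ages = ages;
        this.names = names;
    }

    int rowCount() { return ages.length; }
    int ageAt(int row) { return ages[row]; }
    String nameAt(int row) { return names[row]; }

    @Override
    public Iterator<RowCursor> iterator() {
        // One cursor is allocated per iteration, not one object per row.
        final RowCursor cursor = new RowCursor(this);
        return new Iterator<RowCursor>() {
            private int nextIndex = 0;

            @Override
            public boolean hasNext() { return nextIndex < rowCount(); }

            @Override
            public RowCursor next() {
                if (!hasNext()) throw new NoSuchElementException();
                return cursor.at(nextIndex++); // same object, new index
            }
        };
    }
}

// Hypothetical cursor: holds only a table reference and a row index.
final class RowCursor {
    private final ColumnarPeople table;
    private int index = -1;

    RowCursor(ColumnarPeople table) { this.table = table; }

    // Repositions the cursor and returns it for chaining.
    RowCursor at(int rowIndex) {
        if (rowIndex < 0 || rowIndex >= table.rowCount()) {
            throw new IndexOutOfBoundsException("row " + rowIndex);
        }
        index = rowIndex;
        return this;
    }

    int getInt(String column) {
        if (!"age".equals(column)) throw new IllegalArgumentException("no int column: " + column);
        return table.ageAt(index);
    }

    String getString(String column) {
        if (!"name".equals(column)) throw new IllegalArgumentException("no String column: " + column);
        return table.nameAt(index);
    }
}
```

A `for (RowCursor row : table)` loop over this sketch hands back the same `RowCursor` instance on every step, which is where the memory saving comes from.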
   
   You can also update using the API: 
   ```
      for (Row row: t) {
         int age = row.getInt("age");
         if (age >= 18) {
            row.setBoolean("adult", true);
         }
      } 
   ```
   
   Row-based inserts are performed using the same API by asking the table for a 
new row:
   ```
      // adds a new 'cell' to every column in the table and returns the row
      // pointing to those cells
      Row r = t.appendRow();
      r.setString("name", "Joe"); 
   ```
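One way the append step could work under the hood is sketched below. This is an assumption for illustration, not Tablesaw's code: `AppendableTable` and its nested `Writer` are hypothetical names, and the columns are plain growable arrays.

```java
import java.util.Arrays;

// Hypothetical table whose columns grow together when a row is appended.
final class AppendableTable {
    private int[] ages = new int[4];
    private String[] names = new String[4];
    private int size = 0;
    private final Writer writer = new Writer(); // single reusable row handle

    // Reusable handle pointing at one row of the enclosing table.
    final class Writer {
        private int index;

        Writer setInt(String column, int value) {
            if (!"age".equals(column)) throw new IllegalArgumentException("no int column: " + column);
            ages[index] = value;
            return this;
        }

        Writer setString(String column, String value) {
            if (!"name".equals(column)) throw new IllegalArgumentException("no String column: " + column);
            names[index] = value;
            return this;
        }
    }

    // Adds one cell to every column and returns the handle positioned on it.
    Writer appendRow() {
        if (size == ages.length) {          // grow all columns in step
            ages = Arrays.copyOf(ages, size * 2);
            names = Arrays.copyOf(names, size * 2);
        }
        writer.index = size++;
        return writer;
    }

    int rowCount() { return size; }

    int ageAt(int row) { checkRow(row); return ages[row]; }

    String nameAt(int row) { checkRow(row); return names[row]; }

    private void checkRow(int row) {
        if (row < 0 || row >= size) throw new IndexOutOfBoundsException("row " + row);
    }
}
```

In this sketch, cells that are never set keep the array defaults (`0` for the int column, `null` for the String column); a real dataframe would more likely use an explicit missing-value marker.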
   
   The main disadvantages I see are that 
   - the "Row" object cannot be safely passed around like a map; you need 
to use it in one thread and extract whatever data you need there. 
   - the operations are not as obvious to new users as a method based on 
returning maps. 
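The first disadvantage can be illustrated with a small sketch. The class names here (`NameCursor`, `CursorAliasing`) are hypothetical and the cursor is reduced to a single String column; the point is only that holding on to a reusable cursor is not the same as holding on to a row's data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-column cursor: one object, repositioned by index.
final class NameCursor {
    private final String[] names;
    private int index;

    NameCursor(String[] names) { this.names = names; }

    NameCursor at(int i) { index = i; return this; }

    String name() { return names[index]; }
}

final class CursorAliasing {
    // Pitfall: every list element is the same cursor, so all of them
    // end up reporting whichever row the cursor visited last.
    static List<NameCursor> keepCursors(String[] names) {
        NameCursor cursor = new NameCursor(names);
        List<NameCursor> kept = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            kept.add(cursor.at(i));
        }
        return kept;
    }

    // Safer: copy the values out while the cursor is on the row.
    static List<String> copyValues(String[] names) {
        NameCursor cursor = new NameCursor(names);
        List<String> copied = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            copied.add(cursor.at(i).name());
        }
        return copied;
    }
}
```

A map-per-row API does not have this problem because each map is an independent snapshot, which is the trade-off against the extra allocation.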
   
   

