fpacanowski commented on issue #45117:
URL: https://github.com/apache/arrow/issues/45117#issuecomment-2564666463

   Thank you! Based on your advice I came up with this alternative 
implementation:
   ```ruby
   def read_parquet2
     table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
     result = []
     table.each_record_batch do |record_batch|
       result.concat(record_batch['foo'].data.to_a.map { {foo: _1} })
     end
     result
   end
   ```
   This implementation is obviously specialized to this particular schema, but 
it's about two orders of magnitude faster (roughly 0.18 s versus 21 s in the 
run below), and even faster than pyarrow.
   ```
   $ bundle exec ruby --yjit read.rb
   Rehearsal -------------------------------------------------
   read_parquet   17.731458   0.070106  17.801564 ( 17.830338)
   read_parquet2   0.218525   0.019901   0.238426 (  0.239175)
   --------------------------------------- total: 18.039990sec
   
                       user     system      total        real
   read_parquet   21.314701   0.113192  21.427893 ( 21.462763)
   read_parquet2   0.169205   0.013307   0.182512 (  0.183047)
   ```
   
   > In general, you should not convert large data to raw Ruby objects. It's 
slow as you seen.
   
   I'm aware there's a certain overhead to converting data to Ruby objects, but 
it doesn't seem to be the bottleneck: creating a million Hashes certainly 
doesn't take ~20 seconds. When I looked at the code of `Record#to_h`, I 
realized that calling `.to_h` on each record separately, as my original 
implementation did, caused the whole column to be fetched repeatedly.
   ```ruby
   # record.rb
   
   def to_h
     attributes = {}
     @container.columns.each do |column|
       attributes[column.name] = column[@index]
     end
     attributes
   end
   ```
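
To illustrate why the per-record lookup adds up, here is a toy model in plain Ruby. It is not red-arrow's code: `ToyColumn`, `rows_via_lookup` and `rows_via_bulk` are hypothetical names, and the assumption that every `column[index]` call pays a whole-column cost is only my reading of the behaviour described above.

```ruby
# Toy model only: this is NOT red-arrow's implementation.
class ToyColumn
  attr_reader :name, :materializations

  def initialize(name, values)
    @name = name
    @values = values
    @materializations = 0
  end

  def size
    @values.size
  end

  # Assumption: every element lookup pays a whole-column cost.
  def [](index)
    to_a[index]
  end

  # Copies the whole column into a Ruby Array and counts the copy.
  def to_a
    @materializations += 1
    @values.dup
  end
end

# Row-wise: one lookup per record, similar to calling to_h on every record.
def rows_via_lookup(column)
  Array.new(column.size) { |index| { column.name => column[index] } }
end

# Column-wise: materialize the column once, then build the Hashes.
def rows_via_bulk(column)
  column.to_a.map { |value| { column.name => value } }
end

slow = ToyColumn.new(:foo, (1..1_000).to_a)
fast = ToyColumn.new(:foo, (1..1_000).to_a)

puts rows_via_lookup(slow) == rows_via_bulk(fast) # same rows either way
puts slow.materializations # one whole-column copy per record
puts fast.materializations # a single whole-column copy
```

Under that assumption the row-wise path does work proportional to rows × column length, while the column-wise path touches each column once, which would be consistent with the gap in the benchmark.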
   
   > Are you really need an Array of Hashes?
   
   In my use case I want to load data from a Parquet file into a database, so I 
need an Array of Hashes, because that is what ActiveRecord's 
[insert_all](https://apidock.com/rails/v7.0.0/ActiveRecord/Persistence/ClassMethods/insert_all)
 accepts. I think it would be nice to provide a built-in method for doing this 
in an optimized way. I filed a separate feature request: 
https://github.com/apache/arrow/issues/45122.
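
For schemas with more than one column, the same column-wise idea can be sketched in plain Ruby. This is only an illustration, not a proposed red-arrow API: `columns_to_rows` is a hypothetical helper, and it assumes each column has already been converted to a Ruby Array once (for example with `record_batch['foo'].data.to_a` as in `read_parquet2`).

```ruby
# Hypothetical helper: turns { name => [values...] } into an Array of Hashes.
# Each column Array is read once; rows are assembled with transpose + zip.
# Array#transpose raises IndexError if the columns differ in length.
def columns_to_rows(columns)
  names = columns.keys
  columns.values.transpose.map { |values| names.zip(values).to_h }
end

columns = {
  foo: [1, 2, 3],
  bar: %w[a b c]
}

rows = columns_to_rows(columns)
p rows

# The rows can then be handed over in slices, e.g. to insert_all.
# `Model` would be a hypothetical ActiveRecord class:
#   rows.each_slice(1_000) { |slice| Model.insert_all(slice) }
rows.each_slice(2) { |slice| p slice.size }
```

Slicing keeps each insert statement bounded in size; the slice size of 1,000 in the comment is an arbitrary placeholder.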


