fpacanowski commented on issue #45117:
URL: https://github.com/apache/arrow/issues/45117#issuecomment-2564666463
Thank you! Based on your advice I came up with this alternative
implementation:
```ruby
def read_parquet2
  table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
  result = []
  table.each_record_batch do |record_batch|
    result.concat(record_batch['foo'].data.to_a.map { {foo: _1} })
  end
  result
end
```
Obviously this implementation is specialized to this particular schema, but
it's orders of magnitude faster (even faster than pyarrow):
```
$ bundle exec ruby --yjit read.rb
Rehearsal -------------------------------------------------
read_parquet    17.731458   0.070106  17.801564 ( 17.830338)
read_parquet2    0.218525   0.019901   0.238426 (  0.239175)
--------------------------------------- total: 18.039990sec

                     user     system      total        real
read_parquet    21.314701   0.113192  21.427893 ( 21.462763)
read_parquet2    0.169205   0.013307   0.182512 (  0.183047)
```
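For reference, the "Rehearsal" block suggests these numbers come from `Benchmark.bmbm`. A minimal sketch of what such a `read.rb` harness could look like is below; the body of the original `read_parquet` and the file layout are assumptions, only the `Benchmark.bmbm` usage is implied by the output:
```ruby
# read.rb -- hypothetical benchmark harness; only the Benchmark.bmbm
# usage is implied by the "Rehearsal" output above, the rest is assumed.
require 'benchmark'
require 'arrow'

def read_parquet
  # assumed original implementation: one Ruby Hash per record via Record#to_h
  table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
  table.each_record.map(&:to_h)
end

def read_parquet2
  # column-wise implementation shown earlier in this comment
  table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
  result = []
  table.each_record_batch do |record_batch|
    result.concat(record_batch['foo'].data.to_a.map { {foo: _1} })
  end
  result
end

Benchmark.bmbm do |x|
  x.report('read_parquet')  { read_parquet }
  x.report('read_parquet2') { read_parquet2 }
end
```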
> In general, you should not convert large data to raw Ruby objects. It's slow as you seen.
I'm aware there's a certain overhead to converting data to Ruby objects, but
it doesn't seem to be the bottleneck here: creating a million Hashes certainly
doesn't take ~20 seconds. When I looked at the code of `Record#to_h`, I
realized that calling `.to_h` on each record separately in my original
implementation caused the whole column to be fetched repeatedly:
```ruby
# record.rb
def to_h
  attributes = {}
  @container.columns.each do |column|
    attributes[column.name] = column[@index]
  end
  attributes
end
```
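For comparison, a rough sketch of a schema-agnostic, column-wise conversion that hoists the column data out of the per-record loop is below. This is my own illustration (not an existing Arrow API); it only reuses the `each_record_batch` / `record_batch[name].data.to_a` pattern from the implementation above:
```ruby
# Hypothetical helper, not an Arrow API: builds an Array of Hashes
# column-by-column instead of record-by-record.
def table_to_hashes(table)
  names = table.schema.fields.collect(&:name)
  rows = []
  table.each_record_batch do |record_batch|
    # Convert each column to a plain Ruby Array once per batch...
    columns = names.collect { |name| record_batch[name].data.to_a }
    # ...then zip the columns back into one Hash per row.
    columns.transpose.each do |values|
      rows << names.zip(values).to_h
    end
  end
  rows
end
```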
> Are you really need an Array of Hashes?
In my use case I want to load data from a Parquet file into a database, so I
need an Array of Hashes because that is what ActiveRecord's
[insert_all](https://apidock.com/rails/v7.0.0/ActiveRecord/Persistence/ClassMethods/insert_all)
accepts. I think it would be nice to provide a built-in method for doing this
in an optimized way, so I filed a separate feature request:
https://github.com/apache/arrow/issues/45122.
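For context, the end-to-end usage I have in mind looks roughly like this; `Foo` is a hypothetical ActiveRecord model and the slice size is an arbitrary choice:
```ruby
# Hypothetical usage; Foo is an assumed ActiveRecord model with a `foo`
# column, and 10_000 is an arbitrary batch size.
rows = read_parquet2
rows.each_slice(10_000) do |batch|
  # insert_all issues one multi-row INSERT per slice,
  # skipping ActiveRecord callbacks and validations.
  Foo.insert_all(batch)
end
```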