[
https://issues.apache.org/jira/browse/ARROW-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438905#comment-17438905
]
Kanstantsin Ilchanka commented on ARROW-13687:
----------------------------------------------
I built apache-arrow locally and it works with S3, thanks! However, is it
possible to update the brew formula without bumping its version? Is it safe to
use --force?
Questions:
* How can I pass access_token/secret_key so that I can access private files?
* How can I read/write by partitions?
I also did some testing. Here are the problems I found:
* Speed is very slow. It took almost 1 hour to download a 400 MB file through
S3, compared to 15 seconds via plain Net::HTTP. For small files the difference
is not as large. Here is a benchmark with a small file. Could this be related
to the fact that I tested with a custom brew build?
{code:java}
require 'arrow-dataset'
require 'net/http'
require 'benchmark/ips'
s3_uri = URI("s3://simpl1g-example/correct.csv")
http_uri =
  URI("https://simpl1g-example.s3.eu-central-1.amazonaws.com/correct.csv")
Benchmark.ips do |x|
  x.report('S3') { Arrow::Table.load(s3_uri) }
  x.report('Http') do
    Arrow::Table.load(Arrow::Buffer.new(Net::HTTP.get(http_uri)), format: :csv)
  end
  x.compare!
end
# Comparison:
# Http: 9.6 i/s
# S3: 4.9 i/s - 1.97x slower
{code}
* Not sure if this is a real problem, but I can't cancel the download of big
objects. The process hangs until the download finishes, which currently takes
hours, I guess because of the slow read. I can only stop it with `kill -9`.
{code:java}
Arrow::Table.load(URI("s3://big-parquet-file.parquet"))
{code}
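As a possible stopgap for the cancellation problem, the load could run in a child process that the parent kills once a deadline passes. This is only a sketch under assumptions: `run_with_deadline` is a hypothetical helper, not part of Red Arrow, it relies on `fork` (so it will not work on Windows), and `sleep` stands in for the slow `Arrow::Table.load` call. The child cannot hand the loaded table back directly, so in practice it would have to write its result somewhere, for example to a local file.

```ruby
# Hypothetical helper (not part of Red Arrow): run the block in a forked child
# process and send SIGKILL if it has not finished within `seconds`.
# Returns true/false for the child's exit status, or :killed on timeout.
def run_with_deadline(seconds)
  pid = fork do
    yield
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds
  loop do
    # WNOHANG makes wait2 return nil while the child is still running.
    finished, status = Process.wait2(pid, Process::WNOHANG)
    return status.success? if finished

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      Process.kill("KILL", pid) # same effect as `kill -9` from a shell
      Process.wait(pid)         # reap the child so it does not become a zombie
      return :killed
    end
    sleep 0.05
  end
end

# Stand-in for a slow S3 load: the child would sleep for 30 seconds,
# but the parent gives up after 0.2 seconds.
result = run_with_deadline(0.2) { sleep 30 }
puts result.inspect
```

The parent process stays responsive this way, at the cost of an extra process and of having to pass the result back out of band.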
* Loading from S3 doesn't behave the same as loading a local file. I have a TSV
file with a .csv extension. Parsing the local file works fine, but on S3 it
fails:
{code:java}
# Works fine
Arrow::Table.load("file.csv", delimiter: "\t")
Arrow::Table.load("file.csv", format: :tsv)
{code}
{code:java}
Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), delimiter: "\t")
gobject-introspection-3.4.9/lib/gobject-introspection/loader.rb:616:in
`invoke': [file-system-dataset-factory][finish]: Invalid: Error creating
dataset. Could not read schema from 'simpl1g-example/file.csv': Could not open
CSV input source 'simpl1g-example/file.csv': Invalid: CSV parse error: Row #2:
Expected 1 columns, got 2: 6 18 iPhone9,2 1635840547. Is this a
'csv' file? (Arrow::Error::Invalid)
Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), format: :tsv)
Traceback (most recent call last):
23: from bin/console:5:in `<main>'
6: from (irb):22:in `<main>'
5: from
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table.rb:29:in
`load'
4: from
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:24:in
`load'
3: from
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:56:in
`load'
2: from
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:35:in
`load_from_uri'
1: from
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:39:in
`internal_load_from_uri'
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/file-format.rb:39:in
`resolve': undefined method `[]' for nil:NilClass (NoMethodError)
{code}
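The first error looks consistent with the `delimiter: "\t"` option not reaching the CSV reader on the S3 path. A minimal sketch in plain Ruby, assuming the fields of the quoted row were separated by tabs in the real file (the mail shows them as spaces) and that the header row contains no commas: splitting the row on commas gives 2 fields, which matches "Expected 1 columns, got 2", while splitting on tabs gives the 4 fields I would expect.

```ruby
# Assumption: the row from the error message was tab-separated in the real file.
row = "6\t18\tiPhone9,2\t1635840547"

# What a reader using the default "," delimiter would see.
comma_fields = row.split(",")

# What a reader honouring delimiter: "\t" would see.
tab_fields = row.split("\t")

puts "split on comma: #{comma_fields.length} fields"
puts "split on tab:   #{tab_fields.length} fields"
```

If that reading is right, the only comma in the row is the one inside `iPhone9,2`, so the S3 load appears to fall back to the default comma delimiter.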
> [Ruby] Add support for loading table by Arrow Dataset
> -----------------------------------------------------
>
> Key: ARROW-13687
> URL: https://issues.apache.org/jira/browse/ARROW-13687
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Ruby
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Major
> Labels: pull-request-available
> Fix For: 6.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>