[jira] [Commented] (ARROW-13687) [Ruby] Add support for loading table by Arrow Dataset

Kanstantsin Ilchanka (Jira) Thu, 04 Nov 2021 12:35:11 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438905#comment-17438905
 ]


Kanstantsin Ilchanka commented on ARROW-13687:
----------------------------------------------

I built apache-arrow locally and it works with S3, thanks! However, is it 
possible to update the brew formula without updating its version? Is it safe to 
use --force?

Questions:
 * How can I pass access_token/secret_key so that I can access private files?
 * How can I read/write by partitions?

Also I did some testing, here are problems that I found:
 * Speed is very slow. It took almost 1 hour to download a 400 MB file through S3, 
compared to 15 seconds via plain Net::HTTP. For small files the difference is not as 
large. Here is a benchmark with a small file. Could this be related to the fact that I 
tested with a custom brew build?
{code:java}
require 'arrow-dataset'
require 'net/http'
require 'benchmark/ips'

s3_uri = URI("s3://simpl1g-example/correct.csv")
http_uri = URI("https://simpl1g-example.s3.eu-central-1.amazonaws.com/correct.csv")
Benchmark.ips do |x|
  x.report('S3') { Arrow::Table.load(s3_uri) }
  x.report('Http') { Arrow::Table.load(Arrow::Buffer.new(Net::HTTP.get(http_uri)), format: :csv) }
  x.compare!
end
# Comparison:
#                 Http:        9.6 i/s
#                   S3:        4.9 i/s - 1.97x  slower
{code}
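
To put the large-file numbers next to the small-file benchmark, the figures quoted above can be turned into rough throughput. This is a back-of-the-envelope sketch only: it treats "almost 1 hour" as 3600 seconds (an assumption, so the S3 time is an upper bound), and the helper name is hypothetical.

```ruby
# Hypothetical helper: average transfer rate in size units per second.
def throughput(size, seconds)
  size.to_f / seconds
end

file_size    = 400   # size of the large file as reported above
s3_seconds   = 3600  # "almost 1 hour", rounded up (assumption)
http_seconds = 15    # Net::HTTP download time as reported above

s3_rate   = throughput(file_size, s3_seconds)
http_rate = throughput(file_size, http_seconds)
slowdown  = http_rate / s3_rate

puts format('S3:   %.2f per second', s3_rate)
puts format('HTTP: %.2f per second', http_rate)
puts format('S3 is roughly %.0fx slower', slowdown)
```

Under these assumptions the large-file gap is about two orders of magnitude, far beyond the 1.97x measured for the small file, which suggests the slowdown grows with object size.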

 * Not sure if this is a real problem, but I can't cancel the download of big objects: 
the process hangs until the download finishes, which now takes hours, I guess because of 
the slow read. I can only `kill -9` the process.

 
{code:java}
Arrow::Table.load(URI("s3://big-parquet-file.parquet"))
{code}
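
Until the load can be interrupted, one possible workaround is to run it in a child process that the parent kills after a deadline, which automates the `kill -9` step. This is a minimal sketch using only Ruby's standard library, not an Arrow feature: `run_with_deadline` is a hypothetical helper, it relies on `fork` (so Unix-like systems only), and `sleep` stands in for the slow load.

```ruby
require 'timeout'

# Hypothetical helper: run the block in a forked child process and kill the
# child if it has not exited after `seconds`. Unix-only, since it uses fork.
# Returns :finished when the child exits on its own, :killed otherwise.
def run_with_deadline(seconds)
  pid = fork do
    begin
      yield
    ensure
      # Leave the child immediately, skipping at_exit handlers. Any error
      # raised by the block is swallowed here.
      exit!(0)
    end
  end

  begin
    Timeout.timeout(seconds) { Process.wait(pid) }
    :finished
  rescue Timeout::Error
    Process.kill('KILL', pid) # same signal as `kill -9`
    Process.wait(pid)         # reap the child so no zombie is left behind
    :killed
  end
end

puts run_with_deadline(0.2) { sleep 5 }  # child is killed after 0.2 s
puts run_with_deadline(5) { sleep 0.1 }  # child finishes in time
```

With red-arrow-dataset installed, the block would presumably wrap the `Arrow::Table.load` call from the snippet above. The loaded table lives in the child process, so it would have to be written out there (for example to a local file) to be of use to the parent.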
 
 * Loading from S3 doesn't behave the same as loading a local file. I have a TSV file with 
a .csv extension. Parsing the local file works fine, but on S3 it fails:
{code:java}
# Works fine
Arrow::Table.load("file.csv", delimiter: "\t")
Arrow::Table.load("file.csv", format: :tsv)
{code}
{code:java}
Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), delimiter: "\t")
gobject-introspection-3.4.9/lib/gobject-introspection/loader.rb:616:in `invoke': [file-system-dataset-factory][finish]: Invalid: Error creating dataset. Could not read schema from 'simpl1g-example/file.csv': Could not open CSV input source 'simpl1g-example/file.csv': Invalid: CSV parse error: Row #2: Expected 1 columns, got 2: 6      18      iPhone9,2       1635840547. Is this a 'csv' file? (Arrow::Error::Invalid)

Arrow::Table.load(URI("s3://simpl1g-example/file.csv"), format: :tsv)
Traceback (most recent call last):
        23: from bin/console:5:in `<main>'
         6: from (irb):22:in `<main>'
         5: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table.rb:29:in `load'
         4: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:24:in `load'
         3: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-6.0.0/lib/arrow/table-loader.rb:56:in `load'
         2: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:35:in `load_from_uri'
         1: from /Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/arrow-table-loadable.rb:39:in `internal_load_from_uri'
/Users/k.ilcenko/.rvm/gems/ruby-2.7.4/gems/red-arrow-dataset-6.0.0/lib/arrow-dataset/file-format.rb:39:in `resolve': undefined method `[]' for nil:NilClass (NoMethodError)
{code}
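
The "Expected 1 columns, got 2" message is what a comma-delimited parse of tab-separated data would produce. That suggests (an assumption, not confirmed in this report) that the `delimiter: "\t"` option does not reach the CSV reader on the S3 path. The sketch below reproduces the column counts with Ruby's standard `csv` library rather than Arrow. The data row is copied from the error message, while the header row and the `column_counts` helper are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: number of fields per row when `text` is split on `col_sep`.
def column_counts(text, col_sep)
  CSV.parse(text, col_sep: col_sep).map(&:size)
end

# The data row is taken from the error message above; the header row is
# made up, since the real file's header is not shown.
tsv = "id\tcount\tdevice\ttimestamp\n6\t18\tiPhone9,2\t1635840547\n"

p column_counts(tsv, ',')   # comma delimiter: header has 1 field, data row has 2
p column_counts(tsv, "\t")  # tab delimiter: every row has 4 fields
```

The only comma in the data row is the one inside `iPhone9,2`, which is why a comma-based parse sees one column in the header and two in row #2.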
 

 

> [Ruby] Add support for loading table by Arrow Dataset
> -----------------------------------------------------
>
>                 Key: ARROW-13687
>                 URL: https://issues.apache.org/jira/browse/ARROW-13687
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Ruby
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
