[jira] [Commented] (ARROW-16272) [C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

Antoine Pitrou (Jira) Mon, 30 May 2022 06:56:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17543951#comment-17543951
 ]


Antoine Pitrou commented on ARROW-16272:
----------------------------------------

The use case is fixed with https://github.com/apache/arrow/pull/13264 :

{code}
Running...
Time to create fs:  2.0029425621032715
Time to create fhandler:  0.4456977844238281
read time: 0.5826966762542725
    Summons Number Plate ID Registration State Plate Type  Issue Date  Violation Code  ... Community Board Community Council  Census Tract  BIN  BBL  NTA
0       1363745270  GGY6450                 99        PAS  07/09/2015              46  ...             NaN                NaN          NaN  NaN  NaN  NaN
1       1363745293   KXD355                 SC        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
2       1363745438  JCK7576                 PA        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
3       1363745475  GYK7658                 NY        OMS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
4       1363745487  GMT8141                 NY        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
..             ...      ...                ...        ...         ...             ...  ...             ...                ...          ...  ...  ...  ...
95      1363748464  GFV8489                 NY        PAS  07/09/2015              21  ...             NaN                NaN          NaN  NaN  NaN  NaN
96      1363748476   X15EGU                 NJ        PAS  07/09/2015              20  ...             NaN                NaN          NaN  NaN  NaN  NaN
97      1363748490  GDM1774                 NY        PAS  07/09/2015              38  ...             NaN                NaN          NaN  NaN  NaN  NaN
98      1363748531   G45DSY                 NJ        PAS  07/09/2015              37  ...             NaN                NaN          NaN  NaN  NaN  NaN
99      1363748579   RR76Y0                 PA        PAS  07/09/2015              20  ...             NaN                NaN          NaN  NaN  NaN  NaN

[100 rows x 51 columns]
total time: 3.0595762729644775
{code}
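
For anyone re-checking these numbers, the repeated `t0 = time.time()` bookkeeping in the reproducer can be folded into a small helper. The following is only a sketch: `timed` is a hypothetical helper, not part of pyarrow or pandas, and `time.sleep` stands in for the S3 calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f}s")


# Hypothetical usage: time.sleep stands in for creating the filesystem,
# opening the file and calling pd.read_csv.
timings = {}
with timed("total time", timings):
    with timed("read time", timings):
        time.sleep(0.01)
```

Because the inner block finishes first, "read time" is printed before "total time", matching the order of the log above.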

> [C++][Python] Poor read performance of S3FileSystem.open_input_file when used 
> with `pd.read_csv`
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16272
>                 URL: https://issues.apache.org/jira/browse/ARROW-16272
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 4.0.1, 5.0.0, 7.0.0
>         Environment: MacOS 12.1
> MacBook Pro
> Intel x86
>            Reporter: Sahil Gupta
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: S3FileSystem, csv, pandas, pull-request-available, s3
>             Fix For: 9.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> `pyarrow.fs.S3FileSystem.open_input_file` and 
> `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used 
> with Pandas' `read_csv`.
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = S3FileSystem(
>         anonymous=True,
>         region="us-east-2",
>         endpoint_override=None,
>         proxy_options=None,
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     # fhandler = fs.open_input_stream(
>     #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     # )
>     fhandler = fs.open_input_file(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0003612041473388672
> Time to create fhandler:  0.22461509704589844
> read time: 105.76488208770752
> total time: 105.99135684967041
> {code}
> This is with `pandas==1.4.2`.
> I get similarly poor performance with `fs.open_input_stream` (commented 
> out in the code above):
> {code}
> Running...
> Time to create fs:  0.0002570152282714844
> Time to create fhandler:  0.18540692329406738
> read time: 186.8419930934906
> total time: 187.03169012069702
> {code}
> When running it with just pandas (which uses `s3fs` under the hood), it's 
> much faster:
> {code:python}
> import pandas as pd
> import time
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> read time: 1.1012001037597656
> total time: 1.101264238357544
> {code}
> Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches `s3fs` 
> performance:
> {code:python}
> import pandas as pd
> import time
> from pyarrow.fs import S3FileSystem
> from fsspec.implementations.arrow import ArrowFSWrapper
> def load_parking_tickets():
>     print("Running...")
>     t0 = time.time()
>     fs = ArrowFSWrapper(
>         S3FileSystem(
>             anonymous=True,
>             region="us-east-2",
>             endpoint_override=None,
>             proxy_options=None,
>         )
>     )
>     print("Time to create fs: ", time.time() - t0)
>     t0 = time.time()
>     fhandler = fs._open(
>         "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
>     )
>     print("Time to create fhandler: ", time.time() - t0)
>     t0 = time.time()
>     year_2016_df = pd.read_csv(
>         fhandler,
>         nrows=100,
>     )
>     print("read time:", time.time() - t0)
>     return year_2016_df
> t0 = time.time()
> load_parking_tickets()
> print("total time:", time.time() - t0)
> {code}
> Output:
> {code}
> Running...
> Time to create fs:  0.0002467632293701172
> Time to create fhandler:  0.1858382225036621
> read time: 0.13701486587524414
> total time: 0.3232450485229492
> {code}
> Packages:
> {code}
> pyarrow=7.0.0
> pandas : 1.4.2
> numpy : 1.20.3
> {code}
> I tested it with 4.0.1, 5.0.0 as well and saw similar results.
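
One way to narrow down a slowdown like this is to count which methods the consumer calls on the file-like object it is given. The following is a sketch using only the standard library: `CallCounter` is a hypothetical helper, `io.BytesIO` stands in for the S3 file handle, and `io.TextIOWrapper` stands in for a text layer that a CSV reader might put on top. It makes no claim about what pandas does internally.

```python
import io
from collections import Counter


class CallCounter:
    """Proxy that forwards to a wrapped file-like object and counts method calls."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for attributes not defined on the proxy itself.
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr  # plain attributes such as `closed` pass through

        def counted(*args, **kwargs):
            self.calls[name] += 1
            return attr(*args, **kwargs)

        return counted


# io.BytesIO is a stand-in for the handle returned by fs.open_input_file.
raw = CallCounter(io.BytesIO(b"a,b\n1,2\n3,4\n"))
text = io.TextIOWrapper(raw, encoding="utf-8")
rows = [line.rstrip("\n").split(",") for line in text]

# Show which methods the text layer invoked on the underlying object.
for name, count in sorted(raw.calls.items()):
    print(f"{name}: {count}")
```

Wrapping the real file handle the same way before passing it to `pd.read_csv` might show whether the read pattern differs between the slow and fast paths. That is untested here.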



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
