[
https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Gupta updated ARROW-16272:
--------------------------------
Description:
`pyarrow.fs.S3FileSystem.open_input_file` and
`pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with
pandas' `read_csv`.
```python
import pandas as pd
import time

from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0003612041473388672
Time to create fhandler: 0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.
We get similar performance with `fs.open_input_stream` as well (commented out
in the code above):
```shell
Running...
Time to create fs: 0.0002570152282714844
Time to create fhandler: 0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
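One plausible explanation (speculation on my part) is that `pd.read_csv` issues many small `read()` calls on the handle, and the raw pyarrow S3 file is unbuffered, so each small read can turn into a separate S3 request. If that is the cause, asking pyarrow to buffer the stream should help; a minimal sketch using the documented `buffer_size` keyword of `open_input_stream` (the 4 MiB value is an arbitrary guess, untested here):
```python
import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")

# Buffer reads on the pyarrow side so that pandas' many small
# read() calls are served from memory rather than S3 round trips.
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,  # arbitrary 4 MiB read buffer
)
year_2016_df = pd.read_csv(fhandler, nrows=100)
```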
When running it with just pandas (which uses `s3fs` under the hood), it's much
faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
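For comparison, this path goes through fsspec/s3fs, whose file objects do read-ahead buffering in `block_size` chunks by default. A roughly equivalent explicit version (the `block_size` value is just for illustration):
```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# s3fs buffers reads in block_size chunks, so pandas' small
# read() calls are mostly served from the local read-ahead cache.
with fs.open(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    block_size=5 * 1024 * 1024,  # illustrative 5 MiB read-ahead
) as f:
    year_2016_df = pd.read_csv(f, nrows=100)
```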
Surprisingly, when we use `fsspec`'s `ArrowFSWrapper` around the same
`S3FileSystem`, it matches the s3fs performance:
```python
import pandas as pd
import time

from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0002467632293701172
Time to create fhandler: 0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
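This seems consistent with the buffering theory above: the fsspec wrapper presumably adds buffering on top of the raw pyarrow stream. If so, layering pyarrow's own `BufferedInputStream` over the handle should give a similar speedup; a sketch (untested, buffer size arbitrary):
```python
import pandas as pd
import pyarrow as pa
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
raw = fs.open_input_file(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
)

# Wrap the unbuffered S3 file in an in-memory buffer so the small
# reads issued by pandas don't each turn into an S3 request.
buffered = pa.BufferedInputStream(raw, buffer_size=4 * 1024 * 1024)
year_2016_df = pd.read_csv(buffered, nrows=100)
```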
Packages:
```
pyarrow : 7.0.0
pandas  : 1.4.2
numpy   : 1.20.3
```
I tested with pyarrow 4.0.1 and 5.0.0 as well and saw similar results.
> Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
> -----------------------------------------------------------------------------------
>
> Key: ARROW-16272
> URL: https://issues.apache.org/jira/browse/ARROW-16272
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 4.0.1, 5.0.0, 7.0.0
> Environment: macOS 12.1
> MacBook Pro
> Intel x86
> Reporter: Sahil Gupta
> Priority: Major
> Labels: S3FileSystem, csv, pandas, s3
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)