[
https://issues.apache.org/jira/browse/ARROW-16272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Gupta updated ARROW-16272:
--------------------------------
Description:
`pyarrow.fs.S3FileSystem.open_input_file` and
`pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with
pandas' `read_csv`.
```python
import pandas as pd
import time

from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )
    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0003612041473388672
Time to create fhandler: 0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.
We get similar performance with `fs.open_input_stream` as well (commented out
in the code above):
```shell
Running...
Time to create fs: 0.0002570152282714844
Time to create fhandler: 0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
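One plausible explanation (speculation on my part) is that `pd.read_csv` issues many small `read()` calls on the handle, and the raw pyarrow S3 file is unbuffered, so each small read can turn into a separate S3 request. If that is the cause, asking pyarrow to buffer the stream should help; a minimal sketch using the documented `buffer_size` keyword of `open_input_stream` (the 4 MiB value is an arbitrary guess, untested here):
```python
import pandas as pd
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")

# Buffer reads on the pyarrow side so that pandas' many small
# read() calls are served from memory rather than S3 round trips.
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,  # arbitrary 4 MiB read buffer
)
year_2016_df = pd.read_csv(fhandler, nrows=100)
```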
When running it with just pandas (which uses `s3fs` under the hood), it's much
faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```
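For comparison, this path goes through fsspec/s3fs, whose file objects do read-ahead buffering in `block_size` chunks by default. A roughly equivalent explicit version (the `block_size` value is just for illustration):
```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=True)

# s3fs buffers reads in block_size chunks, so pandas' small
# read() calls are mostly served from the local read-ahead cache.
with fs.open(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    block_size=5 * 1024 * 1024,  # illustrative 5 MiB read-ahead
) as f:
    year_2016_df = pd.read_csv(f, nrows=100)
```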
Surprisingly, when we use `fsspec`'s `ArrowFSWrapper` around the same
`S3FileSystem`, it matches the s3fs performance:
```python
import pandas as pd
import time

from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )
    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs: 0.0002467632293701172
Time to create fhandler: 0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```
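This seems consistent with the buffering theory above: the fsspec wrapper presumably adds buffering on top of the raw pyarrow stream. If so, layering pyarrow's own `BufferedInputStream` over the handle should give a similar speedup; a sketch (untested, buffer size arbitrary):
```python
import pandas as pd
import pyarrow as pa
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
raw = fs.open_input_file(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
)

# Wrap the unbuffered S3 file in an in-memory buffer so the small
# reads issued by pandas don't each turn into an S3 request.
buffered = pa.BufferedInputStream(raw, buffer_size=4 * 1024 * 1024)
year_2016_df = pd.read_csv(buffered, nrows=100)
```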
Packages:
```
pyarrow : 7.0.0
pandas  : 1.4.2
numpy   : 1.20.3
```
I tested with pyarrow 4.0.1 and 5.0.0 as well and saw similar results.
> Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`
> -----------------------------------------------------------------------------------
>
> Key: ARROW-16272
> URL: https://issues.apache.org/jira/browse/ARROW-16272
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 4.0.1, 5.0.0, 7.0.0
> Environment: macOS 12.1
> MacBook Pro
> Intel x86
> Reporter: Sahil Gupta
> Priority: Major
> Labels: S3FileSystem, csv, pandas, s3
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)