[
https://issues.apache.org/jira/browse/ARROW-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
mimoune djouallah updated ARROW-17679:
--------------------------------------
Description:
I am using pyarrow and duckdb to query some Parquet files on GCP. Thanks for
making the experience so smooth, but I have a performance issue; see the code
used below.
{code:python}
import pyarrow.dataset as ds
import duckdb
import json

lineitem = ds.dataset("gs://xxxxx/lineitem")
lineitem_partition = ds.dataset("gs://xxxx/yyy", format="parquet",
                                partitioning="hive")
lineitem_180 = ds.dataset("gs://xxxxx/lineitem_180", format="parquet",
                          partitioning="hive")

con = duckdb.connect()
con.register("lineitem", lineitem)
con.register("lineitem_partition", lineitem_partition)
con.register("lineitem_180", lineitem_180)

# HTTP handler: the request body carries the SQL text in the 'name' field.
def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    return json.dumps(df.to_json(orient="records")), 200, \
        {'Content-Type': 'application/json'}
{code}
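For context, here is an example of the kind of call made against the registered
datasets (the SQL text is hypothetical and shown only to illustrate how the
tables are addressed):
{code:python}
# Hypothetical query against the registered "lineitem" dataset; duckdb scans
# the underlying pyarrow dataset and returns a pandas DataFrame.
result = con.execute(
    "SELECT l_returnflag, count(*) AS cnt FROM lineitem GROUP BY l_returnflag"
).df()
print(result)
{code}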
The issue is that I am getting extremely slow throughput, around 30 MB per
second, while the same files on a local SSD laptop read extremely fast. I am
not sure what the issue is; I tried the same query with pyarrow compute instead
of duckdb and the performance is the same.
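For what it is worth, below is a minimal tuning sketch I have been considering;
the bucket path is the same placeholder as above, and whether these options
actually help on GCS is an assumption on my part:
{code:python}
import time
import pyarrow as pa
import pyarrow.dataset as ds

# Assumption: more IO threads may increase read parallelism against GCS.
pa.set_io_thread_count(16)

# Assumption: pre-buffering coalesces many small column-chunk reads into
# fewer, larger requests, which may matter on high-latency object storage.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)

lineitem = ds.dataset("gs://xxxxx/lineitem", format=parquet_format)

# Rough throughput check: time a full scan and divide bytes read by seconds.
start = time.time()
table = lineitem.to_table()
elapsed = time.time() - start
print(f"{table.nbytes / elapsed / 2**20:.1f} MB/s")
{code}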
> slow performance when reading data from GCP
> -------------------------------------------
>
> Key: ARROW-17679
> URL: https://issues.apache.org/jira/browse/ARROW-17679
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Reporter: mimoune djouallah
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)