FelixYBW commented on PR #5447:
URL:
https://github.com/apache/incubator-gluten/pull/5447#issuecomment-2106536617
> TPC-H SF2000 Q6 performance, query: `select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01' and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24`
>
> lineitem data: 622G
>
> | csv gluten without native reader | csv gluten native csv reader |
> | --- | --- |
> | 8333.039907 | 2456 |
>
> Test script:
>
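> ```
> import org.apache.spark.sql.types._
>
> // Explicit schema for lineitem; it must match the Arrow schema of the CSV files.
> val schema = new StructType().
>   add("l_orderkey", LongType).add("l_partkey", LongType).
>   add("l_suppkey", LongType).add("l_linenumber", LongType).
>   add("l_quantity", DoubleType).add("l_extendedprice", DoubleType).
>   add("l_discount", DoubleType).add("l_tax", DoubleType).
>   add("l_returnflag", StringType).add("l_linestatus", StringType).
>   add("l_shipdate", DateType).add("l_commitdate", DateType).
>   add("l_receiptdate", DateType).add("l_shipinstruct", StringType).
>   add("l_shipmode", StringType).add("l_comment", StringType)
>
> // Read the CSV data with the explicit schema and expose it to SQL as `lineitem`.
> val lineitem = spark.read.format("csv").option("header", "true").schema(schema).
>   load("file:///mnt/DP_disk2/tpch/csvdata/")
> lineitem.createOrReplaceTempView("lineitem")
>
> // TPC-H Q6, as quoted above.
> val q6 = """select sum(l_extendedprice * l_discount) as revenue from lineitem
>   where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01'
>   and l_discount between .06 - 0.01 and .06 + 0.01
>   and l_quantity < 24"""
> spark.sql(q6).show()
> ```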
>
> Note: the file schema should match the Arrow schema, so we should specify it explicitly with `.schema(arrow_matched_schema)`.

Thank you, Chengcheng. What is the vanilla Spark performance in this case, and how many task threads did you use?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]