FelixYBW commented on PR #5447:
URL: https://github.com/apache/incubator-gluten/pull/5447#issuecomment-2106536617

   > TPC-H SF2000 Q6 performance, query: `select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01' and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24`
   > 
   > lineitem data: 622G
   > 
   > | CSV, Gluten without native reader | CSV, Gluten with native CSV reader |
   > | --- | --- |
   > | 8333.039907 | 2456 |
   > 
   > Test script:
   > 
   > ```
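   > // In spark-shell, enter this with :paste so the multi-line chains parse as one expression.
   > import org.apache.spark.sql.types._
   > 
   > val schema = new StructType()
   >   .add("l_orderkey", LongType).add("l_partkey", LongType)
   >   .add("l_suppkey", LongType).add("l_linenumber", LongType)
   >   .add("l_quantity", DoubleType).add("l_extendedprice", DoubleType)
   >   .add("l_discount", DoubleType).add("l_tax", DoubleType)
   >   .add("l_returnflag", StringType).add("l_linestatus", StringType)
   >   .add("l_shipdate", DateType).add("l_commitdate", DateType)
   >   .add("l_receiptdate", DateType).add("l_shipinstruct", StringType)
   >   .add("l_shipmode", StringType).add("l_comment", StringType)
   > 
   > val lineitem = spark.read.format("csv")
   >   .option("header", "true")
   >   .schema(schema)
   >   .load("file:///mnt/DP_disk2/tpch/csvdata/")
   > // Register the DataFrame so the query can resolve `lineitem`
   > // (assumed; this step is not shown in the original script).
   > lineitem.createOrReplaceTempView("lineitem")
   > // q6 holds the query string shown above; an action such as show() triggers execution.
   > spark.sql(q6).show()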
   > ```
   > 
   > Note: the file schema must match the Arrow schema, so specify it explicitly with `.schema(arrow_matched_schema)`.
   
   Thank you, Chengcheng. What is the vanilla Spark performance in this case, and how many task threads did you use?
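
For reference, the explicit schema from the quoted test script could also be written as a DDL string, which is easier to compare column by column against the Arrow schema. This is only a sketch: it assumes Spark SQL is on the classpath, and the name `lineitemDdl` is made up for illustration and is not part of the PR.

```scala
import org.apache.spark.sql.types._

// Hypothetical DDL form of the lineitem schema used in the quoted test script.
// Column names and types mirror the StructType built there.
val lineitemDdl =
  """l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber BIGINT,
    |l_quantity DOUBLE, l_extendedprice DOUBLE, l_discount DOUBLE, l_tax DOUBLE,
    |l_returnflag STRING, l_linestatus STRING,
    |l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
    |l_shipinstruct STRING, l_shipmode STRING, l_comment STRING""".stripMargin

// Parse the DDL into a StructType and print it for inspection.
val schema = StructType.fromDDL(lineitemDdl)
schema.printTreeString()
```

The resulting `schema` can be passed to `spark.read.format("csv").schema(schema)` in the same way as in the quoted script.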


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]