whcdjj commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1445021515
> My test is `spark.sql("select * from store_sales order by ss_customer_sk limit 10")`, where `store_sales` is a 1 TB TPC-DS table. The parquet-io and parquet-process thread pools are hardcoded like this in `ParquetReadOptions.java`:
>
> ```java
> public static class Builder {
>   protected ExecutorService ioThreadPool = Executors.newFixedThreadPool(4);
>   protected ExecutorService processThreadPool = Executors.newFixedThreadPool(4);
> }
> ```
>
> I also ran the following test against a local filesystem using a 100 GB TPC-DS `store_sales` table, and I see a regression with the async IO feature:
>
> ```scala
> test("parquet reader select") {
>   val sc = SparkSession.builder().master("local[4]").getOrCreate()
>   val df = sc.read.parquet("file:///D:\\work\\test\\tpcds\\store_sales")
>   df.createOrReplaceTempView("table")
>   val start = System.currentTimeMillis()
>   sc.sql("select * from table order by ss_customer_sk limit 10").show()
>   val end = System.currentTimeMillis()
>   System.out.println("time: " + (end - start))
> }
> ```
>
> without this feature -> time: 7240
> with this feature -> time: 19923
>
> The threads are running as expected:
>
> ![image](https://user-images.githubusercontent.com/87682445/221344885-a49cc5f2-9eba-4f0d-bc06-7615416d5b02.png)
>
> Where did I go wrong, and can you show me the correct way to use this feature?
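For context on the quoted `Builder` fields: below is a minimal, self-contained sketch (plain `java.util.concurrent`, deliberately *not* the parquet-mr API) of what those two hardcoded `Executors.newFixedThreadPool(4)` pools amount to, with the size derived from the machine rather than the constant 4. The class name `PoolSizing` and the `poolThreads()` helper are hypothetical illustrations; whether larger pools actually help the async reader in this benchmark is exactly the open question above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not parquet-mr code: mirrors the quoted Builder's two
// fixed pools, but sizes them from the host instead of hardcoding 4 threads.
public class PoolSizing {

    // Never go below the hardcoded default of 4 threads per pool.
    static int poolThreads() {
        return Math.max(4, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService ioPool = Executors.newFixedThreadPool(poolThreads());
        ExecutorService processPool = Executors.newFixedThreadPool(poolThreads());
        try {
            // Stand-ins for the reader's IO and decode work.
            ioPool.submit(() ->
                System.out.println("io task on " + Thread.currentThread().getName()));
            processPool.submit(() ->
                System.out.println("process task on " + Thread.currentThread().getName()));
        } finally {
            // Orderly shutdown so the JVM can exit; the hardcoded pools in the
            // quoted Builder are shared, so their lifecycle matters too.
            ioPool.shutdown();
            processPool.shutdown();
            ioPool.awaitTermination(5, TimeUnit.SECONDS);
            processPool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```

With `master("local[4]")`, Spark itself only runs 4 task threads, so two extra 4-thread pools per reader can oversubscribe a small machine; that contention is one plausible (unconfirmed) explanation for the slowdown reported above.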