whcdjj commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1445021515

   > My test is `spark.sql("select * from store_sales order by ss_customer_sk limit 10")`; store_sales is a 1TB TPC-DS table.
   
   The parquet-io and parquet-process thread pools are hardcoded to 4 threads in ParquetReadOptions.java:

   public static class Builder {
       protected ExecutorService ioThreadPool = Executors.newFixedThreadPool(4);
       protected ExecutorService processThreadPool = Executors.newFixedThreadPool(4);
   }
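   One possible workaround, sketched below, would be to size these pools from the host's core count instead of the hardcoded 4. This is only a sketch: I have not verified whether this PR's Builder exposes setters for these fields, and the `sizedPool` helper and thread-name prefix are my own, using only java.util.concurrent:

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ThreadFactory;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;

   public class PoolSizing {
       // Hypothetical helper: build a fixed pool sized to the host
       // (but at least 4 threads) with named daemon threads, instead
       // of the hardcoded Executors.newFixedThreadPool(4).
       static ExecutorService sizedPool(String name) {
           int threads = Math.max(4, Runtime.getRuntime().availableProcessors());
           AtomicInteger count = new AtomicInteger();
           ThreadFactory factory = r -> {
               Thread t = new Thread(r, name + "-" + count.incrementAndGet());
               t.setDaemon(true); // do not block JVM shutdown
               return t;
           };
           return Executors.newFixedThreadPool(threads, factory);
       }

       public static void main(String[] args) throws Exception {
           ExecutorService io = sizedPool("parquet-io");
           // Submit a task just to show the naming scheme in action.
           io.submit(() -> System.out.println("thread: " + Thread.currentThread().getName()));
           io.shutdown();
           io.awaitTermination(5, TimeUnit.SECONDS);
       }
   }
   ```

   If the Builder fields really are only assignable via defaults, Spark would have no way to tune them per workload, which might itself explain part of the mismatch between core count and pool size here.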
   
   I also ran the following test against the local filesystem using the 100GB TPC-DS store_sales table, and I see a regression with the async I/O feature enabled:

   test("parquet reader select") {
       val sc = SparkSession.builder().master("local[4]").getOrCreate()
       val df = sc.read.parquet("file:///D:\\work\\test\\tpcds\\store_sales")
       df.createOrReplaceTempView("table")
       val start = System.currentTimeMillis()
       sc.sql("select * from table order by ss_customer_sk limit 10").show()
       val end = System.currentTimeMillis()
       System.out.println("time: " + (end - start))
   }

   without this feature -> time: 7240 ms
   with this feature -> time: 19923 ms
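   A caveat on my own numbers: a single wall-clock measurement on a fresh JVM also includes Spark planning and JIT warm-up, so the gap may be noisier than it looks. A more careful harness (a generic sketch, not this PR's code; the workload below is a stand-in for the Spark query) would warm up first and report a median over several runs:

   ```java
   import java.util.Arrays;

   public class Bench {
       // Run the task a few times untimed (warm-up), then time several
       // runs and return the median, which is less noisy than one sample.
       static long medianMillis(Runnable task, int warmups, int runs) {
           for (int i = 0; i < warmups; i++) task.run();
           long[] times = new long[runs];
           for (int i = 0; i < runs; i++) {
               long start = System.nanoTime();
               task.run();
               times[i] = (System.nanoTime() - start) / 1_000_000;
           }
           Arrays.sort(times);
           return times[runs / 2];
       }

       public static void main(String[] args) {
           // Stand-in workload; in the real test this would be the Spark query.
           Runnable task = () -> {
               long sum = 0;
               for (int i = 0; i < 1_000_000; i++) sum += i;
               if (sum < 0) System.out.println(sum); // keep loop from being optimized away
           };
           long ms = medianMillis(task, 3, 5);
           System.out.println("median time: " + ms + " ms");
           System.out.println(ms >= 0);
       }
   }
   ```

   Even with warm-up, though, a 7.2s-to-19.9s difference is large enough that it likely reflects a real interaction between the fixed 4-thread pools and local[4].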
   
   The threads appear as expected:
   
![image](https://user-images.githubusercontent.com/87682445/221344885-a49cc5f2-9eba-4f0d-bc06-7615416d5b02.png)
   
   Where did I go wrong, and can you show me the correct way to use this feature?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
