whcdjj commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1445021515
> My test is spark.sql("select * from store_sales order by ss_customer_sk
limit 10"); store_sales is a table from the 1 TB TPC-DS dataset.
The parquet-io and parquet-process thread pools are hardcoded like this in
ParquetReadOptions.java:

    public static class Builder {
      protected ExecutorService ioThreadPool = Executors.newFixedThreadPool(4);
      protected ExecutorService processThreadPool = Executors.newFixedThreadPool(4);
    }
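Since both pool sizes are fixed at 4, one thing worth trying is sizing the
pools to the machine instead. The sketch below is only an illustration using
plain java.util.concurrent (the newIoPool helper and its thread-naming scheme
are my own invention, not parquet-mr API); named daemon threads also make the
pools easy to spot in a thread dump:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.atomic.AtomicInteger;

    public class PoolSizing {
        // Hypothetical helper: build a fixed pool sized to the machine rather
        // than the hardcoded 4, with named daemon threads for easier profiling.
        static ExecutorService newIoPool(String name, int size) {
            AtomicInteger n = new AtomicInteger();
            return Executors.newFixedThreadPool(size, r -> {
                Thread t = new Thread(r, name + "-" + n.incrementAndGet());
                t.setDaemon(true);
                return t;
            });
        }

        public static void main(String[] args) throws Exception {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService io = newIoPool("parquet-io", cores);
            // Dummy task stands in for an async read-ahead request.
            Future<Integer> f = io.submit(() -> 21 * 2);
            System.out.println(f.get()); // prints 42
            io.shutdown();
        }
    }

Whether such a pool can actually be injected depends on the builder exposing
a setter for these protected fields, which I have not verified.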
I also ran the following test against a local filesystem using the 100 GB
TPC-DS store_sales table, and I see a degradation with the async IO feature.
    test("parquet reader select") {
      val sc = SparkSession.builder().master("local[4]").getOrCreate()
      val df = sc.read.parquet("file:///D:\\work\\test\\tpcds\\store_sales")
      df.createOrReplaceTempView("table")
      val start = System.currentTimeMillis()
      sc.sql("select * from table order by ss_customer_sk limit 10").show()
      val end = System.currentTimeMillis()
      System.out.println("time: " + (end - start))
    }
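One caveat with this measurement: it times a single cold run with
System.currentTimeMillis, so JIT warm-up and OS page-cache effects can easily
swamp the difference being measured. A minimal sketch of a more defensible
timing loop (medianMillis is a hypothetical helper, and the summing loop is
just a stand-in for the Spark query):

    import java.util.Arrays;

    public class Timing {
        // Run the workload several times and report the median, so JIT
        // warm-up and caching don't dominate a single cold measurement.
        static long medianMillis(Runnable work, int runs) {
            long[] t = new long[runs];
            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();
                work.run();
                t[i] = (System.nanoTime() - start) / 1_000_000;
            }
            Arrays.sort(t);
            return t[runs / 2];
        }

        public static void main(String[] args) {
            // Dummy workload stands in for the Spark query under test.
            long ms = medianMillis(() -> {
                long s = 0;
                for (int i = 0; i < 1_000_000; i++) s += i;
            }, 5);
            System.out.println("median time: " + ms + " ms");
        }
    }

Comparing medians over several runs (ideally with the page cache in the same
state for both configurations) would make the 7240 vs 19923 gap more trustworthy.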
without this feature -> time: 7240 ms
with this feature -> time: 19923 ms
The threads are running as expected.

Where did I go wrong, and can you show me the correct way to use this
feature?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]