[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-04-12 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1506058505

   FWIW, the Hadoop 3.3.5 vectored I/O changes might make this PR redundant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-03-25 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1483892749

   > @parthchandra Do you have time to resolve the conflicts? I think it would 
be nice to be included in the next release.
   
   Done
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-03-23 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1482127933

   > Sorry to interrupt, just wondering if this PR can work with 
[S3AsyncClient](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html)
 by any chance? @parthchandra
   > 
   > Thanks!
   
   This PR uses the HDFS interface to S3 (s3a) if the URL scheme is s3a. I 
don't think the s3a implementation uses S3AsyncClient. 





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-02-27 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1446796605

   Also try a query like
   ```
   select
     SUM(length(IFNULL(ss_sold_date_sk, ' '))),
     SUM(length(IFNULL(ss_sold_time_sk, ' '))),
     SUM(length(IFNULL(ss_item_sk, ' '))),
     SUM(length(IFNULL(ss_customer_sk, ' '))),
     SUM(length(IFNULL(ss_cdemo_sk, ' '))),
     SUM(length(IFNULL(ss_hdemo_sk, ' '))),
     SUM(length(IFNULL(ss_addr_sk, ' '))),
     SUM(length(IFNULL(ss_store_sk, ' '))),
     SUM(length(IFNULL(ss_promo_sk, ' '))),
     SUM(length(IFNULL(ss_ticket_number, ' '))),
     SUM(ss_quantity),
     SUM(ss_wholesale_cost),
     SUM(ss_list_price),
     SUM(ss_sales_price),
     SUM(ss_ext_discount_amt),
     SUM(ss_ext_sales_price),
     SUM(ss_ext_wholesale_cost),
     SUM(ss_ext_list_price),
     SUM(ss_ext_tax),
     SUM(ss_coupon_amt),
     SUM(ss_net_paid),
     SUM(ss_net_paid_inc_tax),
     SUM(ss_net_profit)
   from store_sales
   ```
   which avoids the expensive sort.





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-02-25 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1445230916

   Looks correct to me. A couple of questions: are you running this on a 
cluster or on a local system? Also, is the data on SSDs? If you are on a 
single machine, there might not be enough CPU, and the async threads may be 
contending for time with the processing threads. We'll need some profiling 
info to get a better diagnosis. Also, with SSDs, reads are so fast from the 
file system that the async feature might show very little improvement. 
   You could turn on the debug logging level for FilePageReader and 
AsyncMultibufferInputStream.
   We can continue this in a thread outside this PR. 
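   For reference, enabling debug output for those two classes in a log4j 1.x 
properties file might look like the fragment below. This is a sketch: the 
package names are my assumption about where the classes live in parquet-mr, 
so check them against the PR and adapt to your logging backend.

```
# Hypothetical log4j.properties fragment; package names are assumed
log4j.logger.org.apache.parquet.hadoop.FilePageReader=DEBUG
log4j.logger.org.apache.parquet.hadoop.AsyncMultiBufferInputStream=DEBUG
```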
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-02-24 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1444106912

   > Hi, I am very interested in this optimization and just have some questions 
when testing in a cluster with 4nodes/96 cores using spark3.1. Unfortunately, I 
see little improvement.
   
   You're likely to see improvement in cases where file I/O is the bottleneck. 
Most TPC-DS queries are join-heavy and you will see little improvement there. 
You might do better with TPC-H. 
   
   > I am confused than whether it is necessary to keep 
spark.sql.parquet.enableVectorizedReader = false in spark when testing with 
spark 3.1 and how can I set the parquet buffer size. 
   
   It's probably best to keep the parquet (read) buffer size untouched.
   
   You should keep `spark.sql.parquet.enableVectorizedReader = true` 
irrespective of this. This feature improves the I/O speed of reading raw 
data. The Spark vectorized reader kicks in after data is read from storage, 
converts the raw data into Spark's internal columnar representation, and is 
faster than the row-based version.  
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-25 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1327957394

   @wgtmac  Thank you!





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-23 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1325768325

   Updated. The thread pools are now set through `ParquetReadOptions`.





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-21 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322881480

   Let me verify that merging the thread pools is not dangerous and post an 
update. After that, it'd be great if you can take it.





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-17 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1319411064

   > > IMO, switching `ioThreadPool` and `processThreadPool` the reader 
instance level will make it more flexible.
   > 
   > I've changed the thread pool so that it is not initialized by default but 
I left them as static members. Ideally, there should be a single IO thread pool 
that handles all the IO for a process and the size of the pool is determined by 
the bandwidth of the underlying storage system. Making them per instance is not 
an issue though. The calling code can decide to set the same thread pool for 
all instances and achieve the same result. Let me update this.
   > 
   > Also, any changes you want to make are fine with me, and the help is 
certainly appreciated !
   
   I'm thinking of merging the thread pools into a single ioThreadPool and 
making the ioThreadPool settable through `ParquetReadOptions` (like the 
allocator is). The work being done by the processThreadPool is rather small, 
and maybe we can do away with it. 
   Adding the pool via `ParquetReadOptions`  makes it easier to use with 
`ParquetReader` (used a lot in unit tests).
   WDYT?





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-11-17 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1319040697

   @wgtmac thank you for looking at this. I don't have any more TODOs on this 
PR. 
   
   > * Adopt the incoming Hadoop vectored io api.
   
   This should be part of another PR. There is a draft PR (#999) open for 
this. Once that is merged in, I can revisit the async I/O code and 
incorporate the vectored I/O API. 
   In other experiments I have seen that async I/O gives better results over 
slower networks. With faster network connections, as is the case where we are 
reading from S3 within an AWS environment, reading in parallel (as the 
vectored I/O API does) starts to give better results. 
   I believe both should be available as options. 
   
   > * Benchmark against remote object stores from different cloud providers.
   
   The numbers I posted earlier were for reading from AWS/S3 over a 1 Gbps 
line. Reading from within AWS shows less improvement. I don't have an account 
with other cloud providers; any help here would be appreciated. 
   
   > IMO, switching `ioThreadPool` and `processThreadPool` the reader instance 
level will make it more flexible.
   
   I've changed the thread pool so that it is not initialized by default, but 
I left them as static members. Ideally, there should be a single IO thread 
pool that handles all the IO for a process, with the size of the pool 
determined by the bandwidth of the underlying storage system. 
   Making them per instance is not an issue though. The calling code can 
decide to set the same thread pool for all instances and achieve the same 
result. 
   Let me update this. 
   
   Also, any changes you want to make are fine with me, and the help is 
certainly appreciated !
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-06-20 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1160693399

   @shangxinli Thank you for the review! I'll address these comments asap.
   I am reviewing the thread pool and its initialization. IMO, it is better 
if there is no default initialization of the pool and the calling 
application/framework does so explicitly. One side effect of the default 
initialization is that the pool is created unnecessarily even if async is 
off. Also, if an application shades and includes another copy of the library 
(or, transitively, many more), then one more thread pool gets created for 
every version of the library included. 
   It is probably a better idea to allow the thread pool to be assigned as a 
per-instance variable. The calling application can then decide to use a 
single pool for all instances, or a new one per instance, whichever is better 
for their performance.
   Finally, some large scale testing has revealed a possible resource leak. I'm 
looking into addressing it. 
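   The per-instance idea above can be sketched with plain executors. The class 
and method names below are hypothetical, not parquet-mr's API: the point is 
that the application injects the pool, so it can share one pool across all 
readers or give each reader its own, and no pool is ever created by default.

```java
import java.util.concurrent.*;

// Sketch only: names are hypothetical, not parquet-mr's actual classes.
// No pool is created by default; the caller injects one and may share a
// single pool across readers, or give each reader its own.
class AsyncReader implements AutoCloseable {
  private final ExecutorService ioPool; // injected, never created by default

  AsyncReader(ExecutorService ioPool) {
    this.ioPool = ioPool;
  }

  Future<Integer> readColumnChunk(int column) {
    // Stand-in for a real column-chunk read
    return ioPool.submit(() -> column * 100);
  }

  @Override public void close() { /* does NOT shut down the injected pool */ }
}

public class PoolSharingDemo {
  public static void main(String[] args) throws Exception {
    // One pool shared by all reader instances, sized by the application
    ExecutorService shared = Executors.newFixedThreadPool(4);
    try (AsyncReader r1 = new AsyncReader(shared);
         AsyncReader r2 = new AsyncReader(shared)) {
      System.out.println(r1.readColumnChunk(1).get()
          + r2.readColumnChunk(2).get()); // prints 300
    }
    shared.shutdown();
  }
}
```

   Because the reader never owns the injected pool, shutdown stays with the 
application, which also sidesteps the shaded-copy problem described above.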





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136552664

   > This is interesting, because when I did profiling of Trino, I found that 
although I/O (from S3, over the network no less) was significant, even more 
time was spent in compute. Maybe you're getting improved performance because 
you're increasing _parallelism_ between I/O and compute.
   
   It may be because I was using Spark's vectorized parquet decoding, which is 
an order of magnitude faster than the parquet library's row-by-row decoding 
(see [Spark benchmarks](https://github.com/apache/spark/blob/master/sql/core/benchmarks/DataSourceReadBenchmark-results.txt)). 
If Trino is not doing vectorized decoding (I took a very quick look and I 
don't think it is), I would suggest looking into that next. All the cool kids 
are doing it. 





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136542868

   > So it sounds like you're optimizing one layer of processing, and I'm 
optimizing the next layer up, and it's kindof a coincidence that we're touching 
some of the same classes just because code reuse has been possible here.

   Yeah, kind of cool :)
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136531385

   BTW, I'm adding more tests for the InputStream implementations. 





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528026

   > Latency is the killer; in an HTTP request you want read enough but not 
discard data or break an http connection if the client suddenly does a seek() 
or readFully() somewhere else. file listings, existence checks etc.
   > 
   > That'd be great. now, if you could also handle requesting different 
columns in parallel and processing them out of order.
   
   I do. The Parquet file reader API that reads row groups in sync mode reads 
all columns in sequence. In async mode, it fires off a task for every column, 
blocking only to read the first page of every column before returning. This 
part also uses a different thread pool from the IO tasks, so that IO tasks 
never wait because there are no available threads in the thread pool.
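   As a rough self-contained sketch of that scheme (hypothetical names, plain 
`java.util.concurrent`, not the PR's actual classes): one task per column 
reads pages sequentially into a queue, and the caller blocks only until the 
first page of each column is available.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the idea described above (names hypothetical): one IO task per
// column; the caller blocks only until the FIRST page of each column is
// ready, while the rest of each column keeps streaming in the background.
public class PerColumnReadDemo {
  static List<BlockingQueue<String>> startRowGroupRead(
      ExecutorService ioPool, int columns, int pagesPerColumn) {
    List<BlockingQueue<String>> pageQueues = new ArrayList<>();
    for (int c = 0; c < columns; c++) {
      final int col = c;
      BlockingQueue<String> q = new LinkedBlockingQueue<>();
      pageQueues.add(q);
      ioPool.submit(() -> {          // one task per column, reads sequentially
        for (int p = 0; p < pagesPerColumn; p++) {
          q.add("col" + col + "-page" + p);   // stand-in for a page read
        }
      });
    }
    return pageQueues;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService ioPool = Executors.newFixedThreadPool(3);
    List<BlockingQueue<String>> queues = startRowGroupRead(ioPool, 3, 4);
    // Block only for the first page of each column, as described above
    for (BlockingQueue<String> q : queues) {
      System.out.println("first page ready: " + q.take());
    }
    ioPool.shutdown();
  }
}
```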
   
   > 
   > be good to think about vectored IO.
   
   I think I know how to integrate this PR with the vectored IO, but this is 
only after a cursory look. 
   
   > 
   > and yes, updating parquet dependencies would be good, hadoop 3.3.0 should 
be the baseline.
   
   Who can drive this (presumably) non-trivial change? I myself have no karma 
points :(

   > just sketched out my thoughts on this. I've played with some of this in my 
own branch. I think the next step would be for me to look at the benchmark code 
to make it targetable elsewhere.
   > 
   > 
https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/
   
   This is great. I now have much more context of where you are coming from 
(and going to)!





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136439133

   > Is byte (and arrays and buffers of bytes) the only datatype you support? 
My PR is optimizing code paths that pull ints, longs, and other sizes out of 
the data buffers. Are those not necessary for any of the situations where 
you're using an async buffer?
   
   The input stream API is generally unaware of the datatypes of its contents, 
and so those are the only APIs I use. The other reason is that the 
ParquetFileReader returns Pages, which basically contain metadata and 
ByteBuffers of _compressed_ data. The decompression and decoding into types 
comes much later, in a downstream thread.
   For your PR, I don't think AsyncMultiBufferInputStream is ever going to 
be in play in the paths you're optimizing. But just in case it is, your 
type-aware methods will work as-is, because AsyncMultiBufferInputStream is 
derived from MultiBufferInputStream and will inherit those methods.  
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-24 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136427603

   > thanks., that means you are current with all shipping improvments. the 
main one extra is to use openFile(), passing in length and requesting randomio. 
this guarantees ranged GET requests and cuts the initial HEAD probe for 
existence/size of file.
   
   By `openFile()` do you mean 
`FileSystem.openFileWithOptions(Path, OpenFileParameters)`?
   While looking, I realized that Parquet builds with a [much older version of 
hadoop](https://github.com/apache/parquet-mr/blob/a2da156b251d13bce1fa81eb95b555da04880bc1/pom.xml#L79).
  
   > > > have you benchmarked this change with abfs or google gcs connectors to 
see what difference it makes there?
   > 
   > > No I have not. Would love help from anyone in the community with access 
to these. I only have access to S3.
   > 
   > that I have. FWIW, with the right tuning of abfs prefetch (4 threads, 128 
MB blocks) i can get full FTTH link rate from a remote store; 700 mbit/s . 
that's to the base station. once you add wifi the bottlenecks move.
   
   Wow! That is nearly as fast as a local HDD. At this point the bottlenecks 
in parquet begin to move towards decompression and decoding, but IO remains 
the slowest link in the chain. One thing we get with my PR is that 
ParquetFileReader had built-in assumptions that all data must be read before 
downstream can proceed. Some of my changes remove these assumptions and 
ensure that downstream processing does not block until an entire column is 
read, so we get efficient pipelining. 
   What does the 128 MB block mean? Is this the amount prefetched for a 
stream? The read API does not block until the entire block is filled, I 
presume. 
   With my PR, parquet IO is reading 8MB at a time (default), downstream is 
processing 1MB at a time (default), and several such streams (one per column) 
are in progress at the same time. Hopefully, this read pattern would work 
with the prefetch.





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-23 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135366352

   @steveloughran thank you very much for taking the time to review and provide 
feedback! 
   
   > 1. whose s3 client was used for testing here -if the s3a one, which hadoop 
release?
   
   I was working with s3a:
   - Spark 3.2.1
   - Hadoop (hadoop-aws) 3.3.2
   - AWS SDK 1.11.655
 
   
   > 2. the azure abfs and gcs connectors do async prefetching of the next 
block, but are simply assuming that code will read sequentially; if there is 
another seek/readFully to a new location, those prefetches will be abandoned. 
there is work in s3a to do prefetching here with caching, so as to reduce the 
penalty of backwards seeks. https://issues.apache.org/jira/browse/HADOOP-18028
   
   I haven't worked with abfs or gcs. If the connectors do async prefetching, 
that would be great. Essentially, the time the Parquet reader would have to 
block in the file system API would reduce substantially. In such a case, we 
could turn the async reader on/off and rerun the benchmark to compare. From 
past experience with MapR-FS, which had very aggressive read-ahead in its 
hdfs client, I would still expect better parquet speeds. The fact that the 
prefetch is turned off when a seek occurs is usual behaviour, but we may see 
no benefit from the connector in that case. So a combination of async reader 
and async connector might end up being a great solution (maybe at a slightly 
greater CPU utilization). We would still have to do a benchmark to see the 
real effect.
   The async version in this PR takes care of the sequential read requirement 
by a) opening a new stream for each column and ensuring every column is read 
sequentially. Footers are read using a separate stream. Except for the footer, 
no other stream ever seeks to a new location. b) The amount of data to be read 
is predetermined so there is never a read ahead that is discarded.
   
   > 
   > hadoop is adding a vectored IO api intended for libraries like orc and 
parquet to be able to use, where the application provides an unordered list of 
ranges, a bytebuffer supplier and gets back a list of futures to wait for. the 
base implementation simply reads using readFully APi. s3a (and later abfs) will 
do full async retrieval itself, using the http connection pool. 
https://issues.apache.org/jira/browse/HADOOP-18103
   > 
   > both vectored io and s3a prefetching will ship this summer in hadoop 
3.4.0. i don't see this change conflicting with this, though they may obsolete 
a lot of it.
   
   Yes, I became aware of this recently. I'm discussing integration of these 
efforts in a separate channel. At the moment I see no conflict, but have yet to 
determine how much of this async work would need to be changed. I suspect we 
may be able to eliminate or vastly simplify `AsyncMultiBufferInputStream`. 
   
   > have you benchmarked this change with abfs or google gcs connectors to see 
what difference it makes there?
   
   No I have not. Would love help from anyone in the community with access to 
these. I only have access to S3.
   
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130698923

   @theosib-amazon I applied my PR on top of your PR, ran through some tests 
using Spark, and hit no issues. (All unit tests passed as well.)





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130327546

   > @parthchandra One thing that confuses me a bit is that these buffers have 
only ByteBuffer inside them. There's no actual I/O, so it's not possible to 
block. Do you have subclasses that provide some sort of access to real I/O?
   
   Good point. `MultiBufferInputStream` is constructed using buffers that have 
been filled already. `AsyncMultiBufferInputStream` takes an input stream as a 
parameter in the constructor and performs the IO itself. In 
`ByteBufferInputStream` I added
   ```
   public static ByteBufferInputStream wrapAsync(ExecutorService threadPool,
       SeekableInputStream fileInputStream, List<ByteBuffer> buffers) {
     return new AsyncMultiBufferInputStream(threadPool, fileInputStream, buffers);
   }
   ```





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130270383

   
   > @parthchandra Would you mind having a look at my I/O performance 
optimization plan for ParquetMR? I think we should coordinate, since we have 
some ideas that might overlap what we touch.
   > 
https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing
   
   @theosib-amazon I read your document and went through #960. It looks like, 
for the most part, #960 and this PR complement each other. The overlap I see 
is in the changes to `MultiBufferInputStream` where you have added the 
`readFully`, and `skipFully` APIs. The bulk of my changes for async IO are in a 
class derived from `MultiBufferInputStream` and the heart of the changes 
depends on overriding `MultiBufferInputStream.nextBuffer`. In 
`MultiBufferInputStream.nextBuffer` the assumption is that all the buffers have 
been read into. In `AsyncMultiBufferInputStream.nextBuffer` this assumption is 
removed and the call *blocks* only if the next required buffer has not been 
read into.
   Now, `skipFully` and `readFully` are potentially blocking calls because both 
call `nextBuffer` repeatedly if necessary. To gain maximum pipelining, you 
want to make calls to skipFully and readFully such that you never block for too 
long (or at all) in the call. You will get this if you are skipping or reading 
less than the number of bytes in a single buffer. This is generally the case as 
decompression and decoding is at the page level and that is smaller than the 
size of a single buffer. However, for your optimizations, you should be aware 
of this behaviour.
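   The blocking contract described above can be sketched with a simplified, 
self-contained stand-in (not the PR's actual `AsyncMultiBufferInputStream`): 
buffers are filled asynchronously, and `nextBuffer()` blocks only when the 
next buffer has not been read into yet.

```java
import java.nio.ByteBuffer;
import java.util.*;
import java.util.concurrent.*;

// Simplified stand-in for the nextBuffer() contract described above: buffers
// are filled asynchronously, and nextBuffer() blocks ONLY if the next buffer
// has not been read into yet.
public class AsyncBuffersDemo {
  private final List<CompletableFuture<ByteBuffer>> pending = new ArrayList<>();
  private int next = 0;

  AsyncBuffersDemo(ExecutorService ioPool, int bufferCount) {
    for (int i = 0; i < bufferCount; i++) {
      final int id = i;
      // Stand-in for an async read filling one buffer
      pending.add(CompletableFuture.supplyAsync(
          () -> ByteBuffer.wrap(new byte[] {(byte) id}), ioPool));
    }
  }

  ByteBuffer nextBuffer() {
    // join() returns immediately if the read already completed,
    // and blocks only while the buffer is still in flight
    return pending.get(next++).join();
  }

  public static void main(String[] args) {
    ExecutorService ioPool = Executors.newFixedThreadPool(2);
    AsyncBuffersDemo stream = new AsyncBuffersDemo(ioPool, 3);
    for (int i = 0; i < 3; i++) {
      System.out.println("buffer " + stream.nextBuffer().get(0));
    }
    ioPool.shutdown();
  }
}
```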
   From what I see, I don't think there will be a conflict. I'll pull in your 
PR and give it a deeper look. 
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-18 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130229884

   > Great effort! Will have a look after the build succeeds.
   
   @shangxinli I have no idea how to get the failed CI to pass. These failures 
appear to be in unrelated areas caused by some infra issues. Is there a way to 
trigger a rerun? 
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-17 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1129363998

   I have some numbers from an internal benchmark using Spark. I didn't see any 
benchmarks in the Parquet codebase that I could reuse.
   
   Here are the numbers from my own benchmark:
   
   - 10 runs; each run reads all columns from store_sales (the largest table)
     in the TPC-DS (100G) dataset: `spark.sql("select * from store_sales")`
   - Sync reader with the default 8MB buffer size; async reader with a 1MB
     buffer size (achieves better pipelining)
   - Run on a MacBook Pro, reading from S3. Spark has 6 cores.
   - All times in seconds
   
   | Run | Async | Sync | Async (w/o outliers) | Sync (w/o outliers) |
   | ---:| ---:| ---:| ---:| ---:|
   | 1 | 84 | 102 | - | - |
   | 2 | 90 | 366 | 90 | 366 |
   | 3 | 78 | 156 | - | 156 |
   | 4 | 84 | 128 | 84 | - |
   | 5 | 108 | 402 | - | - |
   | 6 | 90 | 432 | 90 | - |
   | 7 | 84 | 378 | 84 | 378 |
   | 8 | 108 | 324 | - | 324 |
   | 9 | 90 | 318 | 90 | 318 |
   | 10 | 90 | 282 | 90 | 282 |
   | Average | 90.6 | 288.8 | 88 | 304 |
   | Median | 90 | 321 | **90** | **321** |
   | StdDev | 9.98 | 119. | | |
   
   
   After removing the two highest and two lowest runs for each case, and taking 
the median value:
   
   Async: 90 sec
   Sync: 321 sec
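   The trimming described above can be reproduced with a short sketch (run 
data taken from the table; class and method names are mine, for illustration 
only):

```java
import java.util.*;

// Reproduces the outlier trimming described above: drop the two highest and
// two lowest runs, then take the median of what remains.
public class TrimmedMedian {
  static double trimmedMedian(int[] runs) {
    int[] sorted = runs.clone();
    Arrays.sort(sorted);
    // Drop the two lowest and two highest values
    int[] kept = Arrays.copyOfRange(sorted, 2, sorted.length - 2);
    int n = kept.length;
    return n % 2 == 1 ? kept[n / 2] : (kept[n / 2 - 1] + kept[n / 2]) / 2.0;
  }

  public static void main(String[] args) {
    int[] async = {84, 90, 78, 84, 108, 90, 84, 108, 90, 90};
    int[] sync  = {102, 366, 156, 128, 402, 432, 378, 324, 318, 282};
    System.out.println("Async: " + trimmedMedian(async)); // 90.0
    System.out.println("Sync: "  + trimmedMedian(sync));  // 321.0
  }
}
```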
   
   





[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2022-05-17 Thread GitBox


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1129106516

   Anyone know why the CI checks are failing with a SocketTimeout exception, 
and what to do to address this? 

