[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1506058505 FWIW the hadoop 3.3.5 vector io changes might make this PR redundant. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1483892749 > @parthchandra Do you have time to resolve the conflicts? I think it would be nice to be included in the next release. Done
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1482127933 > Sorry to interrupt, just wondering if this PR can work with [S3AsyncClient](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/S3AsyncClient.html) by any chance? @parthchandra > > Thanks! This PR uses the hdfs interface to s3 (s3a) if the url is s3a. I don't think that the s3a implementation uses any of the implementations of S3AsyncClient.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1446796605 Also try a query like
```
select SUM(length(IFNULL(ss_sold_date_sk, ' '))), SUM(length(IFNULL(ss_sold_time_sk, ' '))),
       SUM(length(IFNULL(ss_item_sk, ' '))), SUM(length(IFNULL(ss_customer_sk, ' '))),
       SUM(length(IFNULL(ss_cdemo_sk, ' '))), SUM(length(IFNULL(ss_hdemo_sk, ' '))),
       SUM(length(IFNULL(ss_addr_sk, ' '))), SUM(length(IFNULL(ss_store_sk, ' '))),
       SUM(length(IFNULL(ss_promo_sk, ' '))), SUM(length(IFNULL(ss_ticket_number, ' '))),
       SUM(ss_quantity), SUM(ss_wholesale_Cost), SUM(ss_list_price), SUM(ss_sales_price),
       SUM(ss_ext_discount_amt), SUM(ss_ext_sales_price), SUM(ss_ext_wholesale_cost),
       SUM(ss_ext_list_price), SUM(ss_ext_tax), SUM(ss_coupon_amt), SUM(ss_net_paid),
       SUM(ss_net_paid_inc_tax), SUM(ss_net_profit)
from store_sales
```
which avoids the expensive sort
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1445230916 Looks correct to me. A couple of questions: are you running this on a cluster or on a local system? Also, is the data on SSDs? If you are on a single machine, there might not be enough CPU and the async threads may be contending for time with the processing threads. We'll need some profiling info to get a better diagnosis. Also, with SSDs, reads are so fast from the file system that the async feature might show very little improvement. You could turn on the debug logging level for FilePageReader and AsyncMultiBufferInputStream. We can continue this in a thread outside this PR.
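(For reference, turning on that debug logging with a log4j 1.x properties file might look like the snippet below. The package prefixes are assumptions for illustration; check which packages `FilePageReader` and `AsyncMultiBufferInputStream` actually live in for your build.)

```properties
# Assumed package locations -- adjust to your parquet-mr version
log4j.logger.org.apache.parquet.hadoop.FilePageReader=DEBUG
log4j.logger.org.apache.parquet.bytes.AsyncMultiBufferInputStream=DEBUG
```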
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1444106912 > Hi, I am very interested in this optimization and just have some questions when testing in a cluster with 4nodes/96 cores using spark3.1. Unfortunately, I see little improvement. You're likely to see improvement in cases where file i/o is the bottleneck. Most TPC-DS queries are join heavy and you will see little improvement there. You might do better with TPC-H. > I am confused than whether it is necessary to keep spark.sql.parquet.enableVectorizedReader = false in spark when testing with spark 3.1 and how can I set the parquet buffer size. It's probably best to keep the parquet (read) buffer size untouched. You should keep `spark.sql.parquet.enableVectorizedReader = true` irrespective of this. This feature improves I/O speed of reading raw data. The Spark vectorized reader kicks in after data is read from storage and converts the raw data into Spark's internal columnar representation and is faster than the row based version.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1327957394 @wgtmac Thank you!
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1325768325 Updated. The thread pools are now set thru ParquetReadOptions.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1322881480 Let me verify that merging the thread pools is not dangerous and post an update. After that, it'd be great if you can take it.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1319411064 > > IMO, switching `ioThreadPool` and `processThreadPool` to the reader instance level will make it more flexible. > > I've changed the thread pool so that it is not initialized by default but I left them as static members. Ideally, there should be a single IO thread pool that handles all the IO for a process and the size of the pool is determined by the bandwidth of the underlying storage system. Making them per instance is not an issue though. The calling code can decide to set the same thread pool for all instances and achieve the same result. Let me update this. > > Also, any changes you want to make are fine with me, and the help is certainly appreciated! I'm thinking of merging the thread pools into a single ioThreadPool and making the ioThreadPool settable thru `ParquetReadOptions` (like the allocator is). The work being done by the processThreadPool is rather small and maybe we can do away with it. Adding the pool via `ParquetReadOptions` makes it easier to use with `ParquetReader` (used a lot in unit tests). WDYT?
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1319040697 @wgtmac thank you for looking at this. I don't have any more TODOs on this PR. > * Adopt the incoming Hadoop vectored io api. This should be part of another PR. There is a draft PR (#999) open for this. Once that is merged in, I can revisit the async I/O code and incorporate the vectored io api. In other experiments I have seen that async io gives better results over slower networks. With faster network connections, as is the case where we are reading from S3 within an AWS environment, reading in parallel (as the vector io api does) starts to give better results. I believe that both should be available as options. > * Benchmark against remote object stores from different cloud providers. The numbers I posted earlier were for reading from AWS/S3 over a 1 Gbps line. Reading from within AWS shows less improvement. I don't have an account with other cloud providers. Any help here would be appreciated. > IMO, switching `ioThreadPool` and `processThreadPool` to the reader instance level will make it more flexible. I've changed the thread pool so that it is not initialized by default but I left them as static members. Ideally, there should be a single IO thread pool that handles all the IO for a process and the size of the pool is determined by the bandwidth of the underlying storage system. Making them per instance is not an issue though. The calling code can decide to set the same thread pool for all instances and achieve the same result. Let me update this. Also, any changes you want to make are fine with me, and the help is certainly appreciated!
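(A minimal sketch of the "one process-wide IO pool" idea discussed above. The class and method names here are illustrative assumptions, not the PR's API; the point is that the pool is created once, sized by storage bandwidth rather than core count, and the same instance is handed to every reader.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper: size a single shared IO pool from a rough estimate
// of link bandwidth and per-thread throughput, then let every reader
// instance share it.
public class SharedIoPool {
    // how many IO threads are needed to saturate the link, given an assumed
    // per-thread throughput (ceiling division, at least one thread)
    static int poolSize(int linkMBps, int perThreadMBps) {
        return Math.max(1, (linkMBps + perThreadMBps - 1) / perThreadMBps);
    }

    // one pool for the whole process; total IO concurrency stays bounded
    // no matter how many readers are open
    public static final ExecutorService IO_POOL =
        Executors.newFixedThreadPool(poolSize(1000, 125)); // ~1 Gbps link
}
```

The calling code would then pass `SharedIoPool.IO_POOL` to each reader it opens, or a fresh pool per instance if isolation matters more than bounding total threads.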
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1160693399 @shangxinli Thank you for the review! I'll address these comments asap. I am reviewing the thread pool and its initialization. IMO, it is better if there is no default initialization of the pool and the calling application/framework does so explicitly. One side effect of the default initialization is that the pool is created unnecessarily even if async is off. Also, if an application shades and includes another copy of the library (or transitively, many more), then one more thread pool gets created for every version of the library included. It is probably a better idea to allow the thread pool to be assigned as a per instance variable. The calling application can then decide to use a single pool for all instances or a new one per instance, whichever use case is better for their performance. Finally, some large scale testing has revealed a possible resource leak. I'm looking into addressing it.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136552664 > This is interesting, because when I did profiling of Trino, I found that although I/O (from S3, over the network no less) was significant, even more time was spent in compute. Maybe you're getting improved performance because you're increasing _parallelism_ between I/O and compute. It may be because I was using Spark's vectorized parquet decoding which is an order of magnitude faster than parquet library's row by row decoding (see [Spark benchmarks](https://github.com/apache/spark/blob/master/sql/core/benchmarks/DataSourceReadBenchmark-results.txt)). If Trino is not doing vectorized decoding (I took a very quick look and I don't think it is), I would suggest you look into that next. All the cool kids are doing it.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136542868 > So it sounds like you're optimizing one layer of processing, and I'm optimizing the next layer up, and it's kindof a coincidence that we're touching some of the same classes just because code reuse has been possible here. Yeah, kind of cool :)
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136531385 BTW, adding more tests for the InputStream implementations.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136528026 > Latency is the killer; in an HTTP request you want read enough but not discard data or break an http connection if the client suddenly does a seek() or readFully() somewhere else. file listings, existence checks etc. > > That'd be great. now, if you could also handle requesting different columns in parallel and processing them out of order. I do. The Parquet file reader api that reads row groups in sync mode reads all columns in sequence. In async mode, it fires off a task for every column blocking only to read the first page of every column before returning. This part also uses a different thread pool from the IO tasks so that IO tasks never wait because there are no available threads in the thread pool. > > be good to think about vectored IO. I think I know how to integrate this PR with the vectored IO, but this is only after a cursory look. > > and yes, updating parquet dependencies would be good, hadoop 3.3.0 should be the baseline. Who can drive this (presumably) non-trivial change? I myself have no karma points :( > just sketched out my thoughts on this. I've played with some of this in my own branch. I think the next step would be for me to look at the benchmark code to make it targetable elsewhere. > > https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/ This is great. I now have much more context of where you are coming from (and going to)!
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136439133 > Is byte (and arrays and buffers of bytes) the only datatype you support? My PR is optimizing code paths that pull ints, longs, and other sizes out of the data buffers. Are those not necessary for any of the situations where you're using an async buffer? The input stream API is generally unaware of the datatypes of its contents and so those are the only apis I use. The other reason is that the ParquetFileReader returns Pages which basically contain metadata and ByteBuffers of _compressed_ data. The decompression and decoding into types comes much later in a downstream thread. For your PR, I don't think the AsyncMultiBufferInputStream is ever going to be in play in the paths you're optimizing. But just in case it is, your type aware methods will work as is because AsyncMultiBufferInputStream is derived from MultiBufferInputStream and will inherit those methods.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1136427603 > thanks., that means you are current with all shipping improvments. the main one extra is to use openFile(), passing in length and requesting randomio. this guarantees ranged GET requests and cuts the initial HEAD probe for existence/size of file. By `openFile()` do you mean `FileSystem.openFileWithOptions(Path,OpenFileParameters)`? While looking I realized the Parquet builds with a [much older version of hadoop](https://github.com/apache/parquet-mr/blob/a2da156b251d13bce1fa81eb95b555da04880bc1/pom.xml#L79) > > > have you benchmarked this change with abfs or google gcs connectors to see what difference it makes there? > > > No I have not. Would love help from anyone in the community with access to these. I only have access to S3. > > that I have. FWIW, with the right tuning of abfs prefetch (4 threads, 128 MB blocks) i can get full FTTH link rate from a remote store; 700 mbit/s . that's to the base station. once you add wifi the bottlenecks move. Wow! That is nearly as fast as local HDD. At this point the bottlenecks in parquet begin to move towards decompression and decoding but IO remains the slowest link in the chain. One thing we get with my PR is that the ParquetFileReader had assumptions built in that all data must be read before downstream can proceed. Some of my changes are related to removing these assumptions and ensuring that downstream processing does not block until an entire column is read so we get efficient pipelining. What does the 128 MB block mean? Is this the amount prefetched for a stream? The read API does not block until the entire block is filled, I presume. With my PR, parquet IO is reading 8MB at a time (default) and downstream is processing 1MB at a time (default) and several such streams (one per column) are in progress at the same time. Hopefully, this read pattern would work with the prefetch. 
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1135366352 @steveloughran thank you very much for taking the time to review and provide feedback! > 1. whose s3 client was used for testing here -if the s3a one, which hadoop release? I was working with s3a - Spark 3.2.1 Hadoop (Hadoop-aws) 3.3.2 AWS SDK 1.11.655 > 2. the azure abfs and gcs connectors do async prefetching of the next block, but are simply assuming that code will read sequentially; if there is another seek/readFully to a new location, those prefetches will be abandoned. there is work in s3a to do prefetching here with caching, so as to reduce the penalty of backwards seeks. https://issues.apache.org/jira/browse/HADOOP-18028 I haven't worked with abfs or gcs. If the connectors do async pre-fetching, that would be great. Essentially, the time the Parquet reader would have to block in the file system API would reduce substantially. In such a case, we could turn the async reader on/off and rerun the benchmark to compare. From past experience with the MaprFS which had very aggressive read ahead in its hdfs client, I would still expect better parquet speeds. The fact that the prefetch is turned off when a seek occurs is usual behaviour, but we may see no benefit from the connector in that case. So a combination of async reader and async connector might end up being a great solution (maybe at a slightly greater CPU utilization). We would still have to do a benchmark to see the real effect. The async version in this PR takes care of the sequential read requirement by a) opening a new stream for each column and ensuring every column is read sequentially. Footers are read using a separate stream. Except for the footer, no other stream ever seeks to a new location. b) The amount of data to be read is predetermined so there is never a read ahead that is discarded. 
> > hadoop is adding a vectored IO api intended for libraries like orc and parquet to be able to use, where the application provides an unordered list of ranges, a bytebuffer supplier and gets back a list of futures to wait for. the base implementation simply reads using readFully APi. s3a (and later abfs) will do full async retrieval itself, using the http connection pool. https://issues.apache.org/jira/browse/HADOOP-18103 > > both vectored io and s3a prefetching will ship this summer in hadoop 3.4.0. i don't see this change conflicting with this, though they may obsolete a lot of it. Yes, I became aware of this recently. I'm discussing integration of these efforts in a separate channel. At the moment I see no conflict, but have yet to determine how much of this async work would need to be changed. I suspect we may be able to eliminate or vastly simplify `AsyncMultiBufferInputStream`. > have you benchmarked this change with abfs or google gcs connectors to see what difference it makes there? No I have not. Would love help from anyone in the community with access to these. I only have access to S3.
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130698923 @theosib-amazon I applied my PR on top of your PR, ran thru some tests using Spark, and hit no issues. (All unit tests passed as well).
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130327546 > @parthchandra One thing that confuses me a bit is that these buffers have only ByteBuffer inside them. There's no actual I/O, so it's not possible to block. Do you have subclasses that provide some sort of access to real I/O? Good point. `MultiBufferInputStream` is constructed using buffers that have been filled already. `AsyncMultiBufferInputStream` takes an input stream as a parameter in the constructor and performs the IO itself. In `ByteBufferInputStream` I added
```
public static ByteBufferInputStream wrapAsync(ExecutorService threadPool,
    SeekableInputStream fileInputStream, List<ByteBuffer> buffers) {
  return new AsyncMultiBufferInputStream(threadPool, fileInputStream, buffers);
}
```
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130270383 > @parthchandra Would you mind having a look at my I/O performance optimization plan for ParquetMR? I think we should coordinate, since we have some ideas that might overlap what we touch. > https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing @theosib-amazon I read your document and went thru #960. It looks like, for the most part, #960 and this PR complement each other. The overlap I see is in the changes to `MultiBufferInputStream` where you have added the `readFully` and `skipFully` APIs. The bulk of my changes for async IO are in a class derived from `MultiBufferInputStream` and the heart of the changes depends on overriding `MultiBufferInputStream.nextBuffer`. In `MultiBufferInputStream.nextBuffer` the assumption is that all the buffers have been read into. In `AsyncMultiBufferInputStream.nextBuffer` this assumption is removed and the call *blocks* only if the next required buffer has not been read into. Now, `skipFully` and `readFully` are potentially blocking calls because both call `nextBuffer` repeatedly if necessary. To gain maximum pipelining, you want to make calls to `skipFully` and `readFully` such that you never block for too long (or at all) in the call. You will get this if you are skipping or reading less than the number of bytes in a single buffer. This is generally the case, as decompression and decoding happen at the page level and a page is smaller than a single buffer. However, for your optimizations, you should be aware of this behaviour. From what I see, I don't think there will be a conflict. I'll pull in your PR and give it a deeper look.
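(The blocking contract described above can be sketched in a few lines of plain Java. This is a simplified, self-contained illustration, not the PR's actual `AsyncMultiBufferInputStream`: all buffer reads are submitted to the IO pool up front, and the consumer blocks only if the buffer it needs next has not been filled yet.)

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical stand-in for the async stream: IO is fired off eagerly in
// the constructor, so nextBuffer() usually returns a completed result and
// only waits when the consumer gets ahead of the IO.
class PrefetchingBufferStream {
    private final List<Future<ByteBuffer>> inflight = new ArrayList<>();
    private int next = 0;

    PrefetchingBufferStream(ExecutorService ioPool, List<Callable<ByteBuffer>> reads) {
        for (Callable<ByteBuffer> read : reads) {
            inflight.add(ioPool.submit(read)); // submit every read immediately
        }
    }

    ByteBuffer nextBuffer() throws Exception {
        // Future.get() returns at once if the read already completed;
        // otherwise this is the only point where the consumer waits on IO
        return inflight.get(next++).get();
    }
}
```

Consumers that take less than one buffer's worth of bytes per call (e.g. page-sized `readFully` calls) therefore rarely hit the blocking path, which is what gives the pipelining.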
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1130229884 > Great effort! WIll have a look after the build succeed. @shangxinli I have no idea how to get the failed CI to pass. These failures appear to be in unrelated areas caused by some infra issues. Is there a way to trigger a rerun?
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1129363998 I have some numbers from an internal benchmark using Spark. I didn't see any benchmarks in the Parquet codebase that I could reuse. Here are the numbers from my own benchmark:
- 10 runs, each run reads all columns from store_sales (the largest table) in the TPC-DS (100G) dataset: `spark.sql("select * from store_sales")`
- Sync reader with default 8MB buffer size, Async reader with 1MB buffer size (achieves better pipelining)
- Run on Macbook Pro, reading from S3. Spark has 6 cores.
- All times in seconds

| Run | Async | Sync | Async (w/o outliers) | Sync (w/o outliers) |
| ---:| ---:| ---:| ---:| ---:|
| 1 | 84 | 102 | - | - |
| 2 | 90 | 366 | 90 | 366 |
| 3 | 78 | 156 | - | 156 |
| 4 | 84 | 128 | 84 | - |
| 5 | 108 | 402 | - | - |
| 6 | 90 | 432 | 90 | - |
| 7 | 84 | 378 | 84 | 378 |
| 8 | 108 | 324 | - | 324 |
| 9 | 90 | 318 | 90 | 318 |
| 10 | 90 | 282 | 90 | 282 |
| Average | 90.6 | 288.8 | 88 | 304 |
| Median | 90 | 321 | **90** | **321** |
| StdDev | 9.98 | 119. | | |

After removing the two highest and two lowest runs for each case, and taking the median value: Async: 90 sec Sync: 321 sec
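(As a sanity check, the trimmed medians above can be recomputed from the raw run times with a few lines of plain Java; the class and method names here are just for illustration.)

```java
import java.util.Arrays;

// Recomputes the summary rows above: drop the two highest and two lowest
// runs, then take the median of the remaining six values.
public class TrimmedMedian {
    static double trimmedMedian(long[] runs) {
        long[] sorted = runs.clone();
        Arrays.sort(sorted);
        // keep indices 2 .. length-3, i.e. drop the 2 lowest and 2 highest
        long[] kept = Arrays.copyOfRange(sorted, 2, sorted.length - 2);
        int n = kept.length;
        return (n % 2 == 1) ? kept[n / 2] : (kept[n / 2 - 1] + kept[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        long[] async = {84, 90, 78, 84, 108, 90, 84, 108, 90, 90};
        long[] sync  = {102, 366, 156, 128, 402, 432, 378, 324, 318, 282};
        System.out.println(trimmedMedian(async)); // 90.0
        System.out.println(trimmedMedian(sync));  // 321.0
    }
}
```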
parthchandra commented on PR #968: URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1129106516 Anyone know why the CI checks are failing with a SocketTimeout exception, and what to do to address this?