umehrot2 commented on pull request #1702: URL: https://github.com/apache/hudi/pull/1702#issuecomment-651716297
> Hi @umehrot2, very clean work 👍! I walked through this PR and found some common places we can share:
>
> * Path filtering.
> * User input paths handling and glob pattern.
> * Schema provider.
>
> I have a few questions.
>
> How should we define the user interface?
> Soon, we will have bootstrap view, read optimized view, snapshot (realtime) view, and incremental view. I am wondering whether we should unify the query interface and handle all the file formats internally. How about this:
> Snapshot view: bootstrap files + non-Hudi files + Hudi files + Hudi log
> Read optimized: bootstrap files + non-Hudi files + Hudi files
> Incremental: incremental view on top of snapshot
>
> How should we split the file groups?
> Right now we already have 4 different file groups. Once we add ORC support, there will be more. One of the cleanest ways I could find is to read each file group into an RDD independently and then union them together. In the current version of this PR, we handle regular parquet in `HudiBootstrapRDD`. The two disadvantages I could see:
>
> * After we add ORC support, the complexity of this RDD would increase if we handle the ORC reading here too.
> * IIUC, we didn't take full advantage of the vectorized reader by using `ColumnBatch` directly. Merging probably requires reading row by row, but for regular parquet files, we can use the default parquet reader.
>
> If we can find a way to efficiently list files in the driver, I think we can separate the bootstrap files from regular parquet and only use the `BootstrapRDD` to handle the files that need to be merged. Happy to discuss more here.

Thanks @garyli1019 for your review and for raising some interesting points. Yes, I think the pieces you mentioned can be reused later for your MOR datasource work. Regarding the user interface for queries, your proposal makes sense to me in general. We can flesh it out in more detail once our PRs are merged, and I am happy to collaborate on that.
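The "read each file group into an RDD independently, then union" idea from the quoted review can be sketched in miniature. This is a toy model only: plain Scala collections stand in for RDDs, and the reader functions and paths below are illustrative stand-ins, not Hudi or Spark APIs.

```scala
// Toy model of per-file-group readers plus a union (illustrative names only,
// not Hudi APIs; Seq stands in for RDD).
object FileGroupUnionSketch {
  final case class Row(path: String, source: String)

  // Hypothetical reader for bootstrapped file groups (skeleton + source file merged).
  def readBootstrapFileGroups(paths: Seq[String]): Seq[Row] =
    paths.map(p => Row(p, "bootstrap-merged"))

  // Hypothetical reader for regular parquet files via the default reader path.
  def readRegularParquet(paths: Seq[String]): Seq[Row] =
    paths.map(p => Row(p, "regular-parquet"))

  def main(args: Array[String]): Unit = {
    // Listing happens once in the driver; each subset is routed to the reader
    // that suits it, and the results are unioned into one logical dataset.
    val allPaths = Seq("fg1/skeleton.parquet", "fg2/data.parquet", "fg3/data.parquet")
    val (bootstrap, regular) = allPaths.partition(_.contains("skeleton"))
    val unioned = readBootstrapFileGroups(bootstrap) ++ readRegularParquet(regular)
    unioned.foreach(println)
  }
}
```

The appeal of this shape is that each reader stays format-specific and simple; the cost, as discussed below, is that the driver must classify files up front.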
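On the vectorized-reader point raised in the quoted review: the question is what it costs to iterate rows out of a columnar batch instead of consuming the batch directly. A toy model of that batch-to-row step, in plain Scala (`ToyBatch` is a stand-in for Spark's `ColumnarBatch`, not its real API):

```scala
// Toy model of columnar-batch-to-row iteration. The point: rows can be produced
// as per-row views over column arrays, without copying the batch itself.
object BatchToRowSketch {
  final case class ToyBatch(numRows: Int, columns: Array[Array[Int]]) {
    // Analogous in spirit to iterating rows out of a columnar batch: yield one
    // logical row at a time by indexing into each column at the same position.
    def rowIterator: Iterator[Seq[Int]] =
      (0 until numRows).iterator.map(r => columns.toSeq.map(col => col(r)))
  }

  def main(args: Array[String]): Unit = {
    // Two columns, three rows, stored column-wise as a vectorized reader would.
    val batch = ToyBatch(3, Array(Array(1, 2, 3), Array(10, 20, 30)))
    batch.rowIterator.foreach(row => println(row.mkString(",")))
  }
}
```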
Regarding your suggestion of using Spark's regular parquet reader for regular Hudi files and doing a `union` with the bootstrapped files:

- Complexity after ORC comes in: the current implementation is not tightly coupled to parquet. IIUC, it should just be a matter of initializing the readers with `OrcFileFormat` instead of `ParquetFileFormat`, which shouldn't make life difficult. Happy to hear your thoughts.
- Full advantage of the vectorized reader: I think I answered this in another comment you posted. At this point I need to do more research and gather data points on whether we are giving up some of the benefits of `vectorized reading`. What I know for sure is that the data from the file is read in a batch. Whether I am losing some performance by doing a row iteration over that batch, I am not sure. But I believe Spark's regular readers must be doing the batch-to-row conversion at some point as well. If you have more details on how Spark does this, do let me know, as it will be of great help. I will do some more research on this too.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
