umehrot2 commented on pull request #1702: URL: https://github.com/apache/hudi/pull/1702#issuecomment-651716297
> Hi @umehrot2, very clean work 👍! I walked through this PR and found some common places we can share:
>
> * Path filtering.
> * User input paths handling and glob pattern.
> * Schema provider.
>
> I have a few questions.
>
> How should we define the user interface?
> Soon, we will have bootstrap view, read optimized view, snapshot (realtime) view, and incremental view. I am wondering whether we should unify the query interface and handle all the file formats internally. How about this:
> Snapshot view: bootstrap files + non-Hudi files + Hudi files + Hudi log
> Read optimized: bootstrap files + non-Hudi files + Hudi files
> Incremental: incremental view on top of snapshot
>
> How should we split the file groups?
> Right now we already have 4 different file groups. Once we add ORC support, there will be more. One of the cleanest ways I could find is to read each file group into an RDD independently and then union them together. In the current version of this PR, we handle regular parquet in `HudiBootstrapRDD`. The two disadvantages I could see:
>
> * After we add ORC support, the complexity of this RDD would increase if we handle the ORC reading here too.
> * IIUC, we didn't take full advantage of the vectorized reader by using `ColumnBatch` directly. Merging probably requires reading row by row, but for regular parquet files, we can use the default parquet reader.
>
> If we can find a way to efficiently list files in the driver, I think we can separate the bootstrap files from regular parquet and only use the `BootstrapRDD` to handle the files that need to be merged. Happy to discuss more here.

Thanks @garyli1019 for your review and for raising some interesting points. Yes, I think the pieces you mentioned can be reused later for your MOR datasource work. Regarding the user interface for queries, your proposal makes sense to me in general. We can flesh it out in more detail once our PRs are merged, and I am happy to collaborate on that.
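The "read each file group into an RDD independently, then union" idea from the quoted review can be sketched in miniature. This is a toy model only: plain Scala collections stand in for RDDs, and the reader functions and paths below are illustrative stand-ins, not Hudi or Spark APIs.

```scala
// Toy model of per-file-group readers plus a union (illustrative names only,
// not Hudi APIs; Seq stands in for RDD).
object FileGroupUnionSketch {
  final case class Row(path: String, source: String)

  // Hypothetical reader for bootstrapped file groups (skeleton + source file merged).
  def readBootstrapFileGroups(paths: Seq[String]): Seq[Row] =
    paths.map(p => Row(p, "bootstrap-merged"))

  // Hypothetical reader for regular parquet files via the default reader path.
  def readRegularParquet(paths: Seq[String]): Seq[Row] =
    paths.map(p => Row(p, "regular-parquet"))

  def main(args: Array[String]): Unit = {
    // Listing happens once in the driver; each subset is routed to the reader
    // that suits it, and the results are unioned into one logical dataset.
    val allPaths = Seq("fg1/skeleton.parquet", "fg2/data.parquet", "fg3/data.parquet")
    val (bootstrap, regular) = allPaths.partition(_.contains("skeleton"))
    val unioned = readBootstrapFileGroups(bootstrap) ++ readRegularParquet(regular)
    unioned.foreach(println)
  }
}
```

The appeal of this shape is that each reader stays format-specific and simple; the cost, as discussed below, is that the driver must classify files up front.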
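On the vectorized-reader point raised in the quoted review: the question is what it costs to iterate rows out of a columnar batch instead of consuming the batch directly. A toy model of that batch-to-row step, in plain Scala (`ToyBatch` is a stand-in for Spark's `ColumnarBatch`, not its real API):

```scala
// Toy model of columnar-batch-to-row iteration. The point: rows can be produced
// as per-row views over column arrays, without copying the batch itself.
object BatchToRowSketch {
  final case class ToyBatch(numRows: Int, columns: Array[Array[Int]]) {
    // Analogous in spirit to iterating rows out of a columnar batch: yield one
    // logical row at a time by indexing into each column at the same position.
    def rowIterator: Iterator[Seq[Int]] =
      (0 until numRows).iterator.map(r => columns.toSeq.map(col => col(r)))
  }

  def main(args: Array[String]): Unit = {
    // Two columns, three rows, stored column-wise as a vectorized reader would.
    val batch = ToyBatch(3, Array(Array(1, 2, 3), Array(10, 20, 30)))
    batch.rowIterator.foreach(row => println(row.mkString(",")))
  }
}
```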
Regarding your suggestion of using Spark's regular parquet reader for regular Hudi files and doing a `union` with the bootstrapped files:

- Complexity after ORC comes in: the current implementation is not tightly coupled to parquet. IIUC, it should just be a matter of initializing the readers with `OrcFileFormat` instead of `ParquetFileFormat`, which shouldn't make life difficult. Happy to hear your thoughts.
- Full advantage of the vectorized reader: I think I answered this in another comment you posted. At this point I need to do more research and gather data points on whether we are giving up some of the benefits of `vectorized reading`. What I know for sure is that the data from the file is read in a batch. Whether I am losing some performance by doing a row iteration over that batch, I am not sure. But I believe Spark's regular readers must be doing the batch-to-row conversion at some point as well. If you have more details on how Spark does this, do let me know, as it will be of great help. I will do some more research on this too.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
