aokolnychyi edited a comment on issue #1422: URL: https://github.com/apache/iceberg/issues/1422#issuecomment-687464819
I haven't formed an opinion on which approach is more promising right now. Here are my thoughts after a quick look:

**Option 1 (extension to the `TableScan` API)**

Pros:
- Reuses the existing filtering logic.
- The API could be reused by other query engines.
- Seems to require less code?

Cons:
- A substantial change to the core planning logic that requires thorough testing (both performance and correctness).
- Requires thinking about serialization, especially Kryo serialization, during planning (not needed before).

**Option 2 (metadata tables)**

Pros:
- Reuses the existing logic to read manifests in a scalable way via metadata tables.
- Reuses the existing logic for wrapping Spark `Row`s into Iceberg `DataFile`s.
- Doesn't touch the core planning API, so it is more isolated.
- Could perhaps be exposed as an action (does that make sense?)

Cons:
- Requires instantiating evaluators ourselves.
- Seems to require a bit more code, though I feel it can be simplified.
- Specific to Spark, but could be implemented in other systems that support metadata tables.

Implementation aside, we need to consider when to apply this. Ideally, we would have some sort of threshold on the size of the manifests we need to scan after manifest filtering: if we narrow the scope down to a couple of manifests, plan locally; otherwise, plan using Spark. I am not sure it will be that easy, though.

Instead, we could support a Spark read option that tells us which mode to use, with values `local`, `distributed`, and `auto`. In `auto`, the simplest option is to analyze the scan condition: if there is a reasonably selective partition predicate (e.g. `equals` or `inSet`), always plan locally; if not, and if distributed planning is enabled, leverage Spark.
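To make the "instantiating evaluators ourselves" point under Option 2 concrete, here is a minimal, purely illustrative sketch. None of these names (`LocalFileFilter`, `FileEntry`, `filterByPartition`) come from the Iceberg API; the real implementation would wrap Spark `Row`s read from a metadata table into Iceberg `DataFile`s and apply Iceberg's expression evaluators rather than a hand-rolled string comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalFileFilter {
  // Hypothetical stand-in for a data-file entry read from a metadata table.
  public static class FileEntry {
    public final String path;
    public final String partitionValue;

    public FileEntry(String path, String partitionValue) {
      this.path = path;
      this.partitionValue = partitionValue;
    }
  }

  // Keep only files whose partition value satisfies an `equals` predicate,
  // the way an evaluator would prune data files during planning.
  public static List<FileEntry> filterByPartition(List<FileEntry> files, String value) {
    List<FileEntry> matched = new ArrayList<>();
    for (FileEntry f : files) {
      if (f.partitionValue.equals(value)) {
        matched.add(f);
      }
    }
    return matched;
  }
}
```

In the actual Option 2 code path, this filtering step would run inside a Spark job over the metadata table rather than on a local list.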
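The mode-selection idea above could be sketched as a small decision function. This is only an assumption about how `auto` might resolve, and the names (`PlanningMode`, `PlanChooser`, `resolve`, the `hasSelectivePartitionPredicate` flag) are hypothetical, not part of any Iceberg or Spark API:

```java
public class PlanChooser {
  // The three values a hypothetical Spark read option could take.
  public enum PlanningMode { LOCAL, DISTRIBUTED, AUTO }

  /**
   * Resolve the requested mode to a concrete plan. In AUTO, plan locally
   * only when the scan condition contains a reasonably selective partition
   * predicate (e.g. equals or inSet); otherwise fall back to Spark.
   */
  public static PlanningMode resolve(PlanningMode requested,
                                     boolean hasSelectivePartitionPredicate) {
    if (requested != PlanningMode.AUTO) {
      return requested;  // explicit local/distributed always wins
    }
    return hasSelectivePartitionPredicate
        ? PlanningMode.LOCAL
        : PlanningMode.DISTRIBUTED;
  }
}
```

A size-of-remaining-manifests threshold could later replace or refine the boolean predicate check without changing the shape of this decision.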
