I would like to resume this topic. What should the proper API for migration look like?
I have a couple of questions in mind:

- Right now, it is based on a Spark job. Do we want to keep it that way, given that the number of files might be huge? Will it be too much for the driver?
- What should be the scope? Do we want the migration to modify the catalog?
- If we stick to having Dataset[DataFiles], we will have one append per dataset partition. How do we want to guarantee that either all files are appended or none? One way is to rely on temporary tables, but this requires looking into the catalog. Another approach is to say that if the migration isn't successful, users will need to remove the metadata and retry. (A rough sketch of doing everything in a single commit is at the end of this message.)

Any ideas would be welcome.

Thanks,
Anton

> On 20 Mar 2019, at 16:27, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> We can always consider moving interfaces from core to the API when they are
> relatively stable. For SPI classes, I'm not sure there's a good reason to do
> it because the API and SPI are purposely different. API is for using Iceberg
> from a processing engine or data management service. SPI is for customizing
> how the core table implementations work. Maybe we could move SPI to its own
> module, eventually.
>
> On Tue, Mar 19, 2019 at 8:36 PM Sandeep Nayak <osgig...@gmail.com> wrote:
>
> Thanks for pointing me in the right direction, Ryan.
>
> One question comes to mind: `TableOperations` is in the core project, but
> the interface is identified as an SPI. If one were to plug in other
> metastores, should we consider restricting any such extensions to be only
> against classes in api, or maybe a separate spi sub-project vs. the core
> sub-project? That pattern would allow for restricting dependencies in one
> isolated area and allowing other areas (like core) to evolve a bit more
> freely without worrying about breaking any dependents.
>
> Thoughts?
>
> -Sandeep
>
> On Tue, Mar 19, 2019 at 2:24 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Sandeep, we use the `TableOperations` API to plug in other metastores, as
> well as other customizations like `FileIO` implementations. Thanks to Matt
> and Yifei for their work extending this area!
>
> On Mon, Mar 18, 2019 at 8:49 PM Sandeep Nayak <osgig...@gmail.com> wrote:
>
> To Xabriel's point, it would be good to have a Store abstraction so that one
> could plug in an implementation, be it HMS or something else.
>
> On Mon, Mar 18, 2019 at 3:20 PM Xabriel Collazo Mojica <xcoll...@adobe.com.invalid> wrote:
>
> +1 for having a tool/API to migrate tables from HMS into Iceberg.
>
> We do not use HMS in my current project, but since HMS is the de facto
> catalog in most companies doing Hadoop, I think such a tool would be vital
> for incentivizing Iceberg adoption and/or PoCs.
>
> Xabriel J Collazo Mojica | Senior Software Engineer | Adobe | xcoll...@adobe.com
>
> From: <aokolnyc...@apple.com> on behalf of Anton Okolnychyi <aokolnyc...@apple.com.INVALID>
> Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
> Date: Monday, March 18, 2019 at 2:22 PM
> To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, Ryan Blue <rb...@netflix.com>
> Subject: Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore
>
> I definitely support this idea.
> Having a clean and reliable API to migrate existing Spark tables to Iceberg
> will be helpful.
>
> I propose to collect all requirements for the new API in this thread. Then I
> can come up with a doc that we will discuss within the community.
>
> From the feature perspective, I think it would be important to support tables
> that persist partition information in HMS as well as tables that derive
> partition information from the folder structure. Also, migrating just a
> partition of a table would be useful.
>
> On 18 Mar 2019, at 18:28, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> I think that would be fine, but I want to throw out a quick warning:
> SparkTableUtil was initially written as a few handy helpers, so it wasn't
> well designed as an API. It's really useful, so I can understand wanting to
> extend it. But should we come up with a real API for these conversion tasks
> instead of updating the hacks?
>
> On Mon, Mar 18, 2019 at 11:11 AM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
> Hi,
>
> SparkTableUtil can be helpful for migrating existing Spark tables into
> Iceberg. Right now, SparkTableUtil assumes that the partition information is
> always tracked in Hive metastore.
>
> What about extending SparkTableUtil to handle Spark tables that don't rely on
> Hive metastore? I have a local prototype that makes use of Spark
> InMemoryFileIndex to infer the partitioning info.
>
> Thanks,
> Anton
>
> --
> Ryan Blue
> Software Engineer
> Netflix
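For the atomicity question above, here is a minimal sketch of the single-commit idea, assuming the DataFile metadata (not the data itself) can first be collected to the driver. The `appendAll` helper and the `dataFiles`/`tableLocation` names are placeholders, and HadoopTables is used only as one example of how the table could be loaded:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.{DataFile, Table}
import org.apache.iceberg.hadoop.HadoopTables

// Hypothetical helper: add every discovered DataFile through one AppendFiles
// operation, so the migration produces a single snapshot and is all-or-nothing.
def appendAll(dataFiles: Seq[DataFile], tableLocation: String): Unit = {
  // Loading via HadoopTables is just an example; a Hive-backed table would
  // be loaded through its own catalog/TableOperations instead.
  val table: Table = new HadoopTables(new Configuration()).load(tableLocation)

  // One AppendFiles means one commit: either every file becomes visible in
  // the new snapshot or, if the commit fails, none of them do.
  val append = table.newAppend()
  dataFiles.foreach(f => append.appendFile(f))
  append.commit()
}
```

Whether collecting all DataFile entries to the driver is acceptable ties back to the first question; the metadata is much smaller than the data, but for very large tables it may still be worth measuring.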