Sandeep, we use the `TableOperations` API to plug in other metastores, as
well as other customizations like `FileIO` implementations. Thanks to Matt
and Yifei for their work extending this area!
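
For anyone following along, plugging in a different store mostly comes down
to implementing `TableOperations`. Here is a rough sketch against a
hypothetical key-value store; `KeyValueStore`, its `get`/`compareAndSwap`
methods, and the class name are placeholders, not Iceberg classes:

    import org.apache.iceberg.{LocationProviders, TableMetadata, TableMetadataParser, TableOperations}
    import org.apache.iceberg.exceptions.CommitFailedException
    import org.apache.iceberg.io.{FileIO, LocationProvider}

    // KeyValueStore stands in for whatever client your metastore exposes; it
    // only needs get(key) and an atomic compareAndSwap(key, expected, updated)
    class KeyValueStoreOperations(fileIO: FileIO, store: KeyValueStore, tableKey: String)
        extends TableOperations {

      private var currentMetadata: TableMetadata = _
      private var currentLocation: String = _

      override def current(): TableMetadata = {
        if (currentMetadata == null) refresh()
        currentMetadata
      }

      override def refresh(): TableMetadata = {
        // the store tracks only a pointer to the latest metadata file
        currentLocation = store.get(tableKey)
        currentMetadata = TableMetadataParser.read(fileIO, currentLocation)
        currentMetadata
      }

      override def commit(base: TableMetadata, metadata: TableMetadata): Unit = {
        // write the new metadata file, then atomically swap the pointer
        val newLocation = metadataFileLocation(s"${System.currentTimeMillis}.metadata.json")
        TableMetadataParser.write(metadata, fileIO.newOutputFile(newLocation))
        if (!store.compareAndSwap(tableKey, currentLocation, newLocation)) {
          throw new CommitFailedException("Concurrent update to table %s", tableKey)
        }
        refresh()
      }

      override def io(): FileIO = fileIO

      override def metadataFileLocation(fileName: String): String =
        s"${current().location()}/metadata/$fileName"

      override def locationProvider(): LocationProvider =
        LocationProviders.locationsFor(current().location(), current().properties())
    }

The one property a store has to provide is that atomic pointer swap in
commit(): Iceberg's concurrency guarantees come from swapping the metadata
location atomically, and everything else is regular file IO.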

On Mon, Mar 18, 2019 at 8:49 PM Sandeep Nayak <osgig...@gmail.com> wrote:

> To Xabriel's point, it would be good to have a Store abstraction so that
> one could plug in an implementation, be it HMS or something else.
>
>
> On Mon, Mar 18, 2019 at 3:20 PM Xabriel Collazo Mojica
> <xcoll...@adobe.com.invalid> wrote:
>
>> +1 for having a tool/API to migrate tables from HMS into Iceberg.
>>
>>
>>
>> We do not use HMS in my current project, but since HMS is the de facto
>> catalog at most companies running Hadoop, I think such a tool would be
>> vital for driving Iceberg adoption and PoCs.
>>
>>
>>
>> *Xabriel J Collazo Mojica*  |  Senior Software Engineer  |  Adobe  |
>> xcoll...@adobe.com
>>
>>
>>
>> From: <aokolnyc...@apple.com> on behalf of Anton Okolnychyi
>> <aokolnyc...@apple.com.INVALID>
>> Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
>> Date: Monday, March 18, 2019 at 2:22 PM
>> To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, Ryan Blue <
>> rb...@netflix.com>
>> Subject: Re: Extend SparkTableUtil to Handle Tables Not Tracked in
>> Hive Metastore
>>
>>
>>
>> I definitely support this idea. Having a clean and reliable API to
>> migrate existing Spark tables to Iceberg will be helpful.
>>
>> I propose that we collect all the requirements for the new API in this
>> thread. Then I can put together a doc for the community to discuss.
>>
>>
>>
>> From the feature perspective, I think it is important to support both
>> tables that persist partition information in HMS and tables that derive
>> partition information from the folder structure. Migrating just a single
>> partition of a table would also be useful.
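>>
>> For concreteness, the entry points could look roughly like this (all names
>> and signatures are placeholders, not a proposal):
>>
>>   // illustrative only; a partition is identified by a column -> value map
>>   def importTable(source: TableIdentifier, targetTable: Table): Unit
>>   def importPathTable(location: String, format: String, targetTable: Table): Unit
>>   def importPartition(source: TableIdentifier,
>>                       partition: Map[String, String],
>>                       targetTable: Table): Unit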
>>
>>
>>
>>
>>
>> On 18 Mar 2019, at 18:28, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>
>>
>>
>> I think that would be fine, but I want to throw out a quick warning:
>> SparkTableUtil was initially written as a few handy helpers, so it wasn't
>> well designed as an API. It's really useful, so I can understand wanting to
>> extend it. But should we come up with a real API for these conversion tasks
>> instead of updating the hacks?
>>
>>
>>
>> On Mon, Mar 18, 2019 at 11:11 AM Anton Okolnychyi <
>> aokolnyc...@apple.com.invalid> wrote:
>>
>> Hi,
>>
>> SparkTableUtil can be helpful for migrating existing Spark tables into
>> Iceberg. Right now, SparkTableUtil assumes that the partition information
>> is always tracked in the Hive metastore.
>>
>> What about extending SparkTableUtil to handle Spark tables that don't rely
>> on the Hive metastore? I have a local prototype that uses Spark's
>> InMemoryFileIndex to infer the partition information.
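>>
>> For illustration, the inference can look roughly like this (a sketch
>> against Spark 2.x internals, not the prototype itself; InMemoryFileIndex
>> is an internal API, so the constructor may differ between releases):
>>
>>   import org.apache.hadoop.fs.Path
>>   import org.apache.spark.sql.SparkSession
>>   import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
>>
>>   // lists partitions by scanning the table location; partition columns
>>   // and values are inferred from directory names (e.g. dt=2019-03-18)
>>   def listPartitions(spark: SparkSession,
>>                      location: String): Seq[(Map[String, String], String)] = {
>>     val index = new InMemoryFileIndex(spark, Seq(new Path(location)), Map.empty, None)
>>     val spec = index.partitionSpec()
>>     spec.partitions.map { p =>
>>       val values = spec.partitionColumns.fields.zipWithIndex.map {
>>         case (field, i) => field.name -> String.valueOf(p.values.get(i, field.dataType))
>>       }.toMap
>>       (values, p.path.toString)
>>     }
>>   }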
>>
>> Thanks,
>> Anton
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
