Hey Iceberg Community,

Sorry for being late to this conversation. I just wanted to share that I'm against deprecating HadoopCatalog or moving it to tests. Currently Impala relies heavily on HadoopCatalog for its own tests, and I personally find HadoopCatalog pretty handy when I just want to do some cross-engine experiments where my data is already on HDFS: I write a table with engineA, check whether engineB can read it, and I don't have to bother with setting up any service (HMS, for instance) to act as an Iceberg catalog.
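
To illustrate the kind of experiment I mean, the whole setup is roughly the snippet below (a sketch; the warehouse path and table names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;

    // point the catalog at a warehouse directory on HDFS; no catalog
    // service (HMS, JDBC, REST) is involved, table state lives in files
    Configuration conf = new Configuration();
    HadoopCatalog catalog = new HadoopCatalog(conf, "hdfs://namenode:8020/warehouse");

    // engineA already wrote db.events; check what engineB would see
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    System.out.println(table.currentSnapshot());
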
I believe that even though we don't consider HadoopCatalog a production-grade solution as it is now, it has its benefits for lightweight experimentation.

- I'm +1 for keeping HadoopCatalog.
- We should emphasize that HDFS is the intended storage for HadoopCatalog (can we enforce this in the code?).
- Apparently, part of this community is open to adding enhancements to HadoopCatalog to bring it closer to production grade (lisoda). I don't think we should block these contributions.
- If we say that the REST catalog is preferred over HadoopCatalog, I think the Iceberg project should offer its own open-source implementation, available for everyone.

Regards,
Gabor

On Thu, Jul 25, 2024 at 9:04 PM Ryan Blue <b...@databricks.com.invalid> wrote:

> There are ways to use object store or file system features to do this, but there are a lot of variations. Building implementations and trying to standardize each one is a lot of work. And then you still get a catalog that doesn't support important features.
>
> I don't think that this is a good direction to build for the Iceberg project. But I also have no objection to someone doing it in a different project that uses the Iceberg metadata format.
>
> On Tue, Jul 23, 2024 at 5:57 PM lisoda <lis...@yeah.net> wrote:
>
>> Sir, regarding this point, we have some experience. In my view, as long as the file system supports atomic single-file writes, where the file becomes visible to other clients immediately after a successful write, that is sufficient: we can do without the rename operation as long as the file system guarantees this. Of course, if the object storage system supports mutual exclusion, we can also uniformly use the rename operation for committing. In theory, this avoids having to provide a large number of commit strategies for different file systems.
>>
>> ---- Replied Message ----
>> From: Jack Ye <yezhao...@gmail.com>
>> Date: 07/24/2024 02:52
>> To: dev@iceberg.apache.org
>> Subject: Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0
>>
>> If we come up with a new storage-only catalog implementation that could solve those limitations and also leverage the new features being developed in object storage, would that be a potential alternative strategy? HadoopCatalog users would then have a way to move forward with a storage-only catalog that can still run on HDFS, and we could fully deprecate HadoopCatalog.
>>
>> -Jack
>>
>> On Tue, Jul 23, 2024 at 10:00 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>
>>> I don't think we would want to put this in a module with other catalog implementations. It has serious limitations and is actively discouraged, while the other catalog implementations still have value, either as REST back-end catalogs or as regular catalogs for many users.
>>>
>>> On Tue, Jul 23, 2024 at 9:11 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> For some additional information, we also have some Iceberg HDFS users on EMR. Those are mainly users with long-running Hadoop and HBase installations, which they typically refresh every 1-2 years. From my understanding, they use S3 for data storage, but metadata is kept in the local HDFS cluster, so HadoopCatalog works well for them.
>>>>
>>>> I remember we discussed moving all catalog implementations in the main repo to a separate iceberg-catalogs repo. Could we do this move as a part of that effort?
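>>>>
>>>> For what it's worth, that metadata-on-HDFS, data-on-S3 layout is easy to express. A rough sketch (bucket, host, and schema are made up, and I have not tested this exact snippet):
>>>>
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.iceberg.PartitionSpec;
>>>>     import org.apache.iceberg.Schema;
>>>>     import org.apache.iceberg.catalog.TableIdentifier;
>>>>     import org.apache.iceberg.hadoop.HadoopCatalog;
>>>>     import org.apache.iceberg.types.Types;
>>>>
>>>>     // metadata (and the commit protocol) stay on HDFS...
>>>>     HadoopCatalog catalog =
>>>>         new HadoopCatalog(new Configuration(), "hdfs://namenode:8020/warehouse");
>>>>
>>>>     Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));
>>>>     catalog.buildTable(TableIdentifier.of("db", "events"), schema)
>>>>         .withPartitionSpec(PartitionSpec.unpartitioned())
>>>>         // ...while data files land in S3
>>>>         .withProperty("write.data.path", "s3a://my-bucket/db/events/data")
>>>>         .create();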
>>>>
>>>> -Jack
>>>>
>>>> On Tue, Jul 23, 2024 at 8:46 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>>>
>>>>> Thanks for the context, lisoda. I agree that it's good to understand the issues you're facing with the HadoopCatalog. One follow-up question that I have is what the underlying storage is. Are you using HDFS for those 30,000 customers?
>>>>>
>>>>> I think you're right that there is a challenge to migrating. Because there is no catalog requirement, it's hard to make sure you have all of the writers migrated. I think that means we do need to have a plan or recommendation for people currently using this catalog in production, but it also puts more pressure on us to deprecate this catalog and avoid more people having this problem.
>>>>>
>>>>> I think it's a good idea to make the spec change, which we have agreement on, and to ensure that the FS catalog and table operations are properly deprecated to show that they should not be used. I'm not sure whether there is support in the community for moving the implementation into a new iceberg-hadoop module, but at a minimum we can't just remove this right away. I think that a separate iceberg-hadoop module would make the most sense.
>>>>>
>>>>> On Thu, Jul 18, 2024 at 11:09 PM lisoda <lis...@yeah.net> wrote:
>>>>>
>>>>>> Hi team.
>>>>>> I am not a PMC member, just a regular user. Instead of discussing whether HadoopCatalog needs to continue to exist, I'd like to share a more practical issue.
>>>>>>
>>>>>> We currently serve over 30,000 customers, all of whom use Iceberg to store their foundational data, and all business analyses are conducted based on Iceberg. However, all the Iceberg tables use hadoop_catalog. At least, this has been the case since I started working with our production environment system.
>>>>>>
>>>>>> In recent days, I've attempted to migrate hadoop_catalog to jdbc-catalog, but I failed. We store 2PB of data, and replacing the current catalogs has become an almost impossible task. Users not only create hadoop_catalog tables through Spark, they also continuously write data into Iceberg as hadoop_catalog tables through third-party OLAP systems, Flink, and other means. Given this situation, we can only continue to fix hadoop_catalog and provide services to customers.
>>>>>>
>>>>>> I understand that the community wants to make a big push into rest-catalog, and I agree with the direction the community is going. But considering that there might be a significant number of users facing similar issues, can we at least retain a module similar to iceberg-hadoop to extend hadoop_catalog? If it is removed, we won't be able to continue providing services to customers. So, if possible, please consider this option.
>>>>>>
>>>>>> Thank you all.
>>>>>>
>>>>>> Kind regards,
>>>>>> lisoda
>>>>>>
>>>>>> At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:
>>>>>>
>>>>>> Thank you for bringing this up, Ryan. I have also been in the camp of saying HadoopCatalog is not recommended, but after thinking about this more deeply last night, I now have mixed feelings about this topic.
>>>>>> Just to comment on the reasons you listed first:
>>>>>>
>>>>>> * For reasons 1 & 2, it looks like the root cause is that people try to use HadoopCatalog outside native HDFS because there are HDFS connectors to other storages, like S3AFileSystem. However, the norm for such usage has been that those connectors do not strictly follow HDFS semantics, and it is assumed that people acknowledge the implications of such usage and accept the risk. For example, S3AFileSystem was there even before S3 was strongly consistent, but people have been using it to write files.
>>>>>>
>>>>>> * For reason 3, there are multiple catalogs that do not support all operations (e.g. Glue for atomic table rename) and people still widely use them.
>>>>>>
>>>>>> * For reason 4, I see that more as a missing feature. More features could definitely be developed in that catalog implementation.
>>>>>>
>>>>>> So the key question to me is: how can we prevent people from using HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because it is a storage-only solution. For object storages specifically, HadoopCatalog is not suitable for 2 reasons:
>>>>>>
>>>>>> (1) file writes do not enforce mutual exclusion, so the catalog cannot enforce Iceberg's optimistic concurrency requirement (i.e. it cannot do an atomic compare-and-swap)
>>>>>>
>>>>>> (2) the directory-based design is not preferred in object storage and will result in bad performance.
>>>>>>
>>>>>> However, now that I look at these 2 issues, they are getting outdated.
>>>>>>
>>>>>> (1) object storage is starting to enforce file mutual exclusion. GCS supports a file generation number [1] that increments monotonically, and you can use x-goog-if-generation-match [2] to perform an atomic swap. A similar feature [3] exists in Azure Blob Storage. I cannot speak for the S3 team roadmap, but Amazon S3 is clearly falling behind in this domain, and with market competition, it is very clear that similar features will come in the reasonably near future.
>>>>>>
>>>>>> (2) directory buckets are becoming the norm. Amazon S3 announced directory buckets at 2023 re:Invent [4], which do not have the same performance limitations even if you have very nested folders and many objects in a folder. GCS also has a similar feature, launched in preview [5] right now. Azure has had this feature since 2021 [6].
>>>>>>
>>>>>> With these new developments in the industry, a storage-only Iceberg catalog becomes very attractive. It is simple, with only one service dependency. It can safely perform an atomic compare-and-swap. It is performant, without the need to worry about folder and file organization. If you want to add additional features for things like access control, there are also integrations like access grants [7] that can do it in a very scalable way.
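>>>>>>
>>>>>> To make the atomic swap in point (1) concrete, here is a rough sketch against the GCS Java client (bucket, object names, and the pointer payload are made up; treat it as an illustration of the precondition, not a vetted commit implementation):
>>>>>>
>>>>>>     import java.nio.charset.StandardCharsets;
>>>>>>     import com.google.cloud.storage.Blob;
>>>>>>     import com.google.cloud.storage.BlobId;
>>>>>>     import com.google.cloud.storage.BlobInfo;
>>>>>>     import com.google.cloud.storage.Storage;
>>>>>>     import com.google.cloud.storage.StorageException;
>>>>>>     import com.google.cloud.storage.StorageOptions;
>>>>>>
>>>>>>     Storage storage = StorageOptions.getDefaultInstance().getService();
>>>>>>
>>>>>>     // read the current metadata pointer and remember its generation
>>>>>>     Blob current = storage.get(BlobId.of("bucket", "db/t/metadata/pointer"));
>>>>>>     long generation = current.getGeneration();
>>>>>>
>>>>>>     // generationMatch() maps to x-goog-if-generation-match: the write
>>>>>>     // succeeds only if nobody replaced the pointer since we read it
>>>>>>     byte[] newPointer = "v43.metadata.json".getBytes(StandardCharsets.UTF_8);
>>>>>>     try {
>>>>>>       storage.create(
>>>>>>           BlobInfo.newBuilder(BlobId.of("bucket", "db/t/metadata/pointer", generation)).build(),
>>>>>>           newPointer,
>>>>>>           Storage.BlobTargetOption.generationMatch());
>>>>>>     } catch (StorageException e) {
>>>>>>       // precondition failed: a concurrent committer won, refresh and retry
>>>>>>     }
>>>>>>
>>>>>> That compare-and-swap is essentially the whole commit protocol a storage-only catalog needs.
>>>>>>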
>>>>>> I know the direction in the community so far is to go with the REST catalog, and I am personally a big advocate for that. However, that requires either building a full REST catalog or choosing a catalog vendor that supports REST. There are many capabilities that REST would unlock, but those are visions that I expect will take many years for the community to drive consensus on and build. If I am the CTO of a small company and I just want an Iceberg data lake(house) right now, do I choose REST, or do I choose (or even just build) a storage-only Iceberg catalog? I feel I would actually choose the latter.
>>>>>>
>>>>>> Going back to the discussion points, my current take on this topic is:
>>>>>>
>>>>>> (1) +1 for clarifying in the spec that HadoopCatalog should only work with HDFS.
>>>>>>
>>>>>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default (e.g. fail if using S3A), but we should allow a feature flag to unblock the usage, so that people can use it after understanding the implications and risks, just like how people use S3A today.
>>>>>>
>>>>>> (3) +0 for removing HadoopCatalog from the core library. It could be in a different module like iceberg-hdfs if that is more suitable.
>>>>>>
>>>>>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid use case for Iceberg. With measures 1-3 above in place, people who actually have an HDFS use case should be able to continue to innovate on and optimize the HadoopCatalog implementation. Although "HDFS is becoming much less common", looking at GitHub issues and discussion forums, it still has a pretty big user base.
>>>>>>
>>>>>> (5) In general, I propose we separate the discussion of HadoopCatalog from that of a "storage-only catalog" that also deals with other object stores. With these latest industry developments, we should evaluate the direction of building a storage-only Iceberg catalog and see whether the community has an interest in that. I can raise a thread about it after this discussion is closed.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> [1] https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>>>>>> [2] https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>>>>>> [3] https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>>>>>> [4] https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>>>>>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>>>>>> [6] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>>>>>> [7] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>>>
>>>>>>> +1 on deprecating now and removing them from the codebase with Iceberg 2.0
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 on deprecating the `File System Tables` from the spec and `HadoopCatalog`/`HadoopTableOperations` in code for now, and removing them permanently in the 2.0 release.
>>>>>>>>
>>>>>>>> For testing we can use `InMemoryCatalog`, as others mentioned (see the sketch below).
>>>>>>>>
>>>>>>>> I am not sure about moving them to tests or keeping them only for HDFS, because it leads to confusion for existing users of the Hadoop catalog.
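>>>>>>>>
>>>>>>>> To expand on the `InMemoryCatalog` point, tests would look roughly like this (a sketch; the identifiers are made up):
>>>>>>>>
>>>>>>>>     import java.util.Map;
>>>>>>>>     import org.apache.iceberg.Schema;
>>>>>>>>     import org.apache.iceberg.Table;
>>>>>>>>     import org.apache.iceberg.catalog.Namespace;
>>>>>>>>     import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>     import org.apache.iceberg.inmemory.InMemoryCatalog;
>>>>>>>>     import org.apache.iceberg.types.Types;
>>>>>>>>
>>>>>>>>     // everything lives in memory: no file system, no catalog service
>>>>>>>>     InMemoryCatalog catalog = new InMemoryCatalog();
>>>>>>>>     catalog.initialize("test", Map.of());
>>>>>>>>     catalog.createNamespace(Namespace.of("db"));
>>>>>>>>
>>>>>>>>     Schema schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));
>>>>>>>>     Table table = catalog.createTable(TableIdentifier.of("db", "t"), schema);
>>>>>>>>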
>>>>>>>> I wanted to have it deprecated 2 years ago <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>, and I remember that we discussed it in a sync at the time and left it as is. Also, when a user recently brought up the LockManager and refactoring HadoopTableOperations in Slack <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>, I asked them to open this discussion on the mailing list, so that we can conclude it once and for all.
>>>>>>>>
>>>>>>>> - Ajantha
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey Ryan and others,
>>>>>>>>>
>>>>>>>>> Thanks for bringing this up. I would be in favor of removing the HadoopTableOperations, mostly because of the reasons that you already mentioned, but also because it is not fully in line with the first principles of Iceberg (being object-store native), as it uses file listing.
>>>>>>>>>
>>>>>>>>> I think we should deprecate the HadoopTables to raise the attention of their users. I would be reluctant to move it to test just to use it for testing purposes; I'd rather remove it and replace its use in tests with the InMemoryCatalog.
>>>>>>>>>
>>>>>>>>> Regarding the StaticTable, this is an easy way to have a read-only table by directly pointing to the metadata. This also lives in Java under StaticTableOperations <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>. It isn't a full-blown catalog where you can list {tables,schemas}, update tables, etc. As ZENOTME pointed out already, it is all up to the user; for example, there is no listing of directories to determine which tables are in the catalog.
>>>>>>>>>
>>>>>>>>>> is there a probability that the strategy used by HadoopCatalog is not compatible with the table managed by other catalogs?
>>>>>>>>>
>>>>>>>>> Yes, they are different: the spec's section on File System Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables> is what the HadoopTable implementation uses, whereas the other catalogs follow Metastore Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>.
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>> Fokko
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> According to our requirements, this function is for users who want to read Iceberg tables without relying on any catalog, and I think StaticTable is more flexible and clearer in semantics. For StaticTable, it's the user's responsibility to decide which metadata of the table to read. But for a read-only HadoopCatalog, the metadata may be decided by the catalog; is there a chance that the strategy used by HadoopCatalog is not compatible with tables managed by other catalogs?
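>>>>>>>>>>
>>>>>>>>>> For reference, the Java analogue of this read-only pattern looks roughly like the following (a sketch, not tested code; the metadata path is a placeholder). The user, not a catalog, decides exactly which metadata version to read:
>>>>>>>>>>
>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>     import org.apache.iceberg.BaseTable;
>>>>>>>>>>     import org.apache.iceberg.StaticTableOperations;
>>>>>>>>>>     import org.apache.iceberg.Table;
>>>>>>>>>>     import org.apache.iceberg.hadoop.HadoopFileIO;
>>>>>>>>>>
>>>>>>>>>>     // pin the table to one explicit metadata file
>>>>>>>>>>     String metadataLocation =
>>>>>>>>>>         "hdfs://namenode:8020/warehouse/db/t/metadata/v12.metadata.json";
>>>>>>>>>>     StaticTableOperations ops =
>>>>>>>>>>         new StaticTableOperations(metadataLocation, new HadoopFileIO(new Configuration()));
>>>>>>>>>>     Table table = new BaseTable(ops, "db.t");
>>>>>>>>>>
>>>>>>>>>> Commits on such a table are unsupported and it stays frozen at that metadata version, which is what makes the semantics clear.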
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think there are two ways to do this:
>>>>>>>>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only and throw an unsupported-operation exception for operations that manipulate tables.
>>>>>>>>>>> 2. Totally deprecate HadoopCatalog and add a StaticTable, as we did in pyiceberg and iceberg-rust.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, Renjie
>>>>>>>>>>>>
>>>>>>>>>>>> Are you suggesting that we refactor HadoopCatalog into a FileSystemCatalog to enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? This catalog would be read-only and would not support write operations.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, Ryan:
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for manipulating tables/catalogs given the limitations of different file systems. But I see that there are some users who want to read Iceberg tables without relying on any catalog; this is also the motivating use case for StaticTable in pyiceberg and iceberg-rust. Is there something similar in the Java implementation?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file-system-only tables (or "hadoop" tables) for years now because of major problems:
>>>>>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>>>>>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>>>>>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
>>>>>>>>>>>> * Alternative file names (for instance, when using metadata file compression) also break guarantees
>>>>>>>>>>>>
>>>>>>>>>>>> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common, and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
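>>>>>>>>>>>>
>>>>>>>>>>>> For anyone who hasn't looked at it recently, the strategy in the spec boils down to a rename-based commit, roughly the following (a simplified sketch of the idea, not the actual implementation; the version number is made up):
>>>>>>>>>>>>
>>>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>>>>     import org.apache.hadoop.fs.Path;
>>>>>>>>>>>>
>>>>>>>>>>>>     FileSystem fs = FileSystem.get(new Configuration());
>>>>>>>>>>>>     Path tmp = new Path("/warehouse/db/t/metadata/.tmp-v43.metadata.json");
>>>>>>>>>>>>     Path next = new Path("/warehouse/db/t/metadata/v43.metadata.json");
>>>>>>>>>>>>     // ... write the new table metadata to tmp ...
>>>>>>>>>>>>
>>>>>>>>>>>>     // the commit: only safe where rename is atomic and fails when the
>>>>>>>>>>>>     // destination exists, which HDFS guarantees but S3A and most local
>>>>>>>>>>>>     // file systems do not
>>>>>>>>>>>>     if (!fs.rename(tmp, next)) {
>>>>>>>>>>>>       throw new RuntimeException("Commit failed: version 43 already exists");
>>>>>>>>>>>>     }
>>>>>>>>>>>>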
>>>>>>>>>>>> What do other people think?
>>>>>>>>>>>>
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>
>>>>>>>>>>>> Xuanwo
>>>>>>>>>>>>
>>>>>>>>>>>> https://xuanwo.io/
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>
> --
> Ryan Blue
> Databricks