I agree with Fokko here.

I'm happy to improve the JDBC Catalog if needed (I plan to add new adapters
and maybe leverage Jdbi to support more RDBMSs).

Regards
JB

On Tue, Jul 30, 2024 at 10:22 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> Lisoda,
>
> In recent days, I've attempted to migrate hadoop_catalog to jdbc-catalog,
>> but I failed.
>
>
> Was this because the JDBC (or SQL) Catalog didn't work, or was the migration
> not feasible? If it's the former, I invite you to raise an issue on GitHub
> so we can see what's happening.
>
> Next to the HadoopCatalog, there is also the SQL-Catalog mentioned above.
> It is available in Java and PyIceberg, and is in flight for Rust. While the
> correctness of the HadoopCatalog depends on the guarantees of the underlying
> storage, with the SQLCatalog we can also move forward and implement features
> like multi-table transactions. PyIceberg relies heavily on the SQLCatalog
> with an in-memory database (SQLite) for integration tests.
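>
> For anyone who wants to try the same locally, here is a minimal sketch of
> standing up the JDBC-Catalog against a SQLite database. This assumes a
> SQLite JDBC driver is on the classpath; the catalog name and warehouse path
> are illustrative only.
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.iceberg.CatalogProperties;
>     import org.apache.iceberg.jdbc.JdbcCatalog;
>
>     public class SqlCatalogSketch {
>       public static void main(String[] args) {
>         Map<String, String> props = new HashMap<>();
>         // Any JDBC URL works here; note that a plain in-memory SQLite URL may
>         // create a fresh database per pooled connection, so a file-backed URL
>         // (e.g. jdbc:sqlite:/tmp/catalog.db) is often more reliable.
>         props.put(CatalogProperties.URI, "jdbc:sqlite::memory:");
>         props.put(CatalogProperties.WAREHOUSE_LOCATION, "/tmp/warehouse");
>
>         JdbcCatalog catalog = new JdbcCatalog();
>         catalog.initialize("test_catalog", props);
>       }
>     }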
>
> Since there is no consensus, I believe clarifying the spec and moving the
> HadoopCatalog to a separate package are the first two steps.
>
> Kind regards,
> Fokko
>
> On Tue, Jul 30, 2024 at 09:43, Gabor Kaszab <gaborkas...@apache.org> wrote:
>
>> Hey Iceberg Community,
>>
>> Sorry for being late to this conversation. I just wanted to share that
>> I'm against deprecating HadoopCatalog or moving it to tests. Currently,
>> Impala relies heavily on HadoopCatalog for its own tests, and I personally
>> find HadoopCatalog pretty handy for cross-engine experiments: my data is
>> already on HDFS, I write a table with engine A, check whether engine B can
>> read it, and I don't want to bother setting up any service to act as an
>> Iceberg catalog (HMS, for instance).
>>
>> I believe that even though we don't consider HadoopCatalog a
>> production-grade solution as it is now, it has its benefits for lightweight
>> experimentation.
>>
>>    - I'm +1 for keeping HadoopCatalog
>>    - We should emphasize that HDFS is the desired storage for
>>    HadoopCatalog (can we force this in the code?)
>>    - Apparently, there is a part of this community that is open to adding
>>    enhancements to HadoopCatalog to bring it closer to production readiness
>>    (lisoda). I don't think we should block these contributions.
>>    - If we say that the REST Catalog is preferred over HadoopCatalog, I
>>    think the Iceberg project should offer its own open-source solution
>>    available for everyone.
>>
>> Regards,
>> Gabor
>>
>> On Thu, Jul 25, 2024 at 9:04 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> There are ways to use object store or file system features to do this,
>>> but there are a lot of variations. Building implementations and trying to
>>> standardize each one is a lot of work. And then you still get a catalog
>>> that doesn't support important features.
>>>
>>> I don't think that this is a good direction to build for the Iceberg
>>> project. But I also have no objection to someone doing it in a different
>>> project that uses the Iceberg metadata format.
>>>
>>> On Tue, Jul 23, 2024 at 5:57 PM lisoda <lis...@yeah.net> wrote:
>>>
>>>>
>>>> Sir, regarding this point, we have some experience. In my view, as long
>>>> as the file system supports atomic single-file writes, where the file
>>>> becomes immediately visible upon the client's successful write operation,
>>>> that is sufficient: we can do without the rename operation as long as the
>>>> file system guarantees this. Of course, if the object storage system
>>>> supports mutual exclusion, we can also uniformly use the rename operation
>>>> for committing. In theory, this avoids having to provide a large number
>>>> of commit strategies for different file systems.
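>>>>
>>>> For illustration, a rough sketch of the two commit styles described
>>>> above, using the Hadoop FileSystem API (paths, payloads, and error
>>>> handling are hypothetical and simplified):
>>>>
>>>>     import org.apache.hadoop.fs.FSDataOutputStream;
>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>     import org.apache.hadoop.fs.Path;
>>>>     import org.apache.iceberg.exceptions.CommitFailedException;
>>>>
>>>>     public class CommitStyleSketch {
>>>>       // Style 1: atomic create-if-absent; safe when a successful
>>>>       // exclusive write is immediately visible to all readers.
>>>>       static void commitByExclusiveCreate(
>>>>           FileSystem fs, Path versionFile, byte[] metadata) throws Exception {
>>>>         try (FSDataOutputStream out = fs.create(versionFile, false /* no overwrite */)) {
>>>>           out.write(metadata);
>>>>         }
>>>>       }
>>>>
>>>>       // Style 2: write a temp file, then swap it in with an atomic
>>>>       // rename, as HadoopTableOperations does today.
>>>>       static void commitByRename(
>>>>           FileSystem fs, Path tmpFile, Path versionFile, byte[] metadata) throws Exception {
>>>>         try (FSDataOutputStream out = fs.create(tmpFile, true)) {
>>>>           out.write(metadata);
>>>>         }
>>>>         if (!fs.rename(tmpFile, versionFile)) {
>>>>           throw new CommitFailedException("Version file already exists: %s", versionFile);
>>>>         }
>>>>       }
>>>>     }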
>>>> ---- Replied Message ----
>>>> From: Jack Ye <yezhao...@gmail.com>
>>>> Date: 07/24/2024 02:52
>>>> To: dev@iceberg.apache.org
>>>> Subject: Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to
>>>> tests in 2.0
>>>> If we come up with a new storage-only catalog implementation that could
>>>> solve those limitations and also leverage the new features being developed
>>>> in object storage, would that be a potential alternative strategy? That
>>>> way, HadoopCatalog users would have a path forward with a storage-only
>>>> catalog that can still run on HDFS, and we could fully deprecate
>>>> HadoopCatalog.
>>>>
>>>> -Jack
>>>>
>>>> On Tue, Jul 23, 2024 at 10:00 AM Ryan Blue <b...@databricks.com.invalid>
>>>> wrote:
>>>>
>>>>> I don't think we would want to put this in a module with other catalog
>>>>> implementations. It has serious limitations and is actively discouraged,
>>>>> while the other catalog implementations still have value as either REST
>>>>> back-end catalogs or as regular catalogs for many users.
>>>>>
>>>>> On Tue, Jul 23, 2024 at 9:11 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> For some additional information, we also have some Iceberg HDFS users
>>>>>> on EMR. Those are mainly users that have long-running Hadoop and HBase
>>>>>> installations. They typically refresh their installation every 1-2 years.
>>>>>> From my understanding, they use S3 for data storage, but metadata is kept
>>>>>> in the local HDFS cluster, thus HadoopCatalog works well for them.
>>>>>>
>>>>>> I remember we discussed moving all catalog implementations currently in
>>>>>> the main repo to a separate iceberg-catalogs repo. Could we do this
>>>>>> move as a part of that effort?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Tue, Jul 23, 2024 at 8:46 AM Ryan Blue <
>>>>>> b...@databricks.com.invalid> wrote:
>>>>>>
>>>>>>> Thanks for the context, lisoda. I agree that it's good to understand
>>>>>>> the issues you're facing with the HadoopCatalog. One follow-up question
>>>>>>> that I have is what the underlying storage is. Are you using HDFS for
>>>>>>> those 30,000 customers?
>>>>>>>
>>>>>>> I think you're right that migrating is a challenge. Because there is
>>>>>>> no catalog requirement, it's hard to make sure you have all of the
>>>>>>> writers migrated. I think that means we do need to have a plan or
>>>>>>> recommendation for people currently using this catalog in production,
>>>>>>> but it also puts more pressure on us to deprecate this catalog and
>>>>>>> avoid more people having this problem.
>>>>>>>
>>>>>>> I think it's a good idea to make the spec change, which we have
>>>>>>> agreement on, and to ensure that the FS catalog and table operations
>>>>>>> are properly deprecated to show that they should not be used. I'm not
>>>>>>> sure whether there is support in the community for moving the
>>>>>>> implementation into a new iceberg-hadoop module, but at a minimum we
>>>>>>> can't just remove this right away. I think that a separate
>>>>>>> iceberg-hadoop module would make the most sense.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 11:09 PM lisoda <lis...@yeah.net> wrote:
>>>>>>>
>>>>>>>> Hi team,
>>>>>>>>      I am not a PMC member, just a regular user. Instead of
>>>>>>>> discussing whether HadoopCatalog needs to continue to exist, I'd like
>>>>>>>> to share a more practical issue.
>>>>>>>>
>>>>>>>>     We currently serve over 30,000 customers, all of whom use
>>>>>>>> Iceberg to store their foundational data, and all business analyses
>>>>>>>> are conducted on Iceberg. However, all the Iceberg tables use
>>>>>>>> hadoop_catalog. At least, this has been the case since I started
>>>>>>>> working with our production environment.
>>>>>>>>
>>>>>>>>     In recent days, I've attempted to migrate hadoop_catalog to
>>>>>>>> jdbc-catalog, but I failed. We store 2PB of data, and replacing the
>>>>>>>> current catalogs has become an almost impossible task. Users not only
>>>>>>>> create hadoop_catalog tables through Spark, they also continuously use
>>>>>>>> third-party OLAP systems, Flink, and other means to write data into
>>>>>>>> Iceberg in the form of hadoop_catalog. Given this situation, we can
>>>>>>>> only continue to fix hadoop_catalog and provide services to customers.
>>>>>>>>
>>>>>>>>     I understand that the community wants to make a big push into
>>>>>>>> the REST catalog, and I agree with the direction the community is
>>>>>>>> going. But considering that there might be a significant number of
>>>>>>>> users facing similar issues, can we at least retain a module similar
>>>>>>>> to iceberg-hadoop to extend hadoop_catalog? If it is removed, we won't
>>>>>>>> be able to continue providing services to customers. So, if possible,
>>>>>>>> please consider this option.
>>>>>>>>
>>>>>>>> Thank you all.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> lisoda
>>>>>>>>
>>>>>>>> At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thank you for bringing this up, Ryan. I have also been in the camp
>>>>>>>> of saying HadoopCatalog is not recommended, but after thinking about
>>>>>>>> this more deeply last night, I now have mixed feelings about this
>>>>>>>> topic. Just to comment on the reasons you listed first:
>>>>>>>>
>>>>>>>> * For reasons 1 & 2, it looks like the root cause is that people try
>>>>>>>> to use HadoopCatalog outside native HDFS because there are HDFS
>>>>>>>> connectors to other storages, like S3AFileSystem. However, the norm
>>>>>>>> for such usage has been that those connectors do not strictly follow
>>>>>>>> HDFS semantics, and it is assumed that people acknowledge the
>>>>>>>> implications of such usage and accept the risk. For example,
>>>>>>>> S3AFileSystem was there even before S3 was strongly consistent, but
>>>>>>>> people have been using it to write files.
>>>>>>>>
>>>>>>>> * For reason 3, there are multiple catalogs that do not support all
>>>>>>>> operations (e.g. Glue for atomic table rename) and people still
>>>>>>>> widely use them.
>>>>>>>>
>>>>>>>> * For reason 4, I see that more as a missing feature. More features
>>>>>>>> could definitely be developed in that catalog implementation.
>>>>>>>>
>>>>>>>> So the key question to me is: how can we prevent people from using
>>>>>>>> HadoopCatalog outside native HDFS? We know HadoopCatalog is popular
>>>>>>>> because it is a storage-only solution. For object storages
>>>>>>>> specifically, HadoopCatalog is not suitable for two reasons:
>>>>>>>>
>>>>>>>> (1) file writes do not enforce mutual exclusion, and thus cannot
>>>>>>>> enforce Iceberg's optimistic concurrency requirement (i.e. cannot do
>>>>>>>> an atomic compare-and-swap)
>>>>>>>>
>>>>>>>> (2) the directory-based design is not preferred in object storage
>>>>>>>> and results in bad performance.
>>>>>>>>
>>>>>>>> However, now that I look at these two issues, they are getting outdated.
>>>>>>>>
>>>>>>>> (1) object storage is starting to enforce file mutual exclusion.
>>>>>>>> GCS supports file generation numbers [1] that increment
>>>>>>>> monotonically, and you can use x-goog-if-generation-match [2] to
>>>>>>>> perform an atomic swap. A similar feature [3] exists in Azure Blob
>>>>>>>> Storage. I cannot speak for the S3 team roadmap, but Amazon S3 is
>>>>>>>> clearly falling behind in this domain, and with market competition,
>>>>>>>> it is very clear that similar features will come in the reasonably
>>>>>>>> near future.
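>>>>>>>>
>>>>>>>> As a purely hypothetical sketch of such a compare-and-swap against
>>>>>>>> the GCS XML API (bucket, object name, and auth are made up or
>>>>>>>> omitted here):
>>>>>>>>
>>>>>>>>     import java.net.URI;
>>>>>>>>     import java.net.http.HttpClient;
>>>>>>>>     import java.net.http.HttpRequest;
>>>>>>>>     import java.net.http.HttpResponse;
>>>>>>>>
>>>>>>>>     public class GcsSwapSketch {
>>>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>>>         HttpClient client = HttpClient.newHttpClient();
>>>>>>>>         long observedGeneration = 42L; // generation seen when we read the pointer
>>>>>>>>
>>>>>>>>         // The PUT succeeds only if the object is still at the observed generation.
>>>>>>>>         HttpRequest swap = HttpRequest.newBuilder()
>>>>>>>>             .uri(URI.create("https://storage.googleapis.com/my-bucket/tbl/version-hint.text"))
>>>>>>>>             .header("x-goog-if-generation-match", Long.toString(observedGeneration))
>>>>>>>>             .PUT(HttpRequest.BodyPublishers.ofString("v3.metadata.json"))
>>>>>>>>             .build();
>>>>>>>>
>>>>>>>>         HttpResponse<String> resp = client.send(swap, HttpResponse.BodyHandlers.ofString());
>>>>>>>>         if (resp.statusCode() == 412) {
>>>>>>>>           System.err.println("Lost the race: refresh metadata and retry the commit");
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }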
>>>>>>>>
>>>>>>>> (2) directory buckets are becoming the norm. Amazon S3 announced
>>>>>>>> directory buckets at re:Invent 2023 [4]; they do not have the same
>>>>>>>> performance limitation even if you have deeply nested folders and
>>>>>>>> many objects in a folder. GCS also has a similar feature, launched in
>>>>>>>> preview [5] right now. Azure has already had this feature since
>>>>>>>> 2021 [6].
>>>>>>>>
>>>>>>>> With these new developments in the industry, a storage-only Iceberg
>>>>>>>> catalog becomes very attractive. It is simple, with only one service
>>>>>>>> dependency. It can safely perform an atomic compare-and-swap. It is
>>>>>>>> performant without the need to worry about folder and file
>>>>>>>> organization. If you want additional features for things like access
>>>>>>>> control, there are also integrations like access grants [7] that can
>>>>>>>> do it in a very scalable way.
>>>>>>>>
>>>>>>>> I know the direction in the community so far is to go with the REST
>>>>>>>> catalog, and I am personally a big advocate for that. However, that
>>>>>>>> requires either building a full REST catalog or choosing a catalog
>>>>>>>> vendor that supports REST. There are many capabilities that REST
>>>>>>>> would unlock, but those are visions that I expect will take the
>>>>>>>> community many years to drive consensus on and build. If I am the CTO
>>>>>>>> of a small company and I just want an Iceberg data lake(house) right
>>>>>>>> now, do I choose REST, or do I choose (or even just build) a
>>>>>>>> storage-only Iceberg catalog? I feel I would actually choose the
>>>>>>>> latter.
>>>>>>>>
>>>>>>>> Going back to the discussion points, my current take on this topic
>>>>>>>> is:
>>>>>>>>
>>>>>>>> (1) +1 for clarifying in the spec that HadoopCatalog should only be
>>>>>>>> used with HDFS.
>>>>>>>>
>>>>>>>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by
>>>>>>>> default (e.g. fail if using S3A), but we should allow a feature flag
>>>>>>>> to unblock the usage so that people can use it after understanding the
>>>>>>>> implications and risks, just like how people use S3A today (see the
>>>>>>>> sketch after this list).
>>>>>>>>
>>>>>>>> (3) +0 for removing HadoopCatalog from the core library. It could
>>>>>>>> be in a different module like iceberg-hdfs if that is more suitable.
>>>>>>>>
>>>>>>>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a
>>>>>>>> valid use case for Iceberg. With measures 1-3 above in place, people
>>>>>>>> who actually have an HDFS use case should be able to continue to
>>>>>>>> innovate on and optimize the HadoopCatalog implementation. Although
>>>>>>>> "HDFS is becoming much less common", looking at GitHub issues and
>>>>>>>> discussion forums, it still has a pretty big user base.
>>>>>>>>
>>>>>>>> (5) In general, I propose we separate the discussion of
>>>>>>>> HadoopCatalog from that of a "storage-only catalog" that also deals
>>>>>>>> with other object stores. With these latest industry developments, we
>>>>>>>> should evaluate the direction of building a storage-only Iceberg
>>>>>>>> catalog and see if the community has an interest in that. I could
>>>>>>>> help raise a thread about it after this discussion is closed.
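>>>>>>>>
>>>>>>>> Regarding (2), the guard could be as small as the following sketch;
>>>>>>>> the flag name is made up for illustration and is not an existing
>>>>>>>> Iceberg property:
>>>>>>>>
>>>>>>>>     import java.net.URI;
>>>>>>>>     import java.util.Locale;
>>>>>>>>     import org.apache.iceberg.exceptions.ValidationException;
>>>>>>>>
>>>>>>>>     public class WarehouseSchemeCheck {
>>>>>>>>       // Hypothetical opt-in flag for non-HDFS usage.
>>>>>>>>       static final String UNSAFE_FLAG = "hadoop-catalog.allow-non-hdfs";
>>>>>>>>
>>>>>>>>       static void check(String warehouseLocation, boolean allowUnsafe) {
>>>>>>>>         String scheme = URI.create(warehouseLocation).getScheme();
>>>>>>>>         if (!allowUnsafe && scheme != null && !"hdfs".equals(scheme.toLowerCase(Locale.ROOT))) {
>>>>>>>>           throw new ValidationException(
>>>>>>>>               "HadoopCatalog is only safe on HDFS; got scheme %s (set %s to override)",
>>>>>>>>               scheme, UNSAFE_FLAG);
>>>>>>>>         }
>>>>>>>>       }
>>>>>>>>     }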
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jack Ye
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>>>>>>>> [2]
>>>>>>>> https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>>>>>>>> [3]
>>>>>>>> https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>>>>>>>> [4]
>>>>>>>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>>>>>>>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>>>>>>>> [6]
>>>>>>>> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>>>>>>>> [7]
>>>>>>>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <
>>>>>>>> etudenhoef...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> +1 on deprecating now and removing them from the codebase with
>>>>>>>>> Iceberg 2.0
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <
>>>>>>>>> ajanthab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1 on deprecating the `File System Tables` section in the spec and
>>>>>>>>>> `HadoopCatalog`/`HadoopTableOperations` in code for now,
>>>>>>>>>> and removing them permanently in the 2.0 release.
>>>>>>>>>>
>>>>>>>>>> For testing, we can use `InMemoryCatalog` as others mentioned.
>>>>>>>>>>
>>>>>>>>>> I am not sure about moving them to tests or keeping them only for
>>>>>>>>>> HDFS, because that leads to confusion for existing users of the
>>>>>>>>>> Hadoop catalog.
>>>>>>>>>>
>>>>>>>>>> I wanted to have it deprecated two years ago
>>>>>>>>>> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
>>>>>>>>>> and I remember that we discussed it in a sync at the time and left
>>>>>>>>>> it as is. Also, when a user recently brought up LockManager and
>>>>>>>>>> refactoring HadoopTableOperations in Slack
>>>>>>>>>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>,
>>>>>>>>>> I asked them to open this discussion on the mailing list so that we
>>>>>>>>>> can conclude it once and for all.
>>>>>>>>>>
>>>>>>>>>> - Ajantha
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <
>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Ryan and others,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for bringing this up. I would be in favor of removing the
>>>>>>>>>>> HadoopTableOperations, mostly because of the reasons you already
>>>>>>>>>>> mentioned, but also because it is not fully in line with the first
>>>>>>>>>>> principles of Iceberg (being object-store native), as it uses
>>>>>>>>>>> file listing.
>>>>>>>>>>>
>>>>>>>>>>> I think we should deprecate the HadoopTables to get their users'
>>>>>>>>>>> attention. I would be reluctant to move it to tests just for
>>>>>>>>>>> testing purposes; I'd rather remove it and replace its use in
>>>>>>>>>>> tests with the InMemoryCatalog.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the StaticTable, this is an easy way to get a read-only
>>>>>>>>>>> table by directly pointing to the metadata. This also lives in
>>>>>>>>>>> Java under StaticTableOperations
>>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>>>>>>>>>>> It isn't a full-blown catalog where you can list
>>>>>>>>>>> {tables,schemas}, update tables, etc. As ZENOTME already pointed
>>>>>>>>>>> out, it is all up to the user; for example, there is no listing of
>>>>>>>>>>> directories to determine which tables are in the catalog.
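>>>>>>>>>>>
>>>>>>>>>>> For reference, loading such a read-only table in Java takes only a
>>>>>>>>>>> few lines; a rough sketch with a hypothetical metadata path:
>>>>>>>>>>>
>>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>     import org.apache.iceberg.BaseTable;
>>>>>>>>>>>     import org.apache.iceberg.StaticTableOperations;
>>>>>>>>>>>     import org.apache.iceberg.Table;
>>>>>>>>>>>     import org.apache.iceberg.hadoop.HadoopFileIO;
>>>>>>>>>>>
>>>>>>>>>>>     public class StaticTableSketch {
>>>>>>>>>>>       public static void main(String[] args) {
>>>>>>>>>>>         String metadataLocation = "s3://bucket/db/tbl/metadata/v3.metadata.json";
>>>>>>>>>>>         StaticTableOperations ops = new StaticTableOperations(
>>>>>>>>>>>             metadataLocation, new HadoopFileIO(new Configuration()));
>>>>>>>>>>>         Table table = new BaseTable(ops, metadataLocation);
>>>>>>>>>>>         // Scans work as usual; commits fail because the operations are static.
>>>>>>>>>>>         table.newScan().planFiles();
>>>>>>>>>>>       }
>>>>>>>>>>>     }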
>>>>>>>>>>>
>>>>>>>>>>>> is there a chance that the strategy used by HadoopCatalog
>>>>>>>>>>>> is not compatible with tables managed by other catalogs?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes, they are different. You can see in the spec the section on
>>>>>>>>>>> File System tables
>>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>,
>>>>>>>>>>> which is used by the HadoopTable implementation, whereas the other
>>>>>>>>>>> catalogs follow the Metastore Tables
>>>>>>>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>
>>>>>>>>>>> section.
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Fokko
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <
>>>>>>>>>>> st810918...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> According to our requirements, this function is for users who
>>>>>>>>>>>> want to read Iceberg tables without relying on any catalog; I
>>>>>>>>>>>> think StaticTable may be more flexible and clearer in semantics.
>>>>>>>>>>>> For StaticTable, it's the user's responsibility to decide which
>>>>>>>>>>>> table metadata to read. But for a read-only HadoopCatalog, the
>>>>>>>>>>>> metadata would be decided by the catalog: is there a chance that
>>>>>>>>>>>> the strategy used by HadoopCatalog is not compatible with tables
>>>>>>>>>>>> managed by other catalogs?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think there are two ways to do this:
>>>>>>>>>>>>> 1. As Xuanwo said, refactor HadoopCatalog to be read-only, and
>>>>>>>>>>>>> throw an UnsupportedOperationException for operations that
>>>>>>>>>>>>> manipulate tables.
>>>>>>>>>>>>> 2. Fully deprecate HadoopCatalog, and add StaticTable as we
>>>>>>>>>>>>> did in PyIceberg and iceberg-rust.
>>>>>>>>>>>>> did in pyiceberg or iceberg-rust.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, Renjie
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are you suggesting that we refactor HadoopCatalog into a
>>>>>>>>>>>>>> FileSystemCatalog to enable direct reading from file systems
>>>>>>>>>>>>>> like HDFS, S3, and Azure Blob Storage? This catalog would be
>>>>>>>>>>>>>> read-only and would not support write operations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, Ryan:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is
>>>>>>>>>>>>>> dangerous for manipulating tables/catalogs given the
>>>>>>>>>>>>>> limitations of different file systems. But I see that there are
>>>>>>>>>>>>>> some users who want to read Iceberg tables without relying on
>>>>>>>>>>>>>> any catalog; this is also the motivating use case for
>>>>>>>>>>>>>> StaticTable in PyIceberg and iceberg-rust. Is there something
>>>>>>>>>>>>>> similar in the Java implementation?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There has been some recent discussion about improving
>>>>>>>>>>>>>> HadoopTableOperations and the catalog based on those tables,
>>>>>>>>>>>>>> but we've discouraged using file-system-only tables (or
>>>>>>>>>>>>>> "hadoop" tables) for years now because of major problems:
>>>>>>>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local
>>>>>>>>>>>>>> file systems, S3, and other common object stores are unsafe
>>>>>>>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS,
>>>>>>>>>>>>>> people use the tables in unsafe situations
>>>>>>>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename
>>>>>>>>>>>>>> and drop table, which are commonly used in data engineering
>>>>>>>>>>>>>> * Alternative file names (for instance when using metadata
>>>>>>>>>>>>>> file compression) also break guarantees
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While these tables are useful for testing in non-production
>>>>>>>>>>>>>> scenarios, I think it's misleading to have them in the core 
>>>>>>>>>>>>>> module because
>>>>>>>>>>>>>> there's an appearance that they are a reasonable choice. I 
>>>>>>>>>>>>>> propose we
>>>>>>>>>>>>>> deprecate the HadoopTableOperations and HadoopCatalog 
>>>>>>>>>>>>>> implementations and
>>>>>>>>>>>>>> move them to tests the next time we can make breaking API 
>>>>>>>>>>>>>> changes (2.0).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think we should also consider similar fixes to the table
>>>>>>>>>>>>>> spec. It currently describes how HadoopTableOperations works, 
>>>>>>>>>>>>>> which does
>>>>>>>>>>>>>> not work in object stores or local file systems. HDFS is 
>>>>>>>>>>>>>> becoming much less
>>>>>>>>>>>>>> common and I propose that we note that the strategy in the spec 
>>>>>>>>>>>>>> should ONLY
>>>>>>>>>>>>>> be used with HDFS.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do other people think?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Xuanwo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://xuanwo.io/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Databricks
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>
