Hi,

Here are my thoughts:

- The value of ETag is (as far as I know) defined as an opaque string by the 
specification, meaning the client shouldn’t interpret or assign any 
significance to it, regardless of what the server specifies. It’s best to avoid 
the client giving any particular meaning to the ETag value.
- One major advantage of the header approach compared to other methods is that 
if an update has occurred, the updated content can be immediately included in 
the response without requiring an additional request. This saves one 
request-response round-trip (although it’s also possible to define a separate 
endpoint with the same functionality).
- Since the Iceberg REST catalog server is effectively a type of HTTP server, 
at least in theory, it may be expected to handle HTTP cache and 
validation-related processes. The header approach can be seen as leveraging 
this mechanism appropriately.
- The header approach doesn’t have to be limited to the 
/v1/{prefix}/namespaces/{namespace}/tables/{table} endpoint. It could also be 
applied to all GET-based endpoints, though this might broaden the scope 
significantly.
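
A minimal sketch of the client side of the header approach, assuming the loadTable response carried an ETag header (this is only what the thread proposes, not something the current spec defines). The names `TableCache` and `fetch`, and the `(status, headers, body)` tuple shape, are hypothetical and chosen for illustration. The client stores the ETag verbatim and never parses it:

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: takes request headers, returns (status, headers, body).
Fetch = Callable[[Dict[str, str]], Tuple[int, Dict[str, str], Optional[dict]]]


class TableCache:
    """Caches one table's metadata and revalidates it with If-None-Match."""

    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self.etag: Optional[str] = None  # opaque: stored and echoed back, never parsed
        self.metadata: Optional[dict] = None

    def refresh(self) -> bool:
        """Revalidate with one round-trip; return True if the metadata changed."""
        headers = {"If-None-Match": self.etag} if self.etag is not None else {}
        status, resp_headers, body = self._fetch(headers)
        if status == 304:
            return False  # cached copy is still current, no body was sent
        if status == 200:
            # The updated content arrives in the same response: no second request.
            self.etag = resp_headers.get("ETag")
            self.metadata = body
            return True
        raise RuntimeError(f"unexpected status {status}")
```

With this shape, a freshness check and a reload are the same call, which is the round-trip saving described in the second point.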

Thank you.


-----Original Message-----
From: "Shani Elharrar" <sh...@upsolver.com.invalid>
To: <dev@iceberg.apache.org>;
Cc: <dev@iceberg.apache.org>;
Sent: 2024-11-18 (Mon) 16:21:16 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

Using the metadata file name as the ETag is a nice way to go. In that case, 
adding HEAD method support to the loadTable endpoint would return the latest 
metadata pointer, which could be used to support "isLatest" without returning 
the body. It could also be leveraged to return the latest metadata location 
of the table.
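
A rough sketch of such a HEAD-based check, using Python's standard `urllib`. It assumes, hypothetically, that a HEAD request to the loadTable URL answers with an ETag header identifying the latest metadata; that behavior is only what is proposed here, not something the current spec guarantees. The function name `is_latest` and the injectable `opener` parameter are illustrative:

```python
import urllib.request
from typing import Callable, Optional


def is_latest(table_url: str, cached_etag: Optional[str],
              opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the server's current ETag equals the cached one.

    Assumption: HEAD on the loadTable URL returns an ETag header for the
    latest metadata (hypothetical behavior proposed in this thread).
    """
    request = urllib.request.Request(table_url, method="HEAD")
    with opener(request) as response:
        current = response.headers.get("ETag")
    # Compare the opaque strings as-is; a missing ETag counts as "not latest".
    return current is not None and current == cached_etag
```

Note that when the answer is "not latest", the client still needs a second request to fetch the new metadata, unlike a conditional GET.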

Shani.
On 18 Nov 2024, at 8:52, Yufei Gu <flyrain...@gmail.com> wrote:
Hi Taeyun,
Thank you for the clear explanation.
I agree that the ETag solution is more suitable. If we were going that way, I'd 
propose a customized version number as an ETag—for instance, leveraging the 
metadata.json file name as the identifier.
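
One way a server might derive such an ETag, sketched under assumptions: the function names `etag_for` and `load_table_status` are hypothetical, the example location is made up, and the If-None-Match handling is simplified (no weak `W/` validators, and entity tags are assumed not to contain commas):

```python
from typing import Optional


def etag_for(metadata_location: str) -> str:
    """Build a quoted ETag from the metadata file name (last path segment)."""
    file_name = metadata_location.rstrip("/").rsplit("/", 1)[-1]
    return f'"{file_name}"'


def load_table_status(metadata_location: str, if_none_match: Optional[str]) -> int:
    """Return 304 if the client's validator matches the current ETag, else 200."""
    if if_none_match is None:
        return 200
    current = etag_for(metadata_location)
    # If-None-Match may carry "*" or a comma-separated list of entity tags.
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    if "*" in candidates or current in candidates:
        return 304
    return 200
```

Only the server knows how the value is built; a client would still send it back unchanged rather than read the file name out of it.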
To summarize, HTTP caching relies on headers (e.g., ETag or Last-Modified) to 
validate whether a version is up-to-date, whereas the alternative approach 
proposed above uses an additional parameter for verification. From my 
perspective, there isn’t a fundamental difference between the two, so I’m OK 
with either.
A couple of points to note:
- Both approaches would require changes to the "loadTable" endpoint.
- A minor advantage of HTTP caching is that it integrates seamlessly with 
  browsers, but since most clients of the Iceberg REST catalog aren’t browsers, 
  this may not be a significant factor.
- I’d also recommend considering the requirement to retrieve multiple tables 
  (e.g., all tables under a namespace, or a list of table names) from the 
  catalog. This requires a new endpoint and may not work with HTTP caching.

Let me know your thoughts or if there’s anything else to consider.
Yufei





On Sun, Nov 17, 2024 at 6:43 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
Hi,

To Gabor:
It doesn’t seem necessary to interpret HTTP caching literally in this context.
Simply using the HTTP headers defined by HTTP caching to check the freshness 
of metadata should be sufficient.
There’s no requirement for the client to duplicate or store cached HTTP 
responses.

To Yufei:
As I understand it, the client doesn’t send its own timestamp but instead uses 
the timestamp originally provided by the server in the Last-Modified header.
Therefore, clock synchronization issues should not be a concern.

Here’s the general flow of HTTP cache validation based on If-Modified-Since:

- Client: initial request:
    GET (url) HTTP/1.1
- Server response:
    HTTP/1.1 200 OK
    Last-Modified: (date1)
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (with response body)
- Client: validation request:
    GET (url) HTTP/1.1
    If-Modified-Since: (date1)
- Server response (if unchanged):
    HTTP/1.1 304 Not Modified
    Last-Modified: (date1)
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (without response body)
- Server response (if updated):
    HTTP/1.1 200 OK
    Last-Modified: (date2)
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (with response body)

However, using time-based freshness checks can present challenges, such as 
parsing time formats or synchronizing file update times across servers.
To address these issues, HTTP cache validation based on ETag is also defined 
in the specification.

Here’s the flow for ETag-based validation:

- Client: initial request:
    GET (url) HTTP/1.1
- Server response:
    HTTP/1.1 200 OK
    ETag: "(arbitrary string 1 generated by the server)"
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (with response body)
- Client: validation request:
    GET (url) HTTP/1.1
    If-None-Match: "(arbitrary string 1 generated by the server)"
- Server response (if unchanged):
    HTTP/1.1 304 Not Modified
    ETag: "(arbitrary string 1 generated by the server)"
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (without response body)
- Server response (if updated):
    HTTP/1.1 200 OK
    ETag: "(arbitrary string 2 generated by the server)"
    Cache-Control: no-store, no-cache, max-age=0, must-revalidate, proxy-revalidate
    (with response body)

The server can choose to use either If-Modified-Since or ETag for freshness 
validation.
Alternatively, to simplify the implementation related to the Iceberg REST 
catalog, it might make sense to define only the more accurate ETag-based 
validation in the spec.
For reference, RFC 9110 recommends specifying both ETag and Last-Modified. 
When both are provided, ETag takes precedence.

Note on Cache-Control Headers:
The Cache-Control values in the examples above are intended to ensure that the 
client validates freshness with the server on every request. Writing the 
header in this extended format is primarily to accommodate outdated HTTP/1.1 
implementations. However, under the HTTP/1.1 specification, the following is 
sufficient:

    Cache-Control: no-cache

That’s all for now.

Thank you.

-----Original Message-----
From: "Yufei Gu" <flyrain...@gmail.com>
To: <dev@iceberg.apache.org>;
Cc:
Sent: 2024-11-16 (Sat) 02:51:05 (UTC+09:00)
Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest

How does HTTP caching handle desynchronized clocks between clients and the 
server?

At t0, the client gets the latest table version.
At t1, the server makes a new commit.
At t2, the client sends a request with a timestamp t2, but due to 
desynchronization, it refers to t0.
The server may reply with 304 Not Modified, causing the client to think its 
cache is up-to-date and miss the commit at t1.

Yufei

On Fri, Nov 15, 2024 at 6:37 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
Hi All,

First 
of all it's great to see that there are others who could benefit from giving a 
solution to this problem. I appreciate all the comments and feedback so far.

There were a number of different opinions, so let me start with summarizing 
the different topics that came up:

New endpoint vs using an existing endpoint:
Based on the answers (Fokko, Yufei) I had the impression that we should be 
careful when adding new REST endpoints, and we should examine the re-use of 
existing endpoints first. Let's do that then, and in case we don't find it 
feasible then we can still fall back to any of my initial proposals 
(isLatest() or metadataLocation()).

Granularity of freshness checks:
It was brought up (Dmitri) that we might not want to do the metadata freshness 
checks solely based on metadata location, but we should consider doing more 
granular freshness checks. I personally don't see much benefit of designing 
this solution like that, TBH, but seeing some use-cases could help us 
understand the motivation here.
Let me share my opinion on some of the arguments:

"A change in metadata location does not necessarily mean a change in metadata 
content"
AFAIK whenever Iceberg creates a new metadata file there is some change in the 
metadata itself. There might not be a new snapshot, though in the cases of e.g. 
a schema/partition evolution. But even in these cases triggering a table reload 
could make sense to me (e.g. answering SHOW CREATE TABLE and similar queries). 
Additionally, I'd assume the number of metadata location changes that don't 
create a new snapshot is too negligible to optimize for.
Dmitri, let me know if I misunderstood something.

"it may still be beneficial to permit the client to ask for changes to specific 
areas of metadata"
This seems like a use-case that the partial metadata loading proposal could 
solve. To identify the need to load a specific part of the metadata with 
partial metadata loading seems an overkill to design with my proposal, if this 
is what you have in mind. Also I found that the partial metadata loading 
proposal faces serious headwinds, so I wouldn't rely on it at the moment.

Re-using tableExists
I think there is a consensus here that tableExists returning a metadata 
location could work but seems more like a workaround and could be misleading 
for the users.

Partial metadata loading could solve this:
(Yufei) I agree, it would be perfect for my use-case and I'm following the 
discussion on the proposal. However, for me it seems, as I wrote above, that 
the proposal faces serious headwinds now and I honestly wouldn't expect a 
solution in the short term. But solving the freshness problems is a more urgent 
thing to solve, not just for myself and Impala but apparently to many other 
stakeholders in the community according to the interest on this thread.
Hence, I propose to come up with a separate solution for freshness checks, and 
we can still move to using partial metadata loading once that's out.

Use HTTPCache and If-Modified-Since with loadTable
This solution seems to do the trick for us. Let me do some research myself to 
see if there are any difficulties implementing this. Currently, I have more 
questions than answers wrt this approach :)
- The initial problem is to answer freshness questions for the cached tables on 
  the client side. If we introduce HttpCaching wouldn't we introduce the same 
  problem but on a different level of representation. We'd then need to decide 
  the freshness/staleness of the cached data in the HTTP layer.
- If we cache the HTTP responses for a loadTable then we essentially cache the 
  content of the metadata.jsons including the snapshot and metadata log and 
  everything, plus the snapshot list (and I think the manifests for the latest 
  snapshot). I believe that the size of this can easily reach the low megabytes 
  range in memory, so in total keeping them in the HTTP Cache for all the 
  tables we have queried can easily mean that we keep a couple of GBs in memory 
  just for this purpose.
  For engines that already cache table metadata wouldn't this mean that we will 
  cache some parts of the metadata redundantly?
- How would we decide what is the max-age of a cached table metadata in the 
  HTTP Cache? Would it be configurable so that each engine could use whatever 
  it prefers?

Sorry if any of the questions doesn't make sense, I just want to make sure I 
understand all the aspects of this approach.

An additional topic I have in mind:
REST catalog vs other catalogs:
Now we are focusing our discussion on the REST spec, but I think it would be 
beneficial to extend our focus and cover other catalog implementations too. I 
don't think that this problem of data freshness is specific to REST catalog, it 
could affect any table in any other catalog too.
I'll continue my investigation wrt the proposals, I just wanted to flush out 
and sum up what we have now before the weekend.

Regards,
Gabor

On Fri, Nov 15, 2024 at 10:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi,

I like the idea and it makes sense. As soon as it's clearly stated in
the spec (using If-Modified-Since header and 304 status code), it
looks good to me.

Thanks !
Regards
JB

On Fri, Nov 15, 2024 at 1:58 AM Taeyun Kim <taeyun....@innowireless.com
> wrote:
>
> Hi,
>
> (Apologies if this email is a duplicate. This is my third attempt.)
>
> I also need a way to ensure that my table data is up-to-date. For now, I’m
> handling this by setting an expiration period after which I fetch the data
> again, regardless of its freshness.
>
> Here are my thoughts on the current suggestions. Please correct me if I've
> misunderstood any of the points.
>
> - isLatest(): This function could be inefficient since it would require an
>   additional round-trip to fetch the metadata if it’s not up-to-date. This
>   would result in two round-trips overall, which seems suboptimal.
> - metadataLocation(): This has a similar issue as isLatest(). BTW, according
>   to the REST catalog API documentation for LoadTableResult schema, it
>   states, "Clients can check whether metadata has changed by comparing
>   metadata locations after the table has been created."
>   (https://github.com/apache/iceberg/blob/3659ded18d50206576985339bd55cd82f5e200cc/open-api/rest-catalog-open-api.yaml#L3175)
>   This suggests that if the metadata location has changed, the metadata can
>   be considered updated.
> - tableExists(): Based on the name, this function seems to serve a different
>   purpose.
>
> Here is my suggestion:
>
> Since HTTP has built-in caching features
> (https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching), and REST
> catalogs operate over HTTP, it seems natural to leverage HTTP caching
> mechanisms. For example, HTTP includes the If-Modified-Since header and the
> 304 Not Modified status code. Using this approach, we could achieve data
> freshness with a single round-trip, fetching updated data only if there are
> modifications.
>
> What do you think about defining the spec in this direction?
>
> Thank you.
>
> -----Original Message-----
> From: "Yufei Gu" <flyrain...@gmail.com>
> To: <dev@iceberg.apache.org>;
> Cc:
> Sent: 2024-11-13 (Wed) 03:43:24 (UTC+09:00)
> Subject: Re: [DISCUSS] REST: Way to query if metadata pointer is the latest
>
> Hi Gamber,
>
> Thanks for the proposal! Impala isn’t unique in needing this; I've seen
> similar requirements from other engines.
>
> As others pointed out, using the “tableExists” endpoint seems like a
> workaround. I don't consider it a permanent way forward. We could address
> this by either modifying the current load table endpoint or introducing a
> new one, but ideally, we should avoid adding endpoints for every specific
> need. With that, partial metadata loading seems like a strong approach here,
> we will need certain agreement though. I'd suggest the community consider
> the use cases seriously. We need a way forward.
>
> I’m also not too concerned about using metadata file paths to verify the
> latest table version; clients can simply extract metadata filenames, which
> include the UUID.
>
> Yufei
>
> On Tue, Nov 12, 2024 at 7:46 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> Hi Fokko
>
> I like the idea, but I think it's more a workaround and could be
> confusing for users :)
>
> Regards
> JB
>
> On Tue, Nov 12, 2024 at 2:53 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey Gabor,
> >
> > Thanks for raising this. While reading this, my first thought is to
> > leverage the `tableExists` operation:
> > https://github.com/apache/iceberg/blob/e3f39972863f891481ad9f5a559ffef093976bd7/open-api/rest-catalog-open-api.yaml#L1129-L1160
> >
> > This doesn't return anything today, but we could return a payload to the
> > latest metadata.json.
> >
> > Looking forward to what others think.
> >
> > Kind regards,
> > Fokko
> >
> > On Tue, 12 Nov 2024 at 14:33, Shani Elharrar <sh...@upsolver.com.invalid>
> > wrote:
> >>
> >> I recommend option (b), provided there is no partial metadata loading. We
> >> implemented option (b) internally to facilitate partial metadata loading,
> >> as we have tables with hundreds of thousands of snapshots. This results in
> >> metadata that occupies approximately 500 MB in memory (excluding the
> >> JsonNodes), which is a significant load for some of our services.
> >>
> >> Shani.
> >>
> >> On 12 Nov 2024, at 14:12, Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>
> >> Hey Iceberg Community,
> >>
> >> Background:
> >> Impala is designed in a way to cache the Iceberg table metadata (BaseTable
> >> objects in practice) for faster access. Currently, Impala is tightly
> >> coupled with HMS and in turn with the HiveCatalog, and in order to keep
> >> the cached table objects up-to-date there is a notification mechanism
> >> driven by HMS to notify Impala about any changes in the table metadata.
> >> The Impala community is actively looking for ways to decouple HMS from
> >> Impala and provide a way to use Impala without the need for HMS, and get
> >> the Iceberg table metadata from other catalog Implementations mainly
> >> focusing now on REST catalogs.
> >>
> >> Problem to solve:
> >> We identified a particular missing functionality in the current REST spec:
> >> For engines that cache table metadata currently there is no way to check
> >> if that table metadata is up-to-date or not, and whether the engine should
> >> reload the metadata for that table or not without getting a whole table
> >> object from the catalog. For this I think the REST catalog (but in fact I
> >> think this could apply to any other catalogs) should be able to answer a
> >> question like:
> >> "Hi Catalog, I have this version of this table, is it up-to-date?"
> >>
> >> Proposal:
> >> I've been following the discussion about partial metadata loading that
> >> could be also used to answer the above question, but I have the impression
> >> now that the conversation stopped making any progress.
> >> So instead of waiting for partial metadata loading I propose to have an
> >> addition to the REST spec now to answer the question I raised above:
> >>
> >> a) boolean isLatest(TableIdentifier ident, String metadataLocation);
> >> b) String metadataLocation(TableIdentifier ident);
> >>
> >> Any of the above 2 approaches could help engines to decide if they have to
> >> invalidate/reload particular table metadata in the cache. I personally
> >> would go for option a) but would be open to hear other opinions.
> >>
> >> I'd like to know if the community could support me extending the REST spec
> >> with any of the 2 options.
> >>
> >> Regards,
> >> Gabor

