Re: Question about replacing files and about Publishing Jars

Arvind Pruthi Tue, 26 Feb 2019 11:01:28 -0800

Well, the question/use case has to do with the legacy of the Hive Metastore. Since 
the Hive Metastore only tracked partitions and not individual files, use cases are 
built around that assumption. In this case, we have an important housekeeping 
flow which replaces existing files with files of the same name (after massaging 
the data inside these files).


This brings me to another fundamental question dealing with the legacy of HMS. 
While I understand it is clever, and in some cases beneficial, to hide 
partition info, losing that hierarchical abstraction (partition -> 
files) is a huge change with wide-reaching impact for existing users of HMS, and 
arguably one that may not always bear fruit. I would have liked to see this 
hierarchical abstraction retained. It would have made migration much easier for 
existing HMS users, while retaining a useful abstraction.

Have other people on this forum experienced similar pain?

Arvind


From: Ryan Blue <rb...@netflix.com>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Tuesday, February 26, 2019 at 9:54 AM
To: Jacques Nadeau <jacq...@dremio.com>
Cc: Iceberg Dev List <dev@iceberg.apache.org>, Arvind Pruthi 
<apru...@linkedin.com>
Subject: Re: Question about replacing files and about Publishing Jars

You could always embed version information in the file location, like S3's 
@<version> syntax. That's just another way to make it unique. Why is it 
necessary to overwrite the original file location though? That's why I don't 
think I understand the use case.
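
A minimal sketch of that idea, assuming the rewrite job is free to choose where it writes its output. The helper name `versioned_location` and the `@<version>` placement before the file extension are hypothetical choices for illustration, not part of Iceberg:

```python
import posixpath
import uuid


def versioned_location(location, version=None):
    """Return a new location derived from `location` that embeds a version token.

    Hypothetical helper: the token is placed before the file extension, so
    /data/p=1/part-0.parquet becomes /data/p=1/part-0@<version>.parquet.
    """
    if version is None:
        # Assumption: a random token is unique enough for this purpose.
        version = uuid.uuid4().hex
    directory, name = posixpath.split(location)
    stem, ext = posixpath.splitext(name)
    return posixpath.join(directory, f"{stem}@{version}{ext}")


# The rewritten file lands next to the original instead of on top of it.
new_location = versioned_location("/data/p=1/part-0.parquet", "v2")
```

The original file stays untouched at its old location, so any snapshot that still references it keeps reading the old data.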

On Tue, Feb 26, 2019 at 9:50 AM Jacques Nadeau <jacq...@dremio.com> wrote:
We're using etags for better clarity on this at Dremio (for a different use 
case). I wonder if the same thing should be available in Iceberg.
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Tue, Feb 26, 2019 at 9:48 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
Hi Arvind,

Iceberg assumes that all file locations are unique. If two snapshots refer to 
the same location, then whatever data file (or version) is in that location is 
what is read. What is your use case?
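
To illustrate why overwriting in place is a problem, here is a toy model. It assumes storage is a plain mapping from location to contents and a snapshot is just a list of locations; these are stand-ins for illustration, not Iceberg's actual data structures or API:

```python
# Toy model: `storage` maps a location to its current contents, and each
# snapshot is a list of locations. Not Iceberg's real data structures.
storage = {}
snapshots = {}

# Snapshot 1 references the original file.
storage["/data/p=1/part-0.parquet"] = "original rows"
snapshots[1] = ["/data/p=1/part-0.parquet"]

# A housekeeping job overwrites the file in place, then commits snapshot 2.
storage["/data/p=1/part-0.parquet"] = "rewritten rows"
snapshots[2] = ["/data/p=1/part-0.parquet"]


def read_snapshot(snapshot_id):
    """Return whatever currently sits at each location the snapshot lists."""
    return [storage[location] for location in snapshots[snapshot_id]]


old_view = read_snapshot(1)
new_view = read_snapshot(2)
```

Both views resolve to the rewritten contents, so the older snapshot no longer describes the data it was committed with.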

Apache Iceberg has no official releases yet. We still need to do some license 
work for binaries, get the build set up for Apache publication, finish a few 
more PRs, and rename packages. In the meantime, you can use JitPack to build 
binaries for specific commits. That should allow you to easily test the project 
if you don't want to build it yourself.

On Mon, Feb 25, 2019 at 6:27 PM Arvind Pruthi <apru...@linkedin.com> wrote:
Hello There,

Q1. What happens if a file is deleted and a new file is then added with 
the same name, while the snapshot in which the delete was registered is still 
around? There is no ambiguity from the point of view of listing manifest 
entries. However, there will be ambiguity at the HDFS level. How is that 
resolved? Also, any thoughts on the case where a file needs to be replaced with 
a different file of the same name (we have a use case for this)?

Q2. Are the Iceberg jars being published anywhere? I couldn’t find them in 
Maven Central.

Thanks,
Arvind


--
Ryan Blue
Software Engineer
Netflix


