Hi Jacques,
> On 21 May 2019, at 22:11, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> > It’s not at all clear why unique keys would be needed at all.
>
> If we turn your questions around, you answer yourself. If you have
> independent writers, you need unique keys.
>
> > Also truly independent writers (like a job writing while a job compacts),
> > means effectively a distributed transaction, and I believe it’s clearly out
> > of scope for Iceberg to solve that ?
>
> Assuming a single process is writing seems severely limiting in design and
> scale. I'm also surprised that you would think this is outside of Iceberg's
> scope. A table format that can only be modified by a single process basically
> locks that format into a single tool for a particular deployment.

That's my point: truly independent writers (two Spark jobs, or a Spark job and a Dremio job) mean a distributed transaction. It would need yet another external transaction coordinator on top of both Spark and Dremio; Iceberg by itself cannot solve this.

By single writer I don't mean a single process, I mean multiple coordinated processes, like Spark executors coordinated by the Spark driver. The coordinator ensures that the data is pre-partitioned on each executor, and the coordinator commits the snapshot.

Note however that a single writer job with multiple concurrent reader jobs is perfectly feasible, i.e. it shouldn't be a problem to write from a Spark job and read from multiple Dremio queries concurrently (for example).

> > Uniqueness - enforcing uniqueness at scale is not feasible (provably so).
>
> Expecting uniqueness is different than enforcing it. If you're saying it is
> impossible to enforce, I understand that. If your we can't define a system
> where it is expected and there are ramifications if it is not maintained.

I'm not sure what you mean exactly. If we can't enforce uniqueness we shouldn't assume it. We do expect that most of the time the natural key is unique, but the eager and lazy designs with a natural key can handle duplicates consistently. Basically it's not a problem to have duplicate natural keys; everything works fine.

> > Also, at scale, it’s really only feasible to do query and update/upsert on
> > the partition/bucket/sort key, any other access is likely a full scan of
> > terabytes of data, on remote storage.
>
> I'm not sure why you would say unless you assume a particular implementation.
> Single record deletion is definitely an important use case. There is no need
> to do a full table scan to accomplish that unless you're assuming an eager
> approach to deletion.

Let me try to clarify each point:

- Lookup for query or update on a non-(partition/bucket/sort) key predicate implies scanning large amounts of data, because those are the only data structures that can narrow down the lookup. One could argue that the min/max index (file skipping) can be applied to any column, but in reality, if that column is not sorted, the min/max intervals can have huge overlaps, so it may be next to useless (see the sketch after this list).

- Remote storage is a critical architecture decision: implementations on local storage imply a vastly different design for the entire system, both storage and compute.

- Deleting single records per snapshot is unfeasible in the eager design, but particularly so in the lazy design: each deletion creates a very small snapshot. Deleting 1 million records one at a time would create 1 million small files and 1 million RPC calls.
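To illustrate the min/max point, here is a toy Python sketch (just my illustration, not Iceberg code; the numbers and names are made up). Each "file" keeps only the min and max of one column, and a point lookup can skip a file only if the value falls outside that range:

import random

NUM_FILES = 1000
ROWS_PER_FILE = 1000
values = list(range(NUM_FILES * ROWS_PER_FILE))

def file_stats(files):
    # per-file column statistics: just (min, max)
    return [(min(f), max(f)) for f in files]

def files_to_scan(stats, lookup):
    # a file must be scanned if its [min, max] range contains the value
    return sum(1 for lo, hi in stats if lo <= lookup <= hi)

# Case 1: the column is sorted across files (it is the sort key).
sorted_files = [values[i * ROWS_PER_FILE:(i + 1) * ROWS_PER_FILE]
                for i in range(NUM_FILES)]

# Case 2: the same values shuffled before being split into files.
shuffled = values[:]
random.shuffle(shuffled)
unsorted_files = [shuffled[i * ROWS_PER_FILE:(i + 1) * ROWS_PER_FILE]
                  for i in range(NUM_FILES)]

lookup = random.choice(values)
print("sorted:  ", files_to_scan(file_stats(sorted_files), lookup), "of", NUM_FILES)
print("unsorted:", files_to_scan(file_stats(unsorted_files), lookup), "of", NUM_FILES)

The sorted case typically scans 1 file out of 1000; the shuffled case scans close to all 1000, because every file's min/max range spans almost the whole domain. That's all I mean by "next to useless" for columns that don't correlate with the partition/bucket/sort layout.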
> I do continue to wonder how much of this back and forth is the mixing of
> thinking around restatement (eager) versus delta (lazy) implementations.
> Maybe we should separate them out as two different conversations?

Eager is conceptually just lazy + compaction done, well, eagerly. The logic for both is exactly the same; the trade-off is just that with eager you implicitly compact on every write so that you don't do any work on read, while with lazy you want to amortize the cost of compaction over multiple snapshots. Basically there should be no conceptual difference between the two, with regard to keys or anything else. The only difference is some mechanics in the implementation (see the sketch below).
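To make that concrete, here is a toy Python sketch (again only an illustration, with made-up names, not code from the design doc). The same keyed delta is applied either at write time (eager/restatement) or at read time (lazy/delta), and both give the same result, including when the natural key is duplicated:

def apply_delta(rows, upserts, deletes):
    # shared merge logic: drop rows whose key is deleted or replaced,
    # then append the upserted rows; duplicate keys are all replaced
    touched = set(deletes) | {k for k, _ in upserts}
    kept = [(k, v) for k, v in rows if k not in touched]
    return kept + list(upserts)

# Eager (restatement): merge at write time, readers see plain data.
def eager_write(base_rows, upserts, deletes):
    return apply_delta(base_rows, upserts, deletes)    # compaction happens now

def eager_read(snapshot):
    return snapshot

# Lazy (delta): record the delta cheaply at write time, merge on read.
def lazy_write(base_snapshot, upserts, deletes):
    return base_snapshot + [(upserts, deletes)]

def lazy_read(snapshot):
    base, *deltas = snapshot
    rows = base
    for upserts, deletes in deltas:
        rows = apply_delta(rows, upserts, deletes)     # work deferred to read
    return rows

base = [("a", 1), ("b", 2), ("b", 3), ("c", 4)]        # duplicate natural key "b"
upserts = [("b", 20)]
deletes = ["c"]

eager = eager_read(eager_write(base, upserts, deletes))
lazy = lazy_read(lazy_write([base], upserts, deletes))
assert sorted(eager) == sorted(lazy) == [("a", 1), ("b", 20)]

The only real difference is when apply_delta runs and how often you choose to compact; the result, and the way duplicate natural keys are handled, is the same in both designs.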