aokolnychyi commented on a change in pull request #3164: URL: https://github.com/apache/iceberg/pull/3164#discussion_r714962375
##########
File path: core/src/main/java/org/apache/iceberg/io/FanoutWriter.java
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.io;
+
+import java.io.IOException;
+import java.util.Map;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.encryption.EncryptedOutputFile;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.StructLikeMap;
+
+/**
+ * A writer capable of writing to multiple specs and partitions that keeps files for each
+ * seen spec/partition pair open until this writer is closed.
+ * <p>
+ * As opposed to {@link ClusteredWriter}, this writer does not require the incoming records
+ * to be clustered by partition spec and partition as all files are kept open. As a consequence,
+ * this writer may potentially consume substantially more memory compared to {@link ClusteredWriter}.
+ * Use this writer only when clustering by spec/partition is not possible (e.g. streaming).
+ */
+abstract class FanoutWriter<T, R> implements PartitioningWriter<T, R> {
+
+  private final Map<Integer, Map<StructLike, FileWriter<T, R>>> writers = Maps.newHashMap();
+  private boolean closed = false;
+
+  protected abstract FileWriter<T, R> newWriter(PartitionSpec spec, StructLike partition);
+
+  protected abstract void addResult(R result);
+
+  protected abstract R aggregatedResult();
+
+  @Override
+  public void write(T row, PartitionSpec spec, StructLike partition) throws IOException {
+    FileWriter<T, R> writer = writer(spec, partition);
+    writer.write(row);
+  }
+
+  private FileWriter<T, R> writer(PartitionSpec spec, StructLike partition) {
+    Map<StructLike, FileWriter<T, R>> specWriters = writers.computeIfAbsent(
+        spec.specId(),
+        id -> StructLikeMap.create(spec.partitionType()));
+    FileWriter<T, R> writer = specWriters.get(partition);

Review comment:
If I am not mistaken, we only use the fanout writer for partitioned tables, even in the old implementation.

You are right that this is the place that needs attention. Like I said [here](https://github.com/apache/iceberg/pull/3164#discussion_r714123188), we have an extra `computeIfAbsent` call and use `StructLikeMap` instead of a regular map with `PartitionKey`. While the performance hit seems to be negligible according to the benchmark results I posted, I am up for optimizing this as much as possible.

One thing to consider is the performance of `equals` and `hashCode` in `StructLikeWrapper` vs `PartitionKey`. They are relatively simple and efficient in `PartitionKey`, where we compare/iterate through an object array. In the wrapper, these methods are more involved but don't seem drastically expensive.

One optimization idea is to introduce a cache of the `Comparator` and `JavaHash` objects we use in the wrapper. Right now, we create a comparator and a `JavaHash` for every partition we add to `StructLikeMap`. Even if we write to 1k partitions, I am not sure the difference would be noticeable.
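To make the caching idea concrete, here is a rough, self-contained sketch. It does not use the Iceberg classes: `SketchStructType`, `HashFunctionCache`, and `CachedHashWrapper` are hypothetical names made up for illustration, and a plain `ToIntFunction<Object[]>` stands in for the hash helper. The only point is that the helper is built once per struct type and shared by every wrapper of that type. A comparator could presumably be cached the same way.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for a struct type: an ordered list of field names.
final class SketchStructType {
  private final List<String> fields;

  SketchStructType(String... fields) {
    this.fields = List.of(fields);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof SketchStructType && fields.equals(((SketchStructType) other).fields);
  }

  @Override
  public int hashCode() {
    return fields.hashCode();
  }
}

// Hypothetical cache: builds the hash function for a type once and shares it.
final class HashFunctionCache {
  // Counts how many hash functions were actually built (for the demo only).
  static final AtomicInteger BUILDS = new AtomicInteger();
  private static final Map<SketchStructType, ToIntFunction<Object[]>> CACHE = new ConcurrentHashMap<>();

  private HashFunctionCache() {
  }

  static ToIntFunction<Object[]> forType(SketchStructType type) {
    return CACHE.computeIfAbsent(type, ignored -> {
      BUILDS.incrementAndGet();
      ToIntFunction<Object[]> hash = values -> Arrays.hashCode(values);
      return hash;
    });
  }
}

// Hypothetical wrapper that looks up its hash function instead of creating one.
final class CachedHashWrapper {
  private final ToIntFunction<Object[]> hash;
  private final Object[] values;

  CachedHashWrapper(SketchStructType type, Object[] values) {
    this.hash = HashFunctionCache.forType(type);
    this.values = values;
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof CachedHashWrapper && Arrays.equals(values, ((CachedHashWrapper) other).values);
  }

  @Override
  public int hashCode() {
    return hash.applyAsInt(values);
  }
}

final class CachedHashDemo {
  public static void main(String[] args) {
    SketchStructType type = new SketchStructType("region", "year");
    // 1000 wrappers of the same type share a single cached hash function.
    for (int i = 0; i < 1000; i++) {
      new CachedHashWrapper(type, new Object[] {"region-" + i, 2021});
    }
    System.out.println("hash functions built: " + HashFunctionCache.BUILDS.get());
  }
}
```

The cache is keyed by the type itself, so it relies on the type having cheap and correct `equals`/`hashCode`. Whether that holds for the real struct type, and whether the saving is measurable at all, would need to be checked with a benchmark.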
Another optimization idea is to introduce a new interface to indicate when a `StructLike` is backed by an array of values. If two structs implement that interface, we can just compare the arrays in `StructLikeWrapper`.

I am going to do a separate benchmark for `HashMap` with `PartitionKey` and `StructLikeMap` with `PartitionKey`.
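A rough sketch of what the array-backed fast path could look like. Everything here is hypothetical and simplified: `SketchStruct` stands in for a struct accessor, and `ArrayBackedStruct`, `ArrayStruct`, and `StructEquality` are names invented for the example, not existing Iceberg APIs.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical, simplified stand-in for a struct accessor.
interface SketchStruct {
  int size();

  Object get(int pos);
}

// Hypothetical interface signalling that the struct exposes its backing array.
interface ArrayBackedStruct extends SketchStruct {
  Object[] backingArray();
}

// A struct that keeps its values in a plain object array.
final class ArrayStruct implements ArrayBackedStruct {
  private final Object[] values;

  ArrayStruct(Object... values) {
    this.values = values;
  }

  @Override
  public int size() {
    return values.length;
  }

  @Override
  public Object get(int pos) {
    return values[pos];
  }

  @Override
  public Object[] backingArray() {
    return values;
  }
}

final class StructEquality {
  private StructEquality() {
  }

  static boolean structsEqual(SketchStruct left, SketchStruct right) {
    if (left instanceof ArrayBackedStruct && right instanceof ArrayBackedStruct) {
      // Fast path: both sides expose arrays, so compare them directly.
      return Arrays.equals(
          ((ArrayBackedStruct) left).backingArray(),
          ((ArrayBackedStruct) right).backingArray());
    }

    // Generic path: compare field by field through the accessor.
    if (left.size() != right.size()) {
      return false;
    }
    for (int pos = 0; pos < left.size(); pos++) {
      if (!Objects.equals(left.get(pos), right.get(pos))) {
        return false;
      }
    }
    return true;
  }
}

final class StructEqualityDemo {
  public static void main(String[] args) {
    SketchStruct first = new ArrayStruct("us", 2021);
    SketchStruct second = new ArrayStruct("us", 2021);
    System.out.println(StructEquality.structsEqual(first, second));
  }
}
```

One caveat with this sketch: `Arrays.equals` relies on each value's own `equals`, so the fast path is only safe if that agrees with the type-aware comparison the wrapper would otherwise perform. Values such as `CharSequence` implementations or byte arrays may need special handling.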
