jackye1995 commented on a change in pull request #2678:
URL: https://github.com/apache/iceberg/pull/2678#discussion_r655862496
##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -132,7 +132,7 @@
this.splitLookback =
options.get(SparkReadOptions.LOOKBACK).map(Integer::parseInt).orElse(null);
this.splitOpenFileCost =
options.get(SparkReadOptions.FILE_OPEN_COST).map(Long::parseLong).orElse(null);
- if (table.io() instanceof HadoopFileIO) {
+ if (table.io() instanceof HadoopConfigurable) {
Review comment:
This seems related to the dev list discussion about the locality read
configuration. In the use case I am trying to support, that code path actually
needs to run and turn locality read on by default even though the `FileIO` is
not a `HadoopFileIO`. So we have both cases to support, and the `FileIO`
implementation and file URI alone are not sufficient to determine the locality
read preference.
In that code path, because this happens at reader initialization time, a
table property seems to be the best place to store the default behavior,
although I agree it is not really elegant since it is Spark-specific.
For now, I think the `HadoopConfigurable` check is more flexible than the
`HadoopFileIO` check: users who implement `HadoopConfigurable` but do not need
locality are likely not using HDFS anyway and will fail the check, and they can
also use the `locality` option to override the default and turn it off.
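The decision logic described above can be sketched as follows. This is an illustrative mock, not the real Iceberg API: the `FileIO`, `HadoopFileIO`, and `HadoopConfigurable` types here are stand-ins, and `localityEnabled` is a hypothetical helper showing how an `instanceof HadoopConfigurable` default combines with a user-supplied `locality` option override.

```java
import java.util.Optional;

// Mock stand-ins for the Iceberg types discussed in the comment.
interface FileIO {}
interface HadoopConfigurable extends FileIO {}
class HadoopFileIO implements HadoopConfigurable {}
class CustomHadoopBackedIO implements HadoopConfigurable {}
class S3OnlyIO implements FileIO {}

public class LocalityCheck {
  // Locality defaults to true for any HadoopConfigurable IO (not just
  // HadoopFileIO), and the "locality" read option overrides the default.
  static boolean localityEnabled(FileIO io, Optional<String> localityOption) {
    boolean defaultLocality = io instanceof HadoopConfigurable;
    return localityOption.map(Boolean::parseBoolean).orElse(defaultLocality);
  }

  public static void main(String[] args) {
    System.out.println(localityEnabled(new HadoopFileIO(), Optional.empty()));
    System.out.println(localityEnabled(new CustomHadoopBackedIO(), Optional.empty()));
    System.out.println(localityEnabled(new S3OnlyIO(), Optional.empty()));
    System.out.println(localityEnabled(new HadoopFileIO(), Optional.of("false")));
  }
}
```

Under this sketch, a custom `HadoopConfigurable` implementation gets locality by default, while the `locality` option can still turn it off explicitly.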
##########
File path: core/src/main/java/org/apache/iceberg/util/SerializationUtil.java
##########
@@ -43,6 +45,16 @@ private SerializationUtil() {
}
}
+ public static byte[] serializeToBytesWithHadoopConfig(Object obj) {
Review comment:
The original thinking was that `org.apache.iceberg.hadoop.SerializableConfiguration`
was not always the default: `SerializableTable` uses a map-based serializer,
and Spark has its own serializer. So I don't want users to blindly assume that
`serializeToBytes` takes care of the Hadoop configuration in all places. If we
are making it the default, then I will add documentation to `serializeToBytes`
to make this clear.
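The pattern behind a `serializeToBytesWithHadoopConfig`-style helper is to wrap the non-serializable configuration in a serializable holder before writing the object graph. The sketch below illustrates the idea with a mock `Config` class standing in for Hadoop's `Configuration` (which does not implement `Serializable`); the real code uses `org.apache.iceberg.hadoop.SerializableConfiguration`, and the class names here are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for Hadoop's Configuration, which is not Serializable.
class Config {
  String value;
  Config(String value) { this.value = value; }
}

// Serializable wrapper: persists only the state needed to rebuild the
// Config after deserialization, mirroring SerializableConfiguration's role.
class SerializableConfig implements Serializable {
  private final String value;
  SerializableConfig(Config conf) { this.value = conf.value; }
  Config get() { return new Config(value); }
}

public class SerializationSketch {
  static byte[] serializeToBytes(Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    }
    return bytes.toByteArray();
  }

  static Object deserializeFromBytes(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
      return in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    // Round-trip: serialize the wrapper, then rebuild a usable Config.
    byte[] data = serializeToBytes(new SerializableConfig(new Config("fs.defaultFS=hdfs://nn")));
    Config roundTripped = ((SerializableConfig) deserializeFromBytes(data)).get();
    System.out.println(roundTripped.value);
  }
}
```

The point of the review comment is exactly that this wrapping step is not universal: a map-based or Spark-native serializer handles the configuration differently, so callers should not assume a plain `serializeToBytes` performs it.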
##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.util.SerializableSupplier;
+
+/**
+ * An interface that extends the Hadoop {@link Configurable} interface to
+ * offer better serialization support for customizable Iceberg objects
+ * such as {@link org.apache.iceberg.io.FileIO}.
+ * <p>
+ * If an object is serialized and needs to use Hadoop configuration,
+ * it is recommended that the object implement this interface so that
+ * a serializable supplier of configuration can be provided instead of
+ * an actual Hadoop configuration, which is not serializable.
+ */
+public interface HadoopConfigurable extends Configurable {
+
+ /**
+ * A serializable object requiring Hadoop configuration in Iceberg must be able to
+ * accept a Hadoop configuration supplier, and use the supplier to get Hadoop configurations.
+ * This ensures that Hadoop configuration can be serialized and passed around in a distributed environment.
+ * @param confSupplier a serializable supplier of Hadoop configuration
+ */
+ default void setConfSupplier(SerializableSupplier<Configuration> confSupplier) {
Review comment:
fixed. I was thinking about backwards compatibility, but there isn't
really anything to be backwards compatible with in this case...
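For context, the supplier-based pattern the interface proposes can be sketched with plain Java types. This is a hypothetical minimal version, not the real Iceberg API: `SerializableSupplier`, `Conf`, and `MyFileIO` here are illustrative stand-ins showing why a class holds a serializable supplier rather than the raw (non-serializable) configuration.

```java
import java.io.Serializable;
import java.util.function.Supplier;

// Minimal stand-in for org.apache.iceberg.util.SerializableSupplier.
interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

// Stand-in for Hadoop's Configuration.
class Conf {}

class MyFileIO implements Serializable {
  private SerializableSupplier<Conf> confSupplier;

  // Instead of holding a Conf field (which would break Java serialization),
  // hold a supplier that can rebuild the Conf after deserialization.
  void setConfSupplier(SerializableSupplier<Conf> confSupplier) {
    this.confSupplier = confSupplier;
  }

  Conf conf() {
    return confSupplier.get();
  }
}

public class ConfSupplierSketch {
  public static void main(String[] args) {
    MyFileIO io = new MyFileIO();
    // A constructor reference works as the supplier because
    // SerializableSupplier is a functional interface.
    io.setConfSupplier(Conf::new);
    System.out.println(io.conf() != null);
  }
}
```

With this shape, the `MyFileIO` instance itself serializes cleanly, and the configuration is reconstructed lazily on the deserializing side.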
##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopConfigurable.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hadoop;
+
+import org.apache.hadoop.conf.Configurable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.util.SerializableSupplier;
+
+/**
+ * An interface that extends the Hadoop {@link Configurable} interface to
+ * offer better serialization support for customizable Iceberg objects
+ * such as {@link org.apache.iceberg.io.FileIO}.
+ * <p>
+ * If an object is serialized and needs to use Hadoop configuration,
+ * it is recommended that the object implement this interface so that
+ * a serializable supplier of configuration can be provided instead of
+ * an actual Hadoop configuration, which is not serializable.
Review comment:
fixed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]