[
https://issues.apache.org/jira/browse/PARQUET-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768403#comment-17768403
]
ASF GitHub Bot commented on PARQUET-2347:
-----------------------------------------
wgtmac commented on code in PR #1141:
URL: https://github.com/apache/parquet-mr/pull/1141#discussion_r1335198227
##########
parquet-pig/src/main/java/org/apache/parquet/pig/TupleReadSupport.java:
##########
@@ -154,9 +172,9 @@ private static FieldSchema union(FieldSchema mergedFieldSchema, FieldSchema newF
   @Override
   public ReadContext init(InitContext initContext) {
-    Schema pigSchema = getPigSchema(initContext.getConfiguration());
-    RequiredFieldList requiredFields = getRequiredFields(initContext.getConfiguration());
-    boolean columnIndexAccess = initContext.getConfiguration().getBoolean(PARQUET_COLUMN_INDEX_ACCESS, false);
+    Schema pigSchema = getPigSchema(initContext.getConfig());
Review Comment:
Having both `initContext.getConfig` and `initContext.getConfiguration` would be
confusing. Can we use a clearer name, something like
`initContext.getParquetConfiguration`?
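For illustration, this is roughly how the call sites in `TupleReadSupport.init` would read with the clearer accessor name (a sketch only; it assumes the accessor is renamed as suggested):
```java
// Sketch: the TupleReadSupport.init call sites, assuming getConfig() is
// renamed to getParquetConfiguration() as suggested above.
Schema pigSchema = getPigSchema(initContext.getParquetConfiguration());
RequiredFieldList requiredFields = getRequiredFields(initContext.getParquetConfiguration());
boolean columnIndexAccess =
    initContext.getParquetConfiguration().getBoolean(PARQUET_COLUMN_INDEX_ACCESS, false);
```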
##########
parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java:
##########
@@ -40,14 +42,22 @@
 public abstract class AbstractThriftWriteSupport<T> extends WriteSupport<T> {
   public static final String PARQUET_THRIFT_CLASS = "parquet.thrift.class";
   private static final Logger LOG = LoggerFactory.getLogger(AbstractThriftWriteSupport.class);
-  private static Configuration conf;
+  private static ParquetConfiguration conf;
   public static void setGenericThriftClass(Configuration configuration, Class<?> thriftClass) {
+    setGenericThriftClass(new HadoopParquetConfiguration(configuration), thriftClass);
+  }
+
+  public static void setGenericThriftClass(ParquetConfiguration configuration, Class<?> thriftClass) {
     conf = configuration;
     configuration.set(PARQUET_THRIFT_CLASS, thriftClass.getName());
   }
   public static Class getGenericThriftClass(Configuration configuration) {
+    return getGenericThriftClass(new HadoopParquetConfiguration(configuration));
+  }
+
+  public static Class<?> getGenericThriftClass(ParquetConfiguration configuration) {
Review Comment:
Why does this overload have a different return type?
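For illustration, a minimal sketch of how the two overloads could keep a consistent return type (the body shown here is a simplified placeholder, not the PR's actual implementation):
```java
// Sketch: both overloads return Class<?>, so callers see the same signature
// whether they pass a Hadoop Configuration or a ParquetConfiguration.
public static Class<?> getGenericThriftClass(Configuration configuration) {
  return getGenericThriftClass(new HadoopParquetConfiguration(configuration));
}

public static Class<?> getGenericThriftClass(ParquetConfiguration configuration) {
  String thriftClassName = configuration.get(PARQUET_THRIFT_CLASS, null); // simplified lookup
  try {
    return Class.forName(thriftClassName);
  } catch (ClassNotFoundException e) {
    throw new RuntimeException("Cannot load Thrift class " + thriftClassName, e);
  }
}
```
Note that narrowing the existing raw `Class` return type to `Class<?>` could affect callers that rely on unchecked raw-type assignment, which may be why the existing overload was left as-is.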
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java:
##########
@@ -261,7 +263,7 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
       LOG.debug("read value: {}", currentValue);
     } catch (RuntimeException e) {
-      throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, reader.getPath()), e);
+      throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, reader.getFile()), e);
Review Comment:
If my previous statement is correct, then perhaps we do not need to change
this line?
##########
pom.xml:
##########
@@ -547,6 +547,8 @@
</excludeModules>
<excludes>
<exclude>${shade.prefix}</exclude>
+ <exclude>org.apache.parquet.hadoop.CodecFactory</exclude>
Review Comment:
It would be worth adding some comments to explain these lines.
##########
parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java:
##########
@@ -333,13 +406,17 @@ public Builder copy(ParquetReadOptions options) {
     public ParquetReadOptions build() {
       if (codecFactory == null) {
-        codecFactory = HadoopCodecs.newFactory(0);
+        if (conf == null) {
+          codecFactory = HadoopCodecs.newFactory(0);
+        } else {
+          codecFactory = HadoopCodecs.newFactory(conf, 0);
Review Comment:
It seems that the original ParquetReadOptions does not require any Configuration
parameter. Should we avoid adding one by using ParquetConfiguration internally
in HadoopCodecs?
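One possible reading of this suggestion is sketched below; the no-argument `HadoopParquetConfiguration` constructor and a `CodecFactory` constructor taking a `ParquetConfiguration` are assumptions here, not necessarily what the PR provides:
```java
// Sketch: HadoopCodecs defaults to a Hadoop-backed ParquetConfiguration internally,
// so ParquetReadOptions.Builder would not need its own conf field/branch for this.
public static CompressionCodecFactory newFactory(int sizeHint) {
  return newFactory(new HadoopParquetConfiguration(), sizeHint);
}

public static CompressionCodecFactory newFactory(ParquetConfiguration conf, int sizeHint) {
  return new CodecFactory(conf, sizeHint); // assumes a ParquetConfiguration-based CodecFactory constructor
}
```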
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/ConfigurationUtil.java:
##########
@@ -41,4 +49,18 @@ public static Class<?> getClassFromConfig(Configuration configuration, String co
     }
   }
+  public static Configuration createHadoopConfiguration(ParquetConfiguration conf) {
+    if (conf == null) {
+      return new Configuration();
+    }
+    if (conf instanceof HadoopParquetConfiguration) {
+      return ((HadoopParquetConfiguration) conf).getConfiguration();
+    }
+    Configuration configuration = new Configuration();
Review Comment:
When would this happen?
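For context, this branch would only be reached by a `ParquetConfiguration` implementation that is not Hadoop-backed (for example, a hypothetical map-backed implementation); in that case the remaining lines presumably copy its entries into the fresh Hadoop `Configuration`, roughly like this sketch (which assumes `ParquetConfiguration` is iterable over its entries):
```java
// Sketch of the fallback copy, assuming ParquetConfiguration exposes its entries
// as Iterable<Map.Entry<String, String>> like Hadoop's Configuration does.
Configuration configuration = new Configuration();
for (Map.Entry<String, String> entry : conf) {
  configuration.set(entry.getKey(), entry.getValue());
}
return configuration;
```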
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api/InitContext.java:
##########
@@ -77,6 +87,10 @@ public Map<String, String> getMergedKeyValueMetaData() {
    * @return the configuration for this job
    */
   public Configuration getConfiguration() {
+    return ConfigurationUtil.createHadoopConfiguration(configuration);
+  }
+
+  public ParquetConfiguration getConfig() {
Review Comment:
```suggestion
public ParquetConfiguration getParquetConfiguration() {
```
WDYT?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java:
##########
@@ -246,9 +262,9 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) {
         codecClass = Class.forName(codecClassName);
       } catch (ClassNotFoundException e) {
         // Try to load the class using the job classloader
-        codecClass = configuration.getClassLoader().loadClass(codecClassName);
+        codecClass = new Configuration(false).getClassLoader().loadClass(codecClassName);
Review Comment:
Sorry, I don't understand this line. Why not use `configuration` (i.e. the
ParquetConfiguration)?
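For reference, the alternative this question points at might look like the following sketch, which derives the job classloader from the passed-in configuration's Hadoop view instead of a freshly created `Configuration`:
```java
try {
  codecClass = Class.forName(codecClassName);
} catch (ClassNotFoundException e) {
  // Try to load the class using the job classloader taken from the
  // passed-in ParquetConfiguration (via its Hadoop Configuration view).
  Configuration hadoopConf = ConfigurationUtil.createHadoopConfiguration(configuration);
  codecClass = hadoopConf.getClassLoader().loadClass(codecClassName);
}
```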
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java:
##########
@@ -246,9 +262,9 @@ protected CompressionCodec getCodec(CompressionCodecName codecName) {
         codecClass = Class.forName(codecClassName);
       } catch (ClassNotFoundException e) {
         // Try to load the class using the job classloader
-        codecClass = configuration.getClassLoader().loadClass(codecClassName);
+        codecClass = new Configuration(false).getClassLoader().loadClass(codecClassName);
       }
-      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);
+      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, ConfigurationUtil.createHadoopConfiguration(configuration));
Review Comment:
If the codec still comes from the Hadoop dependency, does it mean that we can
only use the uncompressed codec when the Hadoop codecs are not available?
cc @Fokko as I remember the effort to remove the Hadoop dependency from the
Parquet codecs.
##########
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java:
##########
@@ -423,7 +435,7 @@ private static GenericData getDataModel(Configuration conf, Schema schema) {
     Class<? extends AvroDataSupplier> suppClass = conf.getClass(
         AVRO_DATA_SUPPLIER, SpecificDataSupplier.class,
         AvroDataSupplier.class);
-    return ReflectionUtils.newInstance(suppClass, conf).get();
+    return ReflectionUtils.newInstance(suppClass, ConfigurationUtil.createHadoopConfiguration(conf)).get();
Review Comment:
Does it mean that we still cannot get rid of the Hadoop dependency? Either way,
we still have to depend on the Configuration class here.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java:
##########
@@ -167,13 +169,13 @@ public float getProgress() throws IOException, InterruptedException {
   public void initialize(ParquetFileReader reader, ParquetReadOptions options) {
     // copy custom configuration to the Configuration passed to the ReadSupport
-    Configuration conf = new Configuration();
-    if (options instanceof HadoopReadOptions) {
-      conf = ((HadoopReadOptions) options).getConf();
-    }
+    ParquetConfiguration conf = Objects.requireNonNull(options).getConfiguration();
     for (String property : options.getPropertyNames()) {
       conf.set(property, options.getProperty(property));
     }
+    for (Map.Entry<String, String> property : new Configuration()) {
Review Comment:
CMIW, we do not need to address the Hadoop dependency in the parquet-hadoop
module, right?
##########
parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java:
##########
@@ -254,29 +273,64 @@ public RecordMaterializer<T> prepareForRead(Configuration configuration,
         configuration);
   }
-  @SuppressWarnings("unchecked")
+  @Override
+  public RecordMaterializer<T> prepareForRead(ParquetConfiguration configuration,
+                                              Map<String, String> keyValueMetaData, MessageType fileSchema,
+                                              org.apache.parquet.hadoop.api.ReadSupport.ReadContext readContext) {
+    ThriftMetaData thriftMetaData = ThriftMetaData.fromExtraMetaData(keyValueMetaData);
+    try {
+      initThriftClass(thriftMetaData, configuration);
+    } catch (ClassNotFoundException e) {
+      throw new RuntimeException("Cannot find Thrift object class for metadata: " + thriftMetaData, e);
+    }
+
+    // if there was not metadata in the file, get it from requested class
+    if (thriftMetaData == null) {
+      thriftMetaData = ThriftMetaData.fromThriftClass(thriftClass);
+    }
+
+    String converterClassName = configuration.get(RECORD_CONVERTER_CLASS_KEY, RECORD_CONVERTER_DEFAULT);
+    return getRecordConverterInstance(converterClassName, thriftClass,
+        readContext.getRequestedSchema(), thriftMetaData.getDescriptor(),
+        configuration);
+  }
+
   private static <T> ThriftRecordConverter<T> getRecordConverterInstance(
       String converterClassName, Class<T> thriftClass,
       MessageType requestedSchema, StructType descriptor, Configuration conf) {
-    Class<ThriftRecordConverter<T>> converterClass;
+    return getRecordConverterInstance(converterClassName, thriftClass, requestedSchema, descriptor, conf, Configuration.class);
+  }
+
+  private static <T> ThriftRecordConverter<T> getRecordConverterInstance(
+      String converterClassName, Class<T> thriftClass,
+      MessageType requestedSchema, StructType descriptor, ParquetConfiguration conf) {
+    return getRecordConverterInstance(converterClassName, thriftClass, requestedSchema, descriptor, conf, ParquetConfiguration.class);
+  }
+
+  @SuppressWarnings("unchecked")
+  private static <T1, T2> ThriftRecordConverter<T1> getRecordConverterInstance(
Review Comment:
nit: could we give T1 and T2 better names?
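For illustration, the shared helper's signature with more descriptive type parameters could look like this sketch (the parameter list is inferred from the callers above; the body is a placeholder, not the PR's reflective construction):
```java
// Sketch: RECORD for the Thrift record type, CONF for the configuration type
// (Configuration or ParquetConfiguration).
@SuppressWarnings("unchecked")
private static <RECORD, CONF> ThriftRecordConverter<RECORD> getRecordConverterInstance(
    String converterClassName, Class<RECORD> thriftClass,
    MessageType requestedSchema, StructType descriptor,
    CONF conf, Class<CONF> confClass) {
  return null; // placeholder; the reflective converter construction from the PR stays unchanged
}
```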
> Add interface layer between Parquet and Hadoop Configuration
> ------------------------------------------------------------
>
> Key: PARQUET-2347
> URL: https://issues.apache.org/jira/browse/PARQUET-2347
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Atour Mousavi Gourabi
> Priority: Minor
>
> Parquet relies heavily on a few Hadoop classes, such as its Configuration
> class, which is used throughout Parquet's reading and writing logic. If we
> include our own interface for this, it could potentially allow users to use
> Parquet's readers and writers without the Hadoop dependency later on.
> In order to preserve backward compatibility and avoid breaking downstream
> projects, the constructors and methods using Hadoop's Configuration should be
> preserved for the time being, though I would favour deprecation in the near
> future.
> This is part of an effort that has been [discussed on the dev mailing
> list|https://lists.apache.org/thread/4wl0l3d9dkpx4w69jx3rwnjk034dtqr8].
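For readers unfamiliar with the proposal, a minimal sketch of the kind of interface layer being described follows; the exact method set of the `ParquetConfiguration` interface in the PR may differ, and the `Iterable` bound here is an assumption:
```java
// Sketch only: an abstraction over Hadoop's Configuration, so Parquet code can
// read and write settings without a hard Hadoop dependency. A Hadoop-backed
// implementation (HadoopParquetConfiguration in the PR) would wrap Configuration.
public interface ParquetConfiguration extends Iterable<java.util.Map.Entry<String, String>> {
  void set(String name, String value);
  String get(String name, String defaultValue);
  boolean getBoolean(String name, boolean defaultValue);
}
```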
--
This message was sent by Atlassian Jira
(v8.20.10#820010)