cajil opened a new issue, #6047:
URL: https://github.com/apache/hudi/issues/6047
**Describe the problem you faced**
I am trying to use the `hdfsparquetimport` tool, available within hudi-cli, to bootstrap a table into Hudi. The table to be bootstrapped is in Parquet file format. While doing so, I am facing multiple issues due to a schema mismatch between what is in the Parquet file and what is defined in the Avro schema file.
Note: tests were run in an AWS EMR environment.
Avro schema definition:
```json
{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    {
      "name": "CREATEDBY",
      "type": "string"
    },
    {
      "name": "ID",
      "type": "int"
    },
    {
      "name": "CLIENT_ID",
      "type": "int"
    }
  ]
}
```
When I load the Parquet file into a Spark DataFrame and inspect the schema, it looks like this:
```
 |-- CREATEDBY: string (nullable = true)
 |-- ID: decimal(12,0) (nullable = true)
 |-- _CONTEXT_ID_: decimal(12,0) (nullable = true)
```
Command used:
```
hdfsparquetimport --upsert false --srcPath s3://test/imported_data/TESTBOOTSTRAP/ --targetPath s3://test/hudi_converted/ --tableName TESTBOOTSTRAP --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 1500 --schemaFilePath s3://test/scripts/test.avsc --format parquet --sparkMemory 6g --retry 1
```
I have tried both `int` and `long` datatypes in the Avro schema for the decimal columns. Both result in the following error:
```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file .......
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter
```
or
```
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
```
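My reading of these converter errors (an assumption, not confirmed against the parquet-avro source): the Parquet column carries a `DECIMAL(12,0)` annotation, which a plain Avro `int`/`long` cannot represent, since 12-digit unscaled values can overflow a 32-bit `int` and a bare `long` has no decimal annotation for the converter to match. A stdlib-only sketch of the range issue:

```python
# Illustration (assumption, not from the Hudi source): a decimal(12,0)
# column can hold 12-digit values, which do not fit a 32-bit Avro "int".
INT32_MAX = 2**31 - 1            # 2147483647: only 10 digits
DECIMAL_12_0_MAX = 10**12 - 1    # 999999999999: 12 digits

print(DECIMAL_12_0_MAX > INT32_MAX)  # True: values may overflow int32
```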
I have also tried defining an Avro `logicalType` for the decimal columns, as suggested in a few blogs:
```json
{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    {
      "name": "CREATEDBY",
      "type": "string"
    },
    {
      "name": "ID",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 12,
        "scale": 0
      }
    },
    {
      "name": "CLIENT_ID",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 12,
        "scale": 0
      }
    }
  ]
}
```
This time the data is generated, but the partition path comes out in a strange format:
<img width="497" alt="Screenshot 2022-07-05 at 3 41 28 PM" src="https://user-images.githubusercontent.com/102692442/177305356-37e0f8ad-d89d-46cd-a945-cf76f8abeb48.png">
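My guess at why the partition path looks garbled (an assumption, not verified against Hudi's code): with the `bytes`/`decimal` schema, the partition value may be the raw two's-complement unscaled bytes of the decimal, so rendering it as a string yields non-printable characters rather than digits. A minimal stdlib sketch of that encoding:

```python
from decimal import Decimal

def encode_avro_decimal(value: Decimal, scale: int = 0) -> bytes:
    """Encode a value the way Avro's bytes/decimal logical type does:
    the unscaled integer as two's-complement big-endian bytes."""
    unscaled = int(value.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)

# A CLIENT_ID like 1001 becomes two raw bytes, not the string "1001":
print(encode_avro_decimal(Decimal(1001)))  # b'\x03\xe9'
```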
I have also tried to reproduce this on my local machine and see the same behaviour:
```
hdfsparquetimport --upsert false --srcPath /Users/user1/hudi-res/parquet_data/ --targetPath /Users/user1/hudi-res/hudi_converted/ --tableName BOOTSTRAPTEST --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 50 --schemaFilePath /Users/user1/hudi-res/schema/test.avsc --format parquet --sparkMemory 2G --retry 1 --sparkMaster local
```
**Expected behavior**
What I am attempting is very basic functionality of the `hdfsparquetimport` tool: converting table files in Parquet format to Hudi format. Please let me know if something is wrong with my config/command.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.2.0
* Storage : S3
--
This is an automated message from the Apache Git Service.