cajil opened a new issue, #6047:
URL: https://github.com/apache/hudi/issues/6047
**Describe the problem you faced**
I am trying to use the `hdfsparquetimport` tool, available within hudi-cli, to bootstrap a table into Hudi. The table to be bootstrapped is in Parquet file format. While doing so, I am facing multiple issues due to a schema mismatch between what is in the Parquet file and what is defined in the Avro schema file.
Note: tests were run in an AWS EMR environment.
Avro schema definition:
```json
{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    {
      "name": "CREATEDBY",
      "type": "string"
    },
    {
      "name": "ID",
      "type": "int"
    },
    {
      "name": "CLIENT_ID",
      "type": "int"
    }
  ]
}
```
When I load the Parquet file into a Spark DataFrame and inspect the schema, it looks like this:
```
 |-- CREATEDBY: string (nullable = true)
 |-- ID: decimal(12,0) (nullable = true)
 |-- _CONTEXT_ID_: decimal(12,0) (nullable = true)
```
Command used:
```
hdfsparquetimport --upsert false --srcPath s3://test/imported_data/TESTBOOTSTRAP/ --targetPath s3://test/hudi_converted/ --tableName TESTBOOTSTRAP --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 1500 --schemaFilePath s3://test/scripts/test.avsc --format parquet --sparkMemory 6g --retry 1
```
I have tried both `int` and `long` datatypes in the Avro schema for the decimal columns. Both result in the following error:
```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file .......
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter
```
or
```
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
```
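My reading of these converter errors (an assumption, not confirmed against the parquet-avro source): the Parquet column carries a `DECIMAL(12,0)` annotation, which a plain Avro `int`/`long` cannot represent, since 12-digit unscaled values can overflow a 32-bit `int` and a bare `long` has no decimal annotation for the converter to match. A stdlib-only sketch of the range issue:

```python
# Illustration (assumption, not from the Hudi source): a decimal(12,0)
# column can hold 12-digit values, which do not fit a 32-bit Avro "int".
INT32_MAX = 2**31 - 1            # 2147483647: only 10 digits
DECIMAL_12_0_MAX = 10**12 - 1    # 999999999999: 12 digits

print(DECIMAL_12_0_MAX > INT32_MAX)  # True: values may overflow int32
```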
I have also tried defining an Avro `logicalType` for the decimal columns, as suggested in a few blogs:
```json
{
  "name": "bootstraptest",
  "type": "record",
  "fields": [
    {
      "name": "CREATEDBY",
      "type": "string"
    },
    {
      "name": "ID",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 12,
        "scale": 0
      }
    },
    {
      "name": "CLIENT_ID",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 12,
        "scale": 0
      }
    }
  ]
}
```
This time the data is generated, but the partition path comes out in a strange format:
<img width="497" alt="Screenshot 2022-07-05 at 3 41 28 PM" src="https://user-images.githubusercontent.com/102692442/177305356-37e0f8ad-d89d-46cd-a945-cf76f8abeb48.png">
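My guess at why the partition path looks garbled (an assumption, not verified against Hudi's code): with the `bytes`/`decimal` schema, the partition value may be the raw two's-complement unscaled bytes of the decimal, so rendering it as a string yields non-printable characters rather than digits. A minimal stdlib sketch of that encoding:

```python
from decimal import Decimal

def encode_avro_decimal(value: Decimal, scale: int = 0) -> bytes:
    """Encode a value the way Avro's bytes/decimal logical type does:
    the unscaled integer as two's-complement big-endian bytes."""
    unscaled = int(value.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)

# A CLIENT_ID like 1001 becomes two raw bytes, not the string "1001":
print(encode_avro_decimal(Decimal(1001)))  # b'\x03\xe9'
```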
I have also tried to reproduce this on my local machine and see the same behaviour:
```
hdfsparquetimport --upsert false --srcPath /Users/user1/hudi-res/parquet_data/ --targetPath /Users/user1/hudi-res/hudi_converted/ --tableName BOOTSTRAPTEST --tableType COPY_ON_WRITE --rowKeyField ID --partitionPathField CLIENT_ID --parallelism 50 --schemaFilePath /Users/user1/hudi-res/schema/test.avsc --format parquet --sparkMemory 2G --retry 1 --sparkMaster local
```
**Expected behavior**
What I am attempting is very basic functionality of the `hdfsparquetimport` tool: converting table files in Parquet format to Hudi format. Please let me know if something is wrong with my config/command.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.2.0
* Storage : S3
--
This is an automated message from the Apache Git Service.