lucprosa opened a new issue #2303:
URL: https://github.com/apache/hudi/issues/2303
Hi all,
We're trying to use the HDFSParquetImporter to import data from parquet
(source) to Hudi (target) on S3, but we're facing some performance problems.
Here are some important considerations about our data:
- Parquet size on S3: 536.8 GiB
- Parquet count: 8785
- Total rows: > 8 billion
- Not partitioned
Our partitioning scheme on Hudi:
- We're using multi-level partitions: organization/year/month/day
- We have thousands of organizations
Our data schema in AVRO format:
```json
{
  "type": "record",
  "name": "UsageFact",
  "doc": "Usage Fact",
  "fields": [
    {"name": "sk_usage_id", "type": "string"},
    {"name": "sk_comm_capability_id", "type": "string"},
    {"name": "time", "type": "string"},
    {"name": "mt_load_time", "type": "string"},
    {"name": "direction", "type": "string"},
    {"name": "channel", "type": "string"},
    {"name": "provider", "type": "string"},
    {"name": "metric", "type": "string"},
    {"name": "sk_comm_capability_name", "type": "string"},
    {"name": "sk_operation_id", "type": "string"},
    {"name": "sk_operation_name", "type": "string"},
    {"name": "country", "type": "string"},
    {"name": "subcategory", "type": "string"},
    {"name": "category", "type": "string"},
    {"name": "quantity", "type": "int"},
    {"name": "partition_path", "type": "string"}
  ]
}
```
The column "partition_path" is the definition of our partition rules (for
example: "organization=ABC/year=2020/month=01/day=01").
So we're trying to execute HDFSParquetImporter:
```
hdfsparquetimport --upsert false --srcPath "[PARQUET_SOURCE_PATH]" --targetPath "[HUDI_TARGET_PATH]" --tableName [TABLE_NAME] --tableType COPY_ON_WRITE --rowKeyField [ROW_IDENTIFIER] --partitionPathField "partition_path" --parallelism 5000 --schemaFilePath "[AVRO SCHEMA]" --format parquet --sparkMemory 20g --retry 3
```
The problem:
In our first test, the importer took 33 minutes to import 20 million rows, so
we're concerned about using this importer to ingest all 8 billion rows. We
tried changing some of the performance arguments (sparkMemory and parallelism),
but without any meaningful improvement. The Spark job created by the importer
only uses about 30% of the cluster resources, and we just can't get the job to
use more of the cluster.
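For context, when we drive the write through Spark ourselves (as in the bulk-insert attempt below) we size the session roughly as follows; the numbers are illustrative for our cluster, and these settings must be in place before the SparkContext starts (or be passed via spark-submit):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-bulk-import")
  .config("spark.executor.instances", "50")  // illustrative: spread work across the cluster
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "20g")
  .config("spark.executor.memoryOverhead", "4g")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Hudi recommends Kryo
  .getOrCreate()
```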
We already tried using Spark with BULK_INSERT to write to Hudi, but the
performance was worse than the importer's (more than one hour for one million
rows).
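For reference, that bulk-insert attempt looked roughly like this (a sketch using the standard Hudi datasource options; placeholders match the importer command above, and `withPartitionPath` is the DataFrame from the earlier sketch):

```scala
import org.apache.spark.sql.SaveMode

withPartitionPath.write.format("hudi")
  .option("hoodie.table.name", "[TABLE_NAME]")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "[ROW_IDENTIFIER]")
  .option("hoodie.datasource.write.partitionpath.field", "partition_path")
  .option("hoodie.bulkinsert.shuffle.parallelism", "5000") // matches --parallelism above
  .mode(SaveMode.Append)
  .save("[HUDI_TARGET_PATH]")
```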
So our questions are:
- How can we tune this importer? How can we allocate more resources to this
job on YARN?
- Can we run multiple importers in parallel against the same Hudi table using
bulk mode?
- Are there other good alternatives for importing a large amount of data into
Hudi? What is the best option in terms of performance?