lucprosa opened a new issue #2303:
URL: https://github.com/apache/hudi/issues/2303


   Hi all,
We're trying to use the HDFSParquetImporter to import data from Parquet (source) into a Hudi table (target) on S3, but we're running into performance problems.
   
   Here are some important considerations about our data:
   - Parquet size on S3: 536.8 GiB
   - Parquet count: 8785
   - Total rows: more than 8 billion
   - Not partitioned
   
   Our partitioning rules on Hudi:
   - We're using multi-level partitioning: organization/year/month/day
   - We have thousands of organizations
   
   Our data schema in AVRO format:
   ```json
   {
     "type": "record",
     "name": "UsageFact",
     "doc": "Usage Fact",
     "fields": [
       {
         "name": "sk_usage_id",
         "type": "string"
       },
       {
         "name": "sk_comm_capability_id",
         "type": "string"
       },
       {
         "name": "time",
         "type": "string"
       },
       {
         "name": "mt_load_time",
         "type": "string"
       },
       {
         "name": "direction",
         "type": "string"
       },
       {
         "name": "channel",
         "type": "string"
       },
       {
         "name": "provider",
         "type": "string"
       },
       {
         "name": "metric",
         "type": "string"
       },
       {
         "name": "sk_comm_capability_name",
         "type": "string"
       },
       {
         "name": "sk_operation_id",
         "type": "string"
       },
       {
         "name": "sk_operation_name",
         "type": "string"
       },
       {
         "name": "country",
         "type": "string"
       },
       {
         "name": "subcategory",
         "type": "string"
       },
       {
         "name": "category",
         "type": "string"
       },
       {
         "name": "quantity",
         "type": "int"
       },
       {
         "name": "partition_path",
         "type": "string"
       }
     ]
   }
   ```
   
   The column "partition_path" is the definition of our partition rules (for 
example: "organization=ABC/year=2020/month=01/day=01").
   
   So we're trying to execute the HDFSParquetImporter:
   
   ```
   hdfsparquetimport --upsert false --srcPath "[PARQUET_SOURCE_PATH]" --targetPath "[HUDI_TARGET_PATH]" --tableName [TABLE_NAME] --tableType COPY_ON_WRITE --rowKeyField [ROW_IDENTIFIER] --partitionPathField "partition_path" --parallelism 5000 --schemaFilePath "[AVRO SCHEMA]" --format parquet --sparkMemory 20g --retry 3
   ```
   
   The problem:
   
   In our first test, the importer took 33 minutes to import 20 million rows, so we're concerned about using it to ingest all 8 billion rows.
   We tried changing some performance arguments (sparkMemory and parallelism) but saw no improvement. The Spark job created by the importer only uses about 30% of the cluster resources, and we just can't make it use more of our cluster.
   
   We already tried using Spark with BULK_INSERT to write to Hudi, but the performance was even worse than with the importer (one million rows in more than one hour).
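
   For reference, the datasource write looked roughly like this. It's only a sketch: the paths and fields are the placeholders from the importer command above, the precombine field (`mt_load_time`) is just one candidate from our schema, and option keys may vary slightly across Hudi versions:

   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}

   val spark = SparkSession.builder().appName("hudi-bulk-insert").getOrCreate()

   // Read the source Parquet files from S3.
   val source = spark.read.parquet("[PARQUET_SOURCE_PATH]")

   // Bulk insert into a COPY_ON_WRITE table, partitioned by partition_path.
   // [ROW_IDENTIFIER] is a placeholder; mt_load_time as precombine field is an assumption.
   source.write
     .format("hudi")
     .option("hoodie.table.name", "[TABLE_NAME]")
     .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
     .option("hoodie.datasource.write.operation", "bulk_insert")
     .option("hoodie.datasource.write.recordkey.field", "[ROW_IDENTIFIER]")
     .option("hoodie.datasource.write.partitionpath.field", "partition_path")
     .option("hoodie.datasource.write.precombine.field", "mt_load_time")
     .option("hoodie.bulkinsert.shuffle.parallelism", "5000")
     .mode(SaveMode.Overwrite)
     .save("[HUDI_TARGET_PATH]")
   ```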
   
   So our questions are:
   
   - How can we tune this importer? How can we allocate more resources to this job on YARN?
   - Can we run multiple parallel importers on the same Hudi table using bulk 
mode?
   - Are there other good alternatives for importing a large amount of data into Hudi? What is the best option in terms of performance?
   
   
   

