rubenssoto edited a comment on issue #4135:
URL: https://github.com/apache/hudi/issues/4135#issuecomment-983072106


   Hello @xiarixiaoyao and @vinothchandar 
   
   I am testing clustering for a demonstration, and I decided to reproduce Vinoth's test on my cluster.
   
   Machine type: r5.4xlarge
   Number of nodes: 8
   
   Same dataset as Vinoth
   s3://amazon-reviews-pds/parquet/
   
   hudi_options = {
       'hoodie.table.name': 'amazon_reviews_hudi',
       'hoodie.datasource.write.recordkey.field': 'review_id',
       'hoodie.datasource.write.precombine.field': 'review_id',
       'hoodie.datasource.write.table.name': 'amazon_reviews_hudi',
       'hoodie.datasource.write.operation': 'bulk_insert',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.parquet.small.file.limit': '805306368',
       'hoodie.parquet.max.file.limit': '1073741824',
       'hoodie.parquet.block.size': '805306368',
       'hoodie.metadata.enable': 'true',
       "hoodie.bulk_insert.shuffle.parallelism": "30",
       "hoodie.clustering.inline": "true",
       "hoodie.clustering.inline.max.commits": "1",
       "hoodie.layout.optimize.enable":"true",
       "hoodie.clustering.plan.strategy.sort.columns": "product_id,customer_id,review_date"
   }
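   
   For context, a minimal sketch of how an options dict like the one above is typically passed to the Spark DataFrame writer. The helper name `write_hudi_table`, the `base_path` argument, and the `overwrite` save mode are assumptions for illustration; they are not taken from the job described here.

```python
def write_hudi_table(df, options, base_path, mode="overwrite"):
    """Write `df` as a Hudi table.

    `df` is expected to be a Spark DataFrame; `options` is a dict of
    Hudi settings such as `hudi_options` above. `base_path` is a
    hypothetical target location chosen by the caller.
    """
    (
        df.write.format("hudi")   # select the Hudi data source
        .options(**options)       # apply every key/value pair from the dict
        .mode(mode)               # assumed save mode, adjust as needed
        .save(base_path)          # triggers the actual write
    )
```

   With a Spark session that has the Hudi bundle on its classpath, this would be called as `write_hudi_table(df, hudi_options, base_path)`.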
   
   The job took 35 minutes.
   
   <img width="1790" alt="Captura de Tela 2021-11-30 às 19 08 24" src="https://user-images.githubusercontent.com/36298331/144137443-e6825a37-7915-4db6-81d0-45c62bd3ac95.png">
   
   <img width="1792" alt="Captura de Tela 2021-11-30 às 19 44 39" src="https://user-images.githubusercontent.com/36298331/144140247-903f9884-c2b7-4026-89c8-5562aa7bb7d2.png">
   
   I think the Z-order jobs are not running in parallel.
   
   Would it be possible to launch these jobs
   (createOptimizedDataFrameByMapValue at RDDSpatialCurveOptimizationSortPartitioner.java:83)
   in different threads, for better resource utilization?
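   
   To illustrate the idea, here is a minimal sketch of submitting independent jobs from separate driver threads, which Spark's scheduler supports. This is generic Python, not Hudi's Java code or a proposed patch; `run_jobs_concurrently`, `job_fn`, and `max_workers` are hypothetical names, with `job_fn` standing in for whatever sorts a single clustering group.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs_concurrently(job_fn, inputs, max_workers=4):
    """Call job_fn(item) for each item on a thread pool.

    Results are returned in the same order as `inputs`. If any call
    raises, the exception is re-raised when results are collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map schedules all calls up front and yields results in input order
        return list(pool.map(job_fn, inputs))
```

   Each thread would trigger its own Spark action, so the executors could work on several groups at once instead of one job at a time.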



