gnailJC opened a new issue, #2385:
URL: https://github.com/apache/incubator-paimon/issues/2385

   ### Search before asking
   
- [X] I searched in the [issues](https://github.com/apache/incubator-paimon/issues) and found nothing similar.
   
   
   ### Paimon version
   
   
[paimon-spark-3.3-0.6-20231122.093342-69.jar](https://repository.apache.org/content/groups/snapshots/org/apache/paimon/paimon-spark-3.3/0.6-SNAPSHOT/paimon-spark-3.3-0.6-20231122.093342-69.jar)
   
   ### Compute Engine
   
   
https://www.apache.org/dyn/closer.lua/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
   
   ### Minimal reproduce step
   
   ```python
    # Create a Paimon table in dynamic bucket mode ('bucket' = '-1').
    spark.sql("""
            CREATE TABLE IF NOT EXISTS paimon.tdr.ods_tbl (
              _id STRING NOT NULL,
              update_time TIMESTAMP)
            USING paimon
            TBLPROPERTIES (
              'bucket' = '-1',
              'dynamic-bucket.assigner-parallelism' = '32',
              'dynamic-bucket.target-row-num' = '2000000',
              'merge-engine' = 'partial-update',
              'path' = '',
              'primary-key' = '_id',
              'tag.creation-period' = 'daily',
              'tag.num-retained-max' = '30',
              'write.merge-schema' = 'true',
              'write.merge-schema.explicit-cast' = 'true')
       """).show()

    # Reuse the Paimon table's schema for the MongoDB read.
    schema = spark.sql('select * from paimon.tdr.ods_tbl limit 0').schema

    # Full read of the source collection (connection details elided).
    full_data_reader = (
        spark.read
        .format('mongodb')
        .schema(schema)
        .option('database', '')
        .option('collection', '')
        .option('connection.uri', '')
    )
    full_data_df = full_data_reader.load()

    # Append the full data set into the Paimon table.
    full_data_writer = (
        full_data_df.write
        .option('write-buffer-size', '256MB')
        .option('target-file-size', '256MB')
        .option('num-sorted-run.stop-trigger', '2147483647')
        .option('sort-spill-threshold', '2')
        .option('write-buffer-spillable', 'true')
    )
    full_data_writer.save(
       'oss://**', format='paimon', mode='append'
    )
   ```
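
   To rule out the table options simply not taking effect, the properties can be double-checked after creation (plain Spark SQL; table name as in the setup above):

   ```python
    # Verify that the bucket-related properties were actually applied.
    spark.sql('SHOW TBLPROPERTIES paimon.tdr.ods_tbl').show(truncate=False)
   ```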
   
   `full_data_df.count()` is almost 20 million.
   
   However, 200+ buckets were generated in the end, not 32 (`dynamic-bucket.assigner-parallelism`) or ~10 (20 million rows / `dynamic-bucket.target-row-num` of 2 million).
   
   Is this as expected?
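
   For reference, here is roughly how I counted the buckets, via Paimon's `$files` system table (a sketch; catalog and table names follow the setup above):

   ```python
    # Count distinct buckets written so far (the $files system table
    # exposes one row per data file, including its bucket).
    spark.sql("""
        SELECT COUNT(DISTINCT bucket) AS bucket_count
        FROM paimon.tdr.`ods_tbl$files`
    """).show()
   ```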
   
   ### What doesn't meet your expectations?
   
   I expect `dynamic-bucket.assigner-parallelism` and `dynamic-bucket.target-row-num` to control the number of buckets during dynamic bucket initialization, but the observed bucket count (200+) matches neither.
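
   As a back-of-the-envelope check (my assumption about how the initial bucket count should be derived, not the documented formula):

   ```python
    import math

    total_rows = 20_000_000      # approximate full_data_df.count()
    target_row_num = 2_000_000   # dynamic-bucket.target-row-num
    assigner_parallelism = 32    # dynamic-bucket.assigner-parallelism

    # Expected bucket count by row volume alone: ceil(20M / 2M) = 10.
    expected_by_rows = math.ceil(total_rows / target_row_num)
    print(expected_by_rows, assigner_parallelism)  # 10 and 32; the 200+ observed matches neither
   ```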
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

