aznwarmonkey edited a comment on issue #4541:
URL: https://github.com/apache/hudi/issues/4541#issuecomment-1008283680
Hi,

After making the suggested changes mentioned above, I am getting this exception thrown immediately:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o154.save.
: org.apache.hudi.exception.HoodieException: Invalid value of Type.
	at org.apache.hudi.common.model.WriteOperationType.fromValue(WriteOperationType.java:86)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:83)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
```
The updated config looks like the following:
```python
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field': keys,
    'hoodie.datasource.write.partitionpath.field': ','.join(partitions),
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': timestamp_col,
    'hoodie.index.type': 'BLOOM',
    'hoodie.consistency.check.enabled': True,
    'hoodie.parquet.small.file.limit': 134217728,
    'hoodie.parquet.max.file.size': 1073741824,
    'write.bulk_insert.shuffle_by_partition': True,
    'hoodie.copyonwrite.insert.auto.split': False,
    'hoodie.copyonwrite.insert.split.size': 10000000,
    'hoodie.datasource.write.row.writer.enable': True,
    'hoodie.bulkinsert.sort.mode': 'PARTITION_SORT',
    'hoodie.bulkinsert.shuffle.parallelism': 50,
    'hoodie.cleaner.commits.retained': '2',
    'hoodie.clean.async': True,
}

df.write.format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'bulk_ingest') \
    .options(**hudi_options) \
    .mode('append') \
    .save(output_path)
```
If I change `hoodie.datasource.write.operation` to `upsert`, then I do not get the error. (I am currently rerunning with `upsert` as I type this to make sure the process fully completes.)
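For completeness, this is a minimal sketch of that rerun; the only change is the operation value, and `df`, `hudi_options`, and `output_path` are the same names defined in the snippet above:
```python
# Same DataFrame, options dict, and output path as above;
# only the write operation value is swapped to 'upsert'.
df.write.format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudi_options) \
    .mode('append') \
    .save(output_path)
```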
A couple of questions and what I am hoping to accomplish:
- For whatever reason, unless I set `'hoodie.datasource.write.row.writer.enable': True` with `bulk_insert`, the write operation is significantly slower, about 2x-3x slower.
- Is `write.parquet.block.size` in bytes or MB? I believe the default is 120: https://hudi.apache.org/docs/configurations/#writeparquetblocksize
- When writing, I noticed I am getting a bunch of small parquet files (~20-50 MB), which I am trying to avoid; the desired size is at least 128 MB. Tuning `hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size` seems to have no noticeable impact, which is why having clustering is important (the hope is that the next time it writes to the same partition, the small files get combined); see the clustering sketch after this list.
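For reference, this is the kind of inline clustering setup I am assuming would combine those small files; the option names are the standard Hudi clustering keys as I understand them, and the thresholds are placeholders I have not verified, not recommendations:
```python
# Hedged sketch: inline clustering options that, as I understand them, should
# merge small files on subsequent writes to the same partition.
# Thresholds below are placeholders, not tuned values.
clustering_options = {
    # Run clustering as part of the write (inline) every N commits.
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '2',
    # Files smaller than this (bytes, ~100 MB) are clustering candidates.
    'hoodie.clustering.plan.strategy.small.file.limit': str(100 * 1024 * 1024),
    # Target output file size after clustering (bytes, ~1 GB).
    'hoodie.clustering.plan.strategy.target.file.max.bytes': str(1024 * 1024 * 1024),
}

df.write.format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudi_options) \
    .options(**clustering_options) \
    .mode('append') \
    .save(output_path)
```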
**UPDATE**: I would like to report that the above code snippet worked using `upsert`.