tooptoop4 opened a new issue #1954:
URL: https://github.com/apache/hudi/issues/1954
I'm loading data from DMS and I don't want any partitions (I did not specify
`hoodie.datasource.hive_sync.partition_fields`, since the website says it can be
left at its empty default).
```
/home/ec2-user/spark_home/bin/spark-submit --conf
"spark.hadoop.fs.s3a.proxy.host=redact" --conf
"spark.hadoop.fs.s3a.proxy.port=redact" --conf
"spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" --conf
"spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars
"/home/ec2-user/spark-avro_2.11-2.4.6.jar" --master spark://redact:7077
--deploy-mode client /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar
--table-type COPY_ON_WRITE --source-ordering-field TimeCreated --source-class
org.apache.hudi.utilities.sources.ParquetDFSSource --enable-hive-sync
--hoodie-conf hoodie.datasource.hive_sync.database=redact --hoodie-conf
hoodie.datasource.hive_sync.table=dmstest_multpk4 --hoodie-conf
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
--hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false --target-base-path
s3a://redact/my2/multpk4 --target-table dmstest_multpk4 --transformer-class
org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
--hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company
--hoodie-conf hoodie.datasource.write.partitionpath.field=sys_user
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tblhere >
multpk4.log
```
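For context: the command above writes with `hoodie.datasource.write.partitionpath.field=sys_user`, so the data on storage *is* partitioned by `sys_user` even though `hive_sync.partition_fields` was left empty, and that mismatch is what the sync later trips over. A sketch of what a truly non-partitioned config might look like instead (the `NonpartitionedKeyGenerator` / `NonPartitionedExtractor` class names are assumed to exist in the 0.5.3 bundle used here; verify against your jar):

```shell
# Sketch only: possible non-partitioned configuration (class names assumed,
# not confirmed against the 0.5.3 bundle -- please verify).
# Drop hoodie.datasource.write.partitionpath.field entirely, and use:
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
# Or, to keep the sys_user partitioning and tell Hive sync about it:
# --hoodie-conf hoodie.datasource.hive_sync.partition_fields=sys_user
```

Either direction resolves the mismatch: both sides agree on zero partition columns, or both agree on `sys_user`.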
```
2020-08-12 11:31:11,186 [main] INFO
org.apache.hudi.client.AbstractHoodieWriteClient - Committed 20200812112840
2020-08-12 11:31:11,189 [main] INFO
org.apache.hudi.utilities.deltastreamer.DeltaSync - Commit 20200812112840
successful!
2020-08-12 11:31:11,194 [main] INFO
org.apache.hudi.utilities.deltastreamer.DeltaSync - Syncing target hoodie table
with hive table(dmstest_multpk4). Hive metastore URL
:jdbc:hive2://localhost:10000, basePath :s3a://redact/my2/multpk4
2020-08-12 11:31:11,960 [main] INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants
[[20200812112840__commit__COMPLETED]]
2020-08-12 11:31:14,264 [main] INFO org.apache.hudi.hive.HiveSyncTool -
Trying to sync hoodie table dmstest_multpk4 with base path
s3a://redact/my2/multpk4 of type COPY_ON_WRITE
2020-08-12 11:31:14,707 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Reading schema from
s3a://redact/my2/multpk4/mpark2/7ed7627c-6110-4d42-9df2-f3a6afe877df-0_187-25-15737_20200812112840.parquet
2020-08-12 11:31:15,330 [main] INFO org.apache.hudi.hive.HiveSyncTool -
Hive table dmstest_multpk4 is not found. Creating it
2020-08-12 11:31:15,337 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS
`redact`.`dmstest_multpk4`( `_hoodie_commit_time` string,
`_hoodie_commit_seqno` string, `_hoodie_record_key` string,
`_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id`
int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname`
string, `org_mnem` string, `org_parent` int, `percent_holding` double,
`group_company` string, `grp_ord_for_cln` string, `mkt_only` string,
`pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string,
`alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes`
string, `active` string, `version_no` int, `sys_date` bigint, `sys_user`
string, `create_date` bigint, `cntry_of_dom` string, `client` string,
`alert_acronym` string, `oneoff_client` string, `booking_domicile` string,
`booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW
FORMAT
SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS
INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION
's3a://redact/my2/multpk4'
2020-08-12 11:31:15,411 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Time taken to start SessionState and create Driver: 74 ms
2020-08-12 11:31:15,444 [main] INFO hive.ql.parse.ParseDriver - Parsing
command: CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk4`(
`_hoodie_commit_time` string, `_hoodie_commit_seqno` string,
`_hoodie_record_key` string, `_hoodie_partition_path` string,
`_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int,
`org_id` int, `org_name` string, `org_sname` string, `org_mnem` string,
`org_parent` int, `percent_holding` double, `group_company` string,
`grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string,
`show_shapes` string, `sec_code_pref` string, `alert_org_ref` string,
`swift_bic` string, `exec_breakdown` string, `notes` string, `active` string,
`version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint,
`cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client`
string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated`
bigint, `UserCreated` string) ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION
's3a://redact/my2/multpk4'
2020-08-12 11:31:16,131 [main] INFO hive.ql.parse.ParseDriver - Parse
Completed
2020-08-12 11:31:16,568 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Time taken to execute [CREATE EXTERNAL TABLE IF NOT EXISTS
`redact`.`dmstest_multpk4`( `_hoodie_commit_time` string,
`_hoodie_commit_seqno` string, `_hoodie_record_key` string,
`_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id`
int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname`
string, `org_mnem` string, `org_parent` int, `percent_holding` double,
`group_company` string, `grp_ord_for_cln` string, `mkt_only` string,
`pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string,
`alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes`
string, `active` string, `version_no` int, `sys_date` bigint, `sys_user`
string, `create_date` bigint, `cntry_of_dom` string, `client` string,
`alert_acronym` string, `oneoff_client` string, `booking_domicile` string,
`booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT
SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED
AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION
's3a://redact/my2/multpk4']: 1157 ms
2020-08-12 11:31:16,574 [main] INFO org.apache.hudi.hive.HiveSyncTool -
Schema sync complete. Syncing partitions for dmstest_multpk4
2020-08-12 11:31:16,574 [main] INFO org.apache.hudi.hive.HiveSyncTool -
Last commit time synced was found to be null
2020-08-12 11:31:16,575 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Last commit time synced is not known, listing all partitions in
s3a://redact/my2/multpk4,FS :S3AFileSystem{uri=s3a://redact,
workingDir=s3a://redact/user/ec2-user, inputPolicy=normal, partSize=104857600,
enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536,
blockSize=33554432, multiPartThreshold=2147483647,
serverSideEncryptionAlgorithm='AES256',
blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@62765aec,
boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=2405,
available=2405, waiting=0}, activeCount=0},
unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@6f5bd362[Running,
pool size = 6, active threads = 0, queued tasks = 0, completed tasks = 6],
statistics {761530 bytes read, 320081 bytes written, 712 read ops, 0 large read
ops, 31 write ops}, metrics {{Context=S3AFileSystem}
{FileSystemId=db54a51b-e05e-4b3c-9140-240762a0c03d-redact}
{fsURI=s3a://redact/redact/sparkevents} {files_created=5} {files_copied=0}
{files_copied_bytes=0} {files_deleted=271} {fake_directories_deleted=0}
{directories_created=6} {directories_deleted=0} {ignored_errors=4}
{op_copy_from_local_file=0} {op_exists=53} {op_get_file_status=415}
{op_glob_status=0} {op_is_directory=38} {op_is_file=0} {op_list_files=271}
{op_list_located_status=0} {op_list_status=19} {op_mkdirs=5} {op_rename=0}
{object_copy_requests=0} {object_delete_requests=5} {object_list_requests=680}
{object_continue_list_requests=0} {object_metadata_requests=805}
{object_multipart_aborted=0} {object_put_bytes=320081} {object_put_requests=10}
{object_put_requests_completed=10} {stream_write_failures=0}
{stream_write_block_uploads=0} {stream_write_block_uploads_committed=0}
{stream_write_block_uploads_aborted=0} {stream_write_total_time=0}
{stream_write_total_data=320081} {object_put_requests_active=0}
{object_put_bytes_pending=0} {stream_write_block_uploads_active=0}
{stream_write_block_uploads_pending=4} {stream_write_block_uploads_data_pending=0}
{stream_read_fully_operations=0} {stream_opened=22}
{stream_bytes_skipped_on_seek=0} {stream_closed=22}
{stream_bytes_backwards_on_seek=437965} {stream_bytes_read=761530}
{stream_read_operations_incomplete=107} {stream_bytes_discarded_in_abort=0}
{stream_close_operations=22} {stream_read_operations=3020} {stream_aborted=0}
{stream_forward_seek_operations=0} {stream_backward_seek_operations=1}
{stream_seek_operations=1} {stream_bytes_read_in_close=8}
{stream_read_exceptions=0} }}
2020-08-12 11:31:34,438 [main] INFO org.apache.hudi.hive.HiveSyncTool -
Storage partitions scan complete. Found 271
2020-08-12 11:31:34,476 [main] INFO org.apache.hudi.hive.HiveSyncTool - New
Partitions [AAB, redactlist]
2020-08-12 11:31:34,476 [main] INFO org.apache.hudi.hive.HoodieHiveClient -
Adding partitions 271 to table dmstest_multpk4
2020-08-12 11:31:34,477 [main] ERROR org.apache.hudi.hive.HiveSyncTool - Got
runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table dmstest_multpk4
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:187)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:126)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:460)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:402)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:235)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:123)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:380)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values [AAB]. Check partition strategy.
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
	at org.apache.hudi.hive.HoodieHiveClient.getPartitionClause(HoodieHiveClient.java:182)
	at org.apache.hudi.hive.HoodieHiveClient.constructAddPartitions(HoodieHiveClient.java:166)
	at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:141)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:182)
	... 19 more
2020-08-12 11:31:34,513 [main] INFO
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Shut down
deltastreamer
2020-08-12 11:31:34,535 [main] INFO
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down
all executors
```
```
aws s3 ls s3://redact/my2/multpk4/
PRE .hoodie/
PRE AAB/
PRE CC/
PRE DD/
...etc
```
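The `Partition key parts [] does not match with partition values [AAB]` failure happens because the sync pairs each configured Hive partition field with one folder level of the storage path; with `partition_fields` empty but folders like `AAB/` present (see the listing above), the counts disagree. A minimal Python sketch of that check (illustrative names only, not Hudi's actual code):

```python
# Sketch of the partition-clause validation that fails in the log above
# (mirrors the ValidationUtils.checkArgument call in HoodieHiveClient;
# function and variable names here are hypothetical).

def partition_clause(partition_fields, partition_path):
    """Pair each configured Hive partition field with one path segment."""
    values = partition_path.split("/")
    if len(partition_fields) != len(values):
        raise ValueError(
            f"Partition key parts {partition_fields} does not match "
            f"with partition values [{', '.join(values)}]. Check partition strategy."
        )
    return ", ".join(f"`{k}`='{v}'" for k, v in zip(partition_fields, values))

# Zero configured fields vs. a one-level path like "AAB" -> the same error:
try:
    partition_clause([], "AAB")
except ValueError as e:
    print(e)  # Partition key parts [] does not match with partition values [AAB]. Check partition strategy.

# With the field declared, the same path produces a valid clause:
print(partition_clause(["sys_user"], "AAB"))  # `sys_user`='AAB'
```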
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]