Hi Nishith and All, I think I figured out what was blocking the processing of files for me. The code snippet that I had sent before is indeed the issue. I have raised a new issue and added a few details at: https://github.com/apache/incubator-hudi/issues/776 (https://link.getmailspring.com/link/[email protected]/0?redirect=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-hudi%2Fissues%2F776&recipient=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D) Can someone please have a look and advise what is going wrong?
Thank you, Kabeer. On Jul 3 2019, at 11:38 am, Kabeer Ahmed <[email protected]> wrote: > Hi Nishith, > > Please find the latest commit data in the gist at: > https://gist.github.com/smdahmed/5d811cb4833243a11ac09b9dc61e5b4d > (https://link.getmailspring.com/link/[email protected]/1?redirect=https%3A%2F%2Fgist.github.com%2Fsmdahmed%2F5d811cb4833243a11ac09b9dc61e5b4d&recipient=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D) > For your convenience, I have also copy pasted it below. Any help is highly > appreciated. > > Thanks > Kabeer. > > 20190702161629.commit: > { > "partitionToWriteStats" : { > "2018/05/30" : [ { > "fileId" : "39cff0df-24e4-45b8-bff5-9b4f41c4096a", > "path" : > "2018/05/30/39cff0df-24e4-45b8-bff5-9b4f41c4096a_0_20190702161629.parquet", > "prevCommit" : "20190702161417", > "numWrites" : 11614, > "numDeletes" : 0, > "numUpdateWrites" : 5, > "numInserts" : 3, > "totalWriteBytes" : 848480, > "totalWriteErrors" : 0, > "tempPath" : null, > "partitionPath" : "2018/05/30", > "totalLogRecords" : 0, > "totalLogFilesCompacted" : 0, > "totalLogSizeCompacted" : 0, > "totalUpdatedRecordsCompacted" : 0, > "totalLogBlocks" : 0, > "totalCorruptLogBlock" : 0, > "totalRollbackBlocks" : 0 > } ], > "2018/05/31" : [ { > "fileId" : "4f5514e8-d57c-4c6e-be8f-c3448051c956", > "path" : > "2018/05/31/4f5514e8-d57c-4c6e-be8f-c3448051c956_1_20190702161629.parquet", > "prevCommit" : "null", > "numWrites" : 10430, > "numDeletes" : 0, > "numUpdateWrites" : 0, > "numInserts" : 10430, > "totalWriteBytes" : 820723, > "totalWriteErrors" : 0, > "tempPath" : null, > "partitionPath" : "2018/05/31", > "totalLogRecords" : 0, > "totalLogFilesCompacted" : 0, > "totalLogSizeCompacted" : 0, > "totalUpdatedRecordsCompacted" : 0, > "totalLogBlocks" : 0, > "totalCorruptLogBlock" : 0, > "totalRollbackBlocks" : 0 > } ] > }, > "compacted" : false, > "extraMetadataMap" : { > "ROLLING_STAT" : "{\n \"partitionToRollingStats\" : {\n \"2018/05/29\" : {\n > \"235bd794-790b-48e7-b9ea-956149db1dce\" : {\n \"fileId\" : > \"235bd794-790b-48e7-b9ea-956149db1dce\",\n \"inserts\" : 2,\n \"upserts\" : > 0,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 443797\n }\n },\n \"2018/05/30\" : {\n > \"39cff0df-24e4-45b8-bff5-9b4f41c4096a\" : {\n \"fileId\" : > \"39cff0df-24e4-45b8-bff5-9b4f41c4096a\",\n \"inserts\" : 23220,\n > \"upserts\" : 5,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 848282\n }\n },\n \"2018/05/31\" : {\n > \"4f5514e8-d57c-4c6e-be8f-c3448051c956\" : {\n \"fileId\" : > \"4f5514e8-d57c-4c6e-be8f-c3448051c956\",\n \"inserts\" : 10430,\n > \"upserts\" : 0,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 820723\n }\n }\n },\n \"actionType\" : > \"commit\"\n}" > }, > "extraMetadata" : { > "ROLLING_STAT" : "{\n \"partitionToRollingStats\" : {\n \"2018/05/29\" : {\n > \"235bd794-790b-48e7-b9ea-956149db1dce\" : {\n \"fileId\" : > \"235bd794-790b-48e7-b9ea-956149db1dce\",\n \"inserts\" : 2,\n \"upserts\" : > 0,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 443797\n }\n },\n \"2018/05/30\" : {\n > \"39cff0df-24e4-45b8-bff5-9b4f41c4096a\" : {\n \"fileId\" : > \"39cff0df-24e4-45b8-bff5-9b4f41c4096a\",\n \"inserts\" : 23220,\n > \"upserts\" : 5,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 848282\n }\n },\n \"2018/05/31\" : {\n > \"4f5514e8-d57c-4c6e-be8f-c3448051c956\" : {\n \"fileId\" : > \"4f5514e8-d57c-4c6e-be8f-c3448051c956\",\n \"inserts\" : 10430,\n > \"upserts\" : 0,\n \"deletes\" : 0,\n \"totalInputWriteBytesToDisk\" : 0,\n > \"totalInputWriteBytesOnDisk\" : 820723\n }\n }\n },\n \"actionType\" : > \"commit\"\n}" > }, > "totalScanTime" : 0, > "totalCreateTime" : 2439, > "totalUpsertTime" : 2450, > "totalCompactedRecordsUpdated" : 0, > "totalLogFilesCompacted" : 0, > "totalLogFilesSize" : 0, > "fileIdAndRelativePaths" : { > "4f5514e8-d57c-4c6e-be8f-c3448051c956" : > "2018/05/31/4f5514e8-d57c-4c6e-be8f-c3448051c956_1_20190702161629.parquet", > "39cff0df-24e4-45b8-bff5-9b4f41c4096a" : > "2018/05/30/39cff0df-24e4-45b8-bff5-9b4f41c4096a_0_20190702161629.parquet" > }, > "totalRecordsDeleted" : 0, > "totalLogRecordsCompacted" : 0 > } > > 20190702161750.clean: > Objavro.schema > {"type":"record","name":"HoodieCleanMetadata","namespace":"com.uber.hoodie.avro.model","fields":[{"name":"startCleanTime","type":{"type":"string","avro.java.string":"String"}},{"name":"timeTakenInMillis","type":"long"},{"name":"totalFilesDeleted","type":"int"},{"name":"earliestCommitToRetain","type":{"type":"string","avro.java.string":"String"}},{"name":"partitionMetadata","type":{"type":"map","values":{"type":"record","name":"HoodieCleanPartitionMetadata","fields":[{"name":"partitionPath","type":{"type":"string","avro.java.string":"String"}},{"name":"policy","type":{"type":"string","avro.java.string":"String"}},{"name":"deletePathPatterns","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"successDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}},{"name":"failedDeleteFiles","type":{"type":"array","items":{"type":"string","avro.java.string":"String"}}}]},"avro.java.string":"String"}}]} > > 20190702161847.inflight: > { > "partitionToWriteStats" : { > "2018/05/31" : [ { > "fileId" : "4f5514e8-d57c-4c6e-be8f-c3448051c956", > "path" : null, > "prevCommit" : "20190702161629", > "numWrites" : 0, > "numDeletes" : 0, > "numUpdateWrites" : 2, > "numInserts" : 0, > "totalWriteBytes" : 0, > "totalWriteErrors" : 0, > "tempPath" : null, > "partitionPath" : null, > "totalLogRecords" : 0, > "totalLogFilesCompacted" : 0, > "totalLogSizeCompacted" : 0, > "totalUpdatedRecordsCompacted" : 0, > "totalLogBlocks" : 0, > "totalCorruptLogBlock" : 0, > "totalRollbackBlocks" : 0 > } ] > }, > "compacted" : false, > "extraMetadataMap" : { }, > "totalScanTime" : 0, > "totalCreateTime" : 0, > "totalUpsertTime" : 0, > "totalCompactedRecordsUpdated" : 0, > "totalLogFilesCompacted" : 0, > "totalLogFilesSize" : 0, > "extraMetadata" : { }, > "fileIdAndRelativePaths" : { > "4f5514e8-d57c-4c6e-be8f-c3448051c956" : null > }, > "totalRecordsDeleted" : 0, > "totalLogRecordsCompacted" : 0 > } > > 20190702162055.inflight: > > { > "partitionToWriteStats" : { }, > "compacted" : false, > "extraMetadataMap" : { }, > "totalRecordsDeleted" : 0, > "totalLogRecordsCompacted" : 0, > "totalScanTime" : 0, > "totalCreateTime" : 0, > "totalUpsertTime" : 0, > "totalCompactedRecordsUpdated" : 0, > "totalLogFilesCompacted" : 0, > "totalLogFilesSize" : 0, > "fileIdAndRelativePaths" : { }, > "extraMetadata" : { } > } > > On Jul 3 2019, at 8:30 am, Kabeer Ahmed <[email protected]> wrote: > > Hi Nishith, > > > > Thank you for the quick respnose. I shall try to send the commit metadata > > at the earliest. I hope the commit metadata you are looking for is the one > > within .hoodie/ directory and not the ones that is archived. > > And there are inflight and commit metadata. I am taking that you want to > > look into the one inflight. Shall revert back with further details. > > Thanks > > Kabeer. > > > > On Jul 3 2019, at 2:19 am, nishith agarwal <[email protected]> wrote: > > > Kabir, > > > > > > Could you share the content of your commit metadata ? You can list the > > > timeline, find the latest commit in the timeline, perform a cat and paste > > > the results (that you can share). > > > > > > Thanks, > > > Nishith > > > > > > On Tue, Jul 2, 2019 at 4:53 PM Kabeer Ahmed <[email protected]> wrote: > > > > Hi Vinoth and other HUDI Experts, > > > > I am stuck while processing inserts into HUDI. The process picks up CSV > > > > files and loads them into HUDI. The process seems to be stuck at: > > > > https://github.com/apache/incubator-hudi/blob/master/hoodie-client/src/main/java/com/uber/hoodie/table/HoodieCopyOnWriteTable.java#L679 > > > > Log is below: > > > > > > > > 2019-07-02 22:43:31,875 [main] INFO > > > > com.uber.hoodie.table.HoodieCopyOnWriteTable - AvgRecordSize => > > > > 9223372036854775807 > > > > 2019-07-02 22:43:31,969 [main] INFO > > > > com.uber.hoodie.table.HoodieCopyOnWriteTable - For partitionPath : > > > > 2018/05/30 Small Files => [SmallFile {location=HoodieRecordLocation > > > > {commitTime=20190702161750, > > > > fileId=39cff0df-24e4-45b8-bff5-9b4f41c4096a}, > > > > sizeBytes=435362}] > > > > 2019-07-02 22:43:31,969 [main] INFO > > > > com.uber.hoodie.table.HoodieCopyOnWriteTable - After small file > > > > assignment: > > > > unassignedInserts => 8, totalInsertBuckets => 2147483647, > > > > recordsPerBucket > > > > => 0 > > > > Looking at the last line in the log: "unassignedInserts => 8, > > > > totalInsertBuckets => 2147483647, recordsPerBucket => 0", this causes > > > > the > > > > below code to loop for quite long causing heap issues. > > > > > > > > logger.info( > > > > "After small file assignment: unassignedInserts => " + > > > > totalUnassignedInserts > > > > + ", totalInsertBuckets => " + insertBuckets + ", recordsPerBucket => " > > > > + insertRecordsPerBucket); > > > > for (int b = 0; b < insertBuckets; b++) { > > > > bucketNumbers.add(totalBuckets); > > > > recordsPerBucket.add(totalUnassignedInserts / insertBuckets); > > > > BucketInfo bucketInfo = new BucketInfo(); > > > > bucketInfo.bucketType = BucketType.INSERT; > > > > bucketInfoMap.put(totalBuckets, bucketInfo); > > > > totalBuckets++; > > > > } > > > > Has someone seen the issue? Do I need to file a bug or it is something > > > > to > > > > do with my misconfiguration? > > > > > > > > Any help is highly appreciated. > > > > Thanks > > > > Kabeer. > > > > > > > > > > > >
