MichaelTiemannOSC opened a new issue #4472:
URL: https://github.com/apache/iceberg/issues/4472


   We are seeing a problem that looks a lot like 
https://github.com/apache/iceberg/issues/4168
   
   > So essentially what we observed is that the Flink S3 FileIO failed to 
upload the data files due to file not exist, but the missing data file is 
incorrectly tracked in iceberg manifest. So the subsequent query against the 
partition will fail as the claimed data file cannot be found.
   > 
   > Our iceberg was setup on AWS S3 with versioned bucket and we can confirm 
that the
   > 
   > * data file never get uploaded, no version exists of given path
   > * the data file is being tracked in iceberg and we do see such non-exist 
data file in iceberg metadata query
   > 
   > This result in some nasty behavior where we need to reconcile the manifest 
state based on what exists in underlying file system, which is not possible. So 
we had to drop the partition as temporary work around. But I think it would be 
helpful to understand
   > 
   > why do we run into the issue where iceberg commits before s3 confirm the 
data file is uploaded?
   > 
   > In the meantime, we are trying to see if we can have a simple repro of the 
issue
   > 
   > CC @szehon-ho
   
   However, to my knowledge we are not using Flink.  We are using S3 and 
SQLAlchemy (our application is written in Python, and we do inserts via the 
pandas to_sql function with a callable).  A metadata file is shown below.  Note 
that the Avro files it references (such as 
snap-2224412739198497176-1-5d1a6f7d-eaff-48d9-b932-d6d27b3049eb.avro) do not 
exist in the S3 location, but the JSON metadata file referencing them does.  At 
present only this one table is affected; all other tables behave as expected.  
We are trying to understand both the source of this corruption and the remedy.
   
   ```
   {
     "format-version" : 1,
     "table-uuid" : "21eb737d-e5e5-4e85-b0ca-9ba96ce278e7",
     "location" : "s3a://osc-datacommons-s3-bucket-dev02/data/sandbox.db/dera_num",
     "last-updated-ms" : 1648842902547,
     "last-column-id" : 10,
     "schema" : {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ {
         "id" : 1,
         "name" : "adsh",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 2,
         "name" : "tag",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 3,
         "name" : "version",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 4,
         "name" : "coreg",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 5,
         "name" : "ddate",
         "required" : false,
         "type" : "timestamp"
       }, {
         "id" : 6,
         "name" : "qtrs",
         "required" : false,
         "type" : "int"
       }, {
         "id" : 7,
         "name" : "uom",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 8,
         "name" : "value",
         "required" : false,
         "type" : "double"
       }, {
         "id" : 9,
         "name" : "footnote",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 10,
         "name" : "srcdir",
         "required" : false,
         "type" : "string"
       } ]
     },
     "current-schema-id" : 0,
     "schemas" : [ {
       "type" : "struct",
       "schema-id" : 0,
       "fields" : [ {
         "id" : 1,
         "name" : "adsh",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 2,
         "name" : "tag",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 3,
         "name" : "version",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 4,
         "name" : "coreg",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 5,
         "name" : "ddate",
         "required" : false,
         "type" : "timestamp"
       }, {
         "id" : 6,
         "name" : "qtrs",
         "required" : false,
         "type" : "int"
       }, {
         "id" : 7,
         "name" : "uom",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 8,
         "name" : "value",
         "required" : false,
         "type" : "double"
       }, {
         "id" : 9,
         "name" : "footnote",
         "required" : false,
         "type" : "string"
       }, {
         "id" : 10,
         "name" : "srcdir",
         "required" : false,
         "type" : "string"
       } ]
     } ],
     "partition-spec" : [ {
       "name" : "srcdir",
       "transform" : "identity",
       "source-id" : 10,
       "field-id" : 1000
     } ],
     "default-spec-id" : 0,
     "partition-specs" : [ {
       "spec-id" : 0,
       "fields" : [ {
         "name" : "srcdir",
         "transform" : "identity",
         "source-id" : 10,
         "field-id" : 1000
       } ]
     } ],
     "last-partition-id" : 1000,
     "default-sort-order-id" : 0,
     "sort-orders" : [ {
       "order-id" : 0,
       "fields" : [ ]
     } ],
     "properties" : {
       "write.format.default" : "ORC"
     },
     "current-snapshot-id" : 2224412739198497176,
     "snapshots" : [ {
       "snapshot-id" : 2224412739198497176,
       "timestamp-ms" : 1648842902499,
       "summary" : {
         "operation" : "append",
         "changed-partition-count" : "0",
         "total-records" : "0",
         "total-files-size" : "0",
         "total-data-files" : "0",
         "total-delete-files" : "0",
         "total-position-deletes" : "0",
         "total-equality-deletes" : "0"
       },
       "manifest-list" : "s3a://osc-datacommons-s3-bucket-dev02/data/sandbox.db/dera_num/metadata/snap-2224412739198497176-1-5d1a6f7d-eaff-48d9-b932-d6d27b3049eb.avro",
       "schema-id" : 0
     } ],
     "snapshot-log" : [ {
       "timestamp-ms" : 1648842902499,
       "snapshot-id" : 2224412739198497176
     } ],
     "metadata-log" : [ ]
   }
   ```
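
For anyone trying to reproduce the check, here is a minimal sketch of how the dangling reference can be detected from a metadata file like the one above. It is an illustration, not part of our application: the function names and the `object_exists` callback are hypothetical, and it only looks at the `manifest-list` entry of each snapshot.

```python
import json
from urllib.parse import urlparse


def manifest_list_locations(metadata_text):
    """Return (snapshot_id, bucket, key) for each manifest list named in the metadata."""
    metadata = json.loads(metadata_text)
    locations = []
    for snapshot in metadata.get("snapshots", []):
        uri = snapshot.get("manifest-list")
        if uri is None:
            # Assumption: a snapshot without a manifest-list entry is skipped.
            continue
        parsed = urlparse(uri)  # e.g. s3a://bucket/path -> netloc="bucket", path="/path"
        locations.append(
            (snapshot["snapshot-id"], parsed.netloc, parsed.path.lstrip("/"))
        )
    return locations


def missing_manifest_lists(metadata_text, object_exists):
    """Return the entries for which object_exists(bucket, key) is false.

    object_exists is a hypothetical callback supplied by the caller; this
    function itself performs no S3 access.
    """
    return [
        entry
        for entry in manifest_list_locations(metadata_text)
        if not object_exists(entry[1], entry[2])
    ]
```

In practice `object_exists` could wrap an S3 existence check (for example a boto3 `head_object` call that treats a 404 error as "missing"); that wiring is left out here because it depends on the client and credentials in use.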
   
   We have only seen this problem with this one table, but it is consistently 
reproducible: it recurs each time we drop the table and re-create it under the 
same name.  A differently named table with the same schema and the same data 
load procedure works as expected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


