agaddis02 opened a new issue, #724:
URL: https://github.com/apache/iceberg-go/issues/724

   ### Feature Request / Improvement
   
   # Context
   
   If you want to write your own parquet files and use Iceberg only to manage 
the metadata, your main option (for the most part) is the `ReplaceDataFiles` 
function.
   
   This function takes a list of existing files and a list of new file paths 
whose data should replace them.
   
   This works fine for the most part, but the function performs a scan of the 
new files rather than taking your word that they match the table schema.
   
   This scan proves problematic when you are writing files very quickly and 
leveraging multipart uploads. You know the location of every file and that 
each one is a valid parquet file, yet the commit can still fail because at 
commit time a file might not be fully available yet.
   
   The error at commit time looks something like this: `failed to replace data 
files: error encountered during file conversion: parquet: could not read 8 
bytes from end of file`.
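   To illustrate why the scan fails on an in-flight upload: a parquet file 
ends with a 4-byte footer length followed by the magic bytes `PAR1`, and a 
reader must fetch those trailing 8 bytes before it can parse anything. The 
sketch below is a minimal, self-contained stand-in for that check (not 
iceberg-go's actual reader code) showing how a partially uploaded file 
produces exactly this class of error:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// checkParquetFooter mimics the first check a parquet reader performs when it
// opens a file: the last 8 bytes must hold a 4-byte little-endian footer
// length followed by the magic bytes "PAR1". While a multipart upload is
// still in flight those trailing bytes do not exist yet, so the read fails
// with the kind of error quoted above.
func checkParquetFooter(data []byte) error {
	if len(data) < 8 {
		return errors.New("parquet: could not read 8 bytes from end of file")
	}
	tail := data[len(data)-8:]
	if !bytes.Equal(tail[4:], []byte("PAR1")) {
		return fmt.Errorf("parquet: invalid magic %q at end of file", tail[4:])
	}
	// A real reader would now seek back by the footer length and parse the
	// Thrift metadata; here we only validate that the footer is present.
	_ = binary.LittleEndian.Uint32(tail[:4])
	return nil
}

func main() {
	// A fully uploaded file: header magic, data, footer length, footer magic.
	complete := []byte("PAR1....row-group-bytes....")
	complete = append(complete, 0x14, 0x00, 0x00, 0x00) // footer length = 20
	complete = append(complete, []byte("PAR1")...)
	fmt.Println(checkParquetFooter(complete)) // <nil>

	// A partially available multipart upload: the tail is simply missing.
	partial := []byte("PAR1..")
	fmt.Println(checkParquetFooter(partial))
}
```

   The point is that the failure has nothing to do with the file's 
correctness; it is purely a question of timing between the upload and the 
commit.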
   
   # Solution
   We have tested this in vendored code and opened a fork that adds a new 
function.
   
   `ReplaceDataFiles` scans your file paths to verify that the schema of each 
file matches the schema of the table they are being added to.
   
   We, and I would assume many people writing their own parquet files, don't 
need this. Our ingestion framework guarantees we will never produce an 
incorrect parquet file, and we have access to both our Parquet schema and our 
Arrow schema for the entirety of ingestion.
   
   Since I can build data files directly, I would much rather pass my own 
data files to this function: I know the files will eventually be available 
and that they are correct. All the function really does is tell the metadata 
where to find each file, so there is no real harm in committing before a file 
is fully available, unless you query it right away and it happens not to be 
there yet.
   
   This also speeds up commit time tremendously, since the library no longer 
has to scan every file on every single commit.
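
   To make the proposal concrete, here is a minimal sketch of the shape such a 
function could take. The `DataFile` and `Txn` types and the `ReplaceFiles` 
name are stand-ins invented for illustration, not the fork's actual API or 
iceberg-go's real types: the key idea is only that the caller supplies 
already-built data-file descriptors and the library never opens the files.

```go
package main

import "fmt"

// DataFile is a stand-in for an Iceberg data-file descriptor: the caller
// already knows the path, record count, and byte size from its own ingestion
// pipeline, so no scan of the underlying parquet file is needed.
type DataFile struct {
	Path        string
	RecordCount int64
	SizeBytes   int64
}

// Txn is a stand-in for a table transaction. ReplaceFiles trusts the
// caller-supplied DataFiles and only rewrites metadata pointers, whereas the
// existing ReplaceDataFiles re-reads every parquet footer to rebuild them.
type Txn struct{ files map[string]DataFile }

func (t *Txn) ReplaceFiles(toDelete []string, toAdd []DataFile) error {
	for _, p := range toDelete {
		delete(t.files, p)
	}
	for _, f := range toAdd {
		// Cheap structural validation only; the file itself is never opened.
		if f.Path == "" || f.RecordCount < 0 || f.SizeBytes < 0 {
			return fmt.Errorf("invalid data file: %+v", f)
		}
		t.files[f.Path] = f
	}
	return nil
}

func main() {
	txn := &Txn{files: map[string]DataFile{"s3://bucket/old.parquet": {}}}
	err := txn.ReplaceFiles(
		[]string{"s3://bucket/old.parquet"},
		[]DataFile{{Path: "s3://bucket/new.parquet", RecordCount: 1000, SizeBytes: 4096}},
	)
	fmt.Println(err, len(txn.files))
}
```

   Because the commit becomes a pure metadata operation, its cost stops 
scaling with the number or size of the files being swapped in.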


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
