ksmatharoo commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-1515953429
> I don't think it is a good idea in general to use relative paths. We recently had an issue where using a `hdfs` location without authority caused a user's data to be deleted by the `RemoveOrphanFiles` action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.
>
> But there is still a way to do both. Catalogs and tables can inject their own `FileIO` implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.
@rdblue We tried injecting our own `FileIO`, which replaces the table/metadata path prefix with the new location. This works while reading the table metadata, but reading the Parquet data files fails in `BatchDataReader.java`; both the `FileIO` and the failing function are shown below. Please share your thoughts if there is some other way of achieving this.
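Roughly what our injected `FileIO` looks like (a minimal sketch, not our exact code: the class name and the `io.prefix.*` property keys are placeholders, and it simply delegates to Iceberg's `HadoopFileIO`):

```java
import java.util.Map;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;

// Sketch of a FileIO that rewrites a path prefix before delegating.
public class PrefixRewritingFileIO extends HadoopFileIO {
  private String sourcePrefix;
  private String targetPrefix;

  @Override
  public void initialize(Map<String, String> properties) {
    super.initialize(properties);
    // Placeholder property keys, supplied via catalog/table properties.
    this.sourcePrefix = properties.get("io.prefix.source");
    this.targetPrefix = properties.get("io.prefix.target");
  }

  private String rewrite(String path) {
    return path.startsWith(sourcePrefix)
        ? targetPrefix + path.substring(sourcePrefix.length())
        : path;
  }

  @Override
  public InputFile newInputFile(String path) {
    return super.newInputFile(rewrite(path));
  }

  @Override
  public OutputFile newOutputFile(String path) {
    return super.newOutputFile(rewrite(path));
  }
}
```

This is the function in `BatchDataReader.java` where the read fails: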
```java
protected CloseableIterator<ColumnarBatch> open(FileScanTask task) {
  String filePath = task.file().path().toString();
  LOG.debug("Opening data file {}", filePath);

  // update the current file for Spark's filename() function
  InputFileBlockHolder.set(filePath, task.start(), task.length());

  Map<Integer, ?> idToConstant = constantsMap(task, expectedSchema());

  // The line below causes the issue: it looks up the path recorded in the
  // metadata in a map whose keys have been replaced by our custom FileIO.
  // Changing it to `InputFile inputFile = table.io().newInputFile(filePath);`
  // makes it work, but that bypasses the encryption logic. In short, we could
  // not make it work with a custom FileIO alone.
  InputFile inputFile = getInputFile(filePath);
  Preconditions.checkNotNull(inputFile, "Could not find InputFile associated with FileScanTask");

  SparkDeleteFilter deleteFilter =
      task.deletes().isEmpty()
          ? null
          : new SparkDeleteFilter(filePath, task.deletes(), counter());

  return newBatchIterable(
          inputFile,
          task.file().format(),
          task.start(),
          task.length(),
          task.residual(),
          idToConstant,
          deleteFilter)
      .iterator();
}
```
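One direction that might work around this (a sketch under our assumptions, not a confirmed fix): instead of letting the custom `FileIO` change the location that the returned `InputFile` reports, redirect only where the bytes are read from and keep `location()` returning the original metadata path. The map consulted by `getInputFile(filePath)` appears to be keyed by `InputFile.location()`, so preserving the original path keeps the lookup by `task.file().path()` working and leaves the encryption wiring untouched. The class below is hypothetical:

```java
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.SeekableInputStream;

// Hypothetical: reads bytes from a relocated copy of the file while still
// reporting the original metadata path from location(), so map lookups that
// key on InputFile.location() continue to match task.file().path().
class LocationPreservingInputFile implements InputFile {
  private final InputFile delegate; // opened against the rewritten path
  private final String originalLocation; // path as recorded in table metadata

  LocationPreservingInputFile(InputFile delegate, String originalLocation) {
    this.delegate = delegate;
    this.originalLocation = originalLocation;
  }

  @Override
  public long getLength() {
    return delegate.getLength();
  }

  @Override
  public SeekableInputStream newStream() {
    return delegate.newStream();
  }

  @Override
  public String location() {
    return originalLocation; // preserve the metadata path as the lookup key
  }

  @Override
  public boolean exists() {
    return delegate.exists();
  }
}
```

The custom `FileIO` would then return `new LocationPreservingInputFile(super.newInputFile(rewrite(path)), path)` from `newInputFile(String)`.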