Re: Referencing data files in manifest files

Vivekanand Vellanki Tue, 18 May 2021 10:16:45 -0700

Is it possible to make the FileIO implementation extensible per URI scheme?

For example, for the hdfs:// scheme, can I ensure that Iceberg uses my custom
implementation of FileIO at run time?
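
To make the question concrete, here is a rough sketch of the kind of
per-scheme dispatch I have in mind. It is plain Python and not Iceberg's
API: HadoopLikeIO, MyHdfsIO and SchemeDispatchingIO are made-up names, and
the host and container names in the example paths are placeholders.

```python
from urllib.parse import urlparse


class HadoopLikeIO:
    """Hypothetical stand-in for a default, Hadoop-backed FileIO."""

    def open(self, path):
        # A real implementation would return a stream; a tagged string
        # is enough to show which implementation handled the path.
        return f"default:{path}"


class MyHdfsIO:
    """Hypothetical custom implementation intended for hdfs:// paths."""

    def open(self, path):
        return f"custom:{path}"


class SchemeDispatchingIO:
    """Picks a delegate based on the URI scheme of the file path."""

    def __init__(self, default, by_scheme):
        self.default = default
        self.by_scheme = dict(by_scheme)

    def delegate_for(self, path):
        # file_path is expected to be a fully qualified URI, so the
        # scheme is whatever precedes "://".
        scheme = urlparse(path).scheme
        return self.by_scheme.get(scheme, self.default)

    def open(self, path):
        return self.delegate_for(path).open(path)


io = SchemeDispatchingIO(HadoopLikeIO(), {"hdfs": MyHdfsIO()})
print(io.open("hdfs://namenode:8020/warehouse/tbl/data/f.parquet"))
print(io.open("wasb://container@account/warehouse/tbl/data/f.parquet"))
```

Here hdfs:// paths would go to my implementation and everything else would
fall through to the default one. What I am asking is whether Iceberg offers
a supported hook for this kind of mapping.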


On Tue, May 18, 2021 at 9:45 PM Daniel Weeks <[email protected]> wrote:

> Hey Vivek,
>
> The file_path per spec is technically just a string, but the
> representation is expected to be a URI.
>
> How this URI is interpreted is really up to the FileIO implementation.  So
> for example, the most common FileIO implementation is probably
> HadoopFileIO, which is going to use whatever file system scheme mapping
> you've defined in your configuration (typically via core-site.xml).
>
> For the Azure case (I'm not very familiar with this), it looks like
> AdlFileSystem is the Hadoop FileSystem implementation.  So, if you map wasb
> -> AdlFileSystem, then you would want to use the URI format you described.
>
> There are more custom FileIO implementations (like S3FileIO) that are
> more specific about URI representations, but the HadoopFileIO approach is
> probably more common at this point and relies on how Hadoop will resolve
> the URI.
>
> The only other thing I would note is that at this point the paths still
> need to be fully qualified (though there are some discussions ongoing about
> relative paths).
>
> Hope that helps,
> -Dan
>
>
>
> On Thu, May 13, 2021 at 5:30 AM Vivekanand Vellanki <[email protected]>
> wrote:
>
>> Hi,
>>
>> We are trying to create Iceberg tables on ADLS. What is the format for
>> referencing data files in ADLS from Manifest files?
>>
>> We are seeing Spark use something like:
>> wasb://<container>@account/<file path>
>>
>> Is there a standard for how data files should be referenced within
>> manifest files?
>>
>> Thanks
>> Vivek
>>
>>
