Thanks for the heads up on this. It sounds like this is not a concern for
most people, but we should definitely add a warning about it to our
maintenance docs. Would you like to open a PR for that?

On Fri, Sep 11, 2020 at 3:45 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Because the RemoveOrphanFilesAction uses Filesystem.list, the paths of
> files found in the file system can have an authority included in them
> based on the core-site.xml. The authority is determined when listing the
> files, so it does not necessarily match the entries stored in the metadata
> tables. The URIs will have the same scheme and path but can have a
> different authority. This means that when doing a string-matching join in
> Spark, the files found on the file system will not match those listed in
> the metadata table, and the action will determine that the files are no
> longer required and delete them. This leads to removing all the files that
> are listed with a different authority.
>
> I believe this will only affect you if you have changed authorities
> between writing and running RemoveOrphanFilesAction.
> We are doing more investigation, but because of the potential for data
> loss I thought it important to share with the dev list.
>
> If your authority has not changed and will not change, there should be no
> issues.
>
> Thanks for your time,
> Russ
>
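
To make the failure mode described in the quoted message concrete, here is a
minimal sketch in plain Python rather than Spark. The host names, paths, and
function names are all hypothetical and are not taken from the Iceberg code
base. It only illustrates how an exact string comparison treats two URIs with
the same scheme and path but different authorities as different files, and
what a comparison on scheme and path alone would report instead.

```python
from urllib.parse import urlsplit

# Hypothetical example data: the metadata was written under one authority
# (nn-old:8020) and the listing is performed under another (nn-new:8020).
metadata_paths = {
    "hdfs://nn-old:8020/warehouse/db/tbl/data/a.parquet",
}
listed_paths = [
    "hdfs://nn-new:8020/warehouse/db/tbl/data/a.parquet",
    "hdfs://nn-new:8020/warehouse/db/tbl/data/stray.parquet",
]


def orphans_by_string(listed, known):
    """Flag every listed path whose full URI string is not in the metadata."""
    return [p for p in listed if p not in known]


def scheme_and_path(uri):
    """Reduce a URI to (scheme, path), dropping the authority."""
    parts = urlsplit(uri)
    return (parts.scheme, parts.path)


def orphans_ignoring_authority(listed, known):
    """Flag listed paths whose (scheme, path) pair is not in the metadata."""
    known_keys = {scheme_and_path(p) for p in known}
    return [p for p in listed if scheme_and_path(p) not in known_keys]


if __name__ == "__main__":
    # Exact string matching flags both files, including the live data file.
    print(orphans_by_string(listed_paths, metadata_paths))
    # Matching on scheme and path flags only the stray file.
    print(orphans_ignoring_authority(listed_paths, metadata_paths))
```

Dropping the authority is shown only to illustrate the mismatch. It is not
necessarily a safe fix, since two different file systems could contain the
same scheme and path.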


-- 
Ryan Blue
Software Engineer
Netflix
