dhruve commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for auto namespace discovery
URL: https://github.com/apache/spark/pull/24785#issuecomment-500456945

I have tried to explain the use case in the linked JIRAs, but it seems it isn't clear. Let me walk through the use case for which the change was implemented, as described in the original JIRA.

Storing table/Hive partitions across namespaces isn't uncommon. When partitions of a given table are spread across namespaces, good design dictates storing related data in related namespaces: either in the same namespace, or in a large namespace broken into sub-namespaces for better namespace management. In that case, good design also dictates using viewfs.

Just because we have the option of selecting a namespace while creating a Hive partition, choosing an unrelated namespace is a poor decision. A single team or organization might do this for whatever reason, but I don't think it is the norm, nor do Hadoop/Hive encourage storage policies that put related data in unrelated namespaces, especially when the data belongs to the same table. Because of federation, the unrelated namespace happens to be served by the same cluster as the other namespace. That, by itself, is okay.

So the problem we wanted to solve comes down to: can we make Spark figure this out for us, so the user doesn't need to know about the storage choices? The question is how. We cannot, because Spark doesn't know anything about how the data is laid out across namespaces (this is where using viewfs would make total sense). The approach in question instead says: just because the data ended up on the same cluster, we can figure out the namespace information from the federation configs.

Some issues that I see with that approach:

1. It doesn't solve the issue. The user still has to know about the storage choices, and if the data is stored on a different cluster, this solution doesn't work. It's just a hack to work around a poor design choice.
2. We now get tokens for namespaces the user isn't going to read from or write to. This was broken, and another patch had to be applied on top of the change to avoid doing exactly that and to name the namespaces explicitly: https://github.com/apache/spark/commit/c3f285c939ba046de5171ada9c4bbb1a2589635d
3. If Hadoop already figures out how to get tokens for the different namespaces using viewfs, that is the better choice (a sketch follows below).
4. It fails to launch Spark with existing HDFS deployments while trying to create a path from nameservice IDs.
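For context, here is a minimal sketch of the viewfs setup referred to in point 3. The cluster name (`clusterA`), mount paths, and nameservice URIs (`ns1`, `ns2`) are hypothetical and do not come from the PR. Hadoop resolves `viewfs://` paths through a client-side mount table, so Spark sees one logical filesystem even when the underlying partitions live in different federated namespaces:

```xml
<!-- core-site.xml (illustrative values only) -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterA</value>
</property>

<!-- Partitions of the same table mapped into one logical tree,
     backed by two federated namespaces, ns1 and ns2. -->
<property>
  <name>fs.viewfs.mounttable.clusterA.link./warehouse/tbl/part1</name>
  <value>hdfs://ns1/warehouse/tbl/part1</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterA.link./warehouse/tbl/part2</name>
  <value>hdfs://ns2/warehouse/tbl/part2</value>
</property>
```

With a mount table like this, a job reading `viewfs://clusterA/warehouse/tbl` spans both namespaces transparently, and viewfs exposes the underlying child filesystems so delegation tokens can be obtained for each of them, without Spark needing any federation-specific logic.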
