dhruve commented on issue #24785: [SPARK-27937][CORE] Revert partial logic for 
auto namespace discovery
URL: https://github.com/apache/spark/pull/24785#issuecomment-500456945
 
 
   I think I have tried to explain the use case in the linked JIRAs, but it seems it isn't clear.
   
   Let me try to explain the use case for which the change was implemented, as described in the original JIRA:
   
   Storing table/Hive partitions across namespaces isn't uncommon. For partitions of a given table, good design dictates storing related data in related namespaces: either the same namespace, or a large namespace broken into sub-namespaces for better namespace management. In that case, good design points to viewfs. Just because we have the option of selecting a namespace while creating a Hive partition doesn't mean choosing an unrelated namespace is a good idea. While a single team or organization might do this for whatever reason, I don't think it is the norm, or that Hadoop/Hive encourage storage policies that put related data in unrelated namespaces, especially when the data belongs to the same table.
   
   Because of federation, the unrelated namespace happens to live on the same cluster as the other namespace. That, by itself, is okay.
   
   So the problem we want to solve comes down to:
   Can we make Spark figure this out for us, so the user doesn't need to know about the storage choices?
   
   The question is how we do this. We can't, because Spark doesn't know anything about how the data is laid out across namespaces (which is exactly where using viewfs makes total sense).
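
   As an aside, here's a minimal sketch of what that looks like (the mount table name, NameService IDs and paths below are hypothetical, not from any actual cluster): with viewfs, a client-side mount table presents one logical namespace over the federated ones, so jobs never hardcode a specific NameService.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical mount table: one logical viewfs namespace ("clusterX")
// fronting two federated namespaces, "nn1" and "nn2". In practice these
// keys live in core-site.xml, and resolving the links needs a reachable
// cluster; this only illustrates the shape of the config.
val conf = new Configuration()
conf.set("fs.defaultFS", "viewfs://clusterX")
conf.set("fs.viewfs.mounttable.clusterX.link./warehouse",
  "hdfs://nn1/warehouse")
conf.set("fs.viewfs.mounttable.clusterX.link./warehouse-archive",
  "hdfs://nn2/warehouse-archive")

// Callers address logical viewfs paths; the client resolves which
// underlying namespace backs each mount point.
val fs = FileSystem.get(conf)
fs.exists(new Path("/warehouse/mytable/dt=2019-06-01"))
```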
   
   The approach I'm referring to: just because the data ended up on the same cluster, we figure out the namespace information from the federation configs.
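
   Roughly, that discovery amounts to something like the following sketch (assuming the standard federation key dfs.nameservices; this is an illustration, not the exact code that was added):

```scala
import org.apache.hadoop.conf.Configuration

// Enumerate every NameService declared in the federation config,
// whether or not the job actually reads or writes it. In a real
// deployment dfs.nameservices comes from hdfs-site.xml on the classpath.
val hadoopConf = new Configuration()
val nameservices = hadoopConf.getTrimmedStrings("dfs.nameservices")
val namespaceUris = nameservices.map(ns => s"hdfs://$ns")
```
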
   Some issues that I see with the approach:
   
   1. It doesn't solve the issue. The user still has to know about the storage choices: if the data is stored on a different cluster, this solution doesn't work. It's just a hack to work around a poor design choice.
   2. We now get tokens for namespaces the user isn't going to read from or write to. This was broken, and there was another patch on top of the change to avoid doing exactly that and to mention namespaces explicitly (see the sketch after this list):
    - https://github.com/apache/spark/commit/c3f285c939ba046de5171ada9c4bbb1a2589635d
   3. If Hadoop already figures out how to get tokens for different namespaces using viewfs, that is the better choice.
   4. It breaks launching Spark against existing HDFS deployments when it tries to create a path from NameServiceIDs.
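
   For point 2, the explicit alternative boils down to the user listing every filesystem the job actually touches. A minimal sketch (the NameService IDs are hypothetical; on older releases the key is spark.yarn.access.hadoopFileSystems):

```scala
import org.apache.spark.SparkConf

// The user explicitly names the filesystems the job reads from or
// writes to; delegation tokens are then fetched only for these,
// not for everything discovered from the federation configs.
val conf = new SparkConf()
  .set("spark.kerberos.access.hadoopFileSystems",
    "hdfs://nn1,hdfs://nn2")
```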
   
