[ https://issues.apache.org/jira/browse/HIVE-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025885#comment-16025885 ]
anishek commented on HIVE-16591:
--------------------------------
[~sankarh]
hive.repl.replica.functions.root.dir :: The property name explicitly contains
"replica" because it is a property that has to be set on the replica
warehouse; all the other "repl" properties are primarily for the
primary/source warehouse. The name signals that the property has to
be set only on the replica warehouse side.
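For concreteness, a minimal sketch of how the load side might resolve this root
directory (the class name and default value here are illustrative assumptions,
not the actual HiveConf wiring):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public final class ReplicaFunctionRoot {
  // Hypothetical default, used only for this sketch.
  private static final String DEFAULT_ROOT = "/user/hive/repl/functions";

  private ReplicaFunctionRoot() {}

  // Read on the replica warehouse only; the primary/source side never consults it.
  public static Path functionsRootDir(Configuration replicaConf) {
    return new Path(
        replicaConf.get("hive.repl.replica.functions.root.dir", DEFAULT_ROOT));
  }
}
{code}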
ReplChangeManager.addFile
True, I don't think we need that condition at all; I will submit a new patch
with that condition removed.
destinationResourceUri:
In the replication subsystem of Hive, checksums for files are associated with a
given cluster. Since the source and destination warehouses can have different
HDFS block sizes, the same file will have different checksums on each cluster.
A checksum could still be used as a differentiating factor in the path on the
replica warehouse, but that would overload the derivation of the checksum with
the question of whether it was computed on the source or the destination HDFS
cluster. Hence the use of a nanosecond timestamp, which achieves the same
differentiation, since multiple create-function events for the same function in
the same database, all processed within the same nanosecond, will not happen.
The idea is not to deduplicate the jar copy; we will not achieve that with the
current setup anyway, as multiple functions using the same function binaries
would all be stored on the destination under different directory names:
{noformat}
{basedir}/{database name}/{function name}/{time in nano}/{jar file name}
{noformat}
There might, however, be another problem that just occurred to me: if there are
100 functions defined with the same jars, the classpath on the replica
warehouse will get 100 additional URI paths, all carrying the same jar
definition and unnecessarily taking up PermGen space. So it might still be
better to associate the overloaded meaning with checksums if that saves the
classpath from being loaded with the same definition multiple times.
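For illustration, a minimal sketch of the destination-path derivation described
above (the class and method names are assumptions for this sketch, not the
actual Hive code):
{code:java}
import org.apache.hadoop.fs.Path;

public final class FunctionCopyPaths {
  private FunctionCopyPaths() {}

  // Builds {basedir}/{database name}/{function name}/{time in nano}/{jar file name}.
  // System.nanoTime() stands in for the nanosecond source; two create-function
  // events for the same function in the same database within one nanosecond
  // are treated as impossible, so the path is effectively unique.
  public static Path destinationResourceUri(Path baseDir, String dbName,
      String functionName, String jarFileName) {
    return new Path(baseDir, dbName + Path.SEPARATOR + functionName
        + Path.SEPARATOR + System.nanoTime() + Path.SEPARATOR + jarFileName);
  }
}
{code}
Swapping System.nanoTime() for the file checksum in this path is exactly the
dedup trade-off discussed above: identical jars would then collapse into one
directory, at the cost of overloading what the checksum means.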
bq. I saw the repl dump doesn't copy the function binaries to the dump dir;
FunctionSerializer just updates the original path with the checksum. But the
ReplCopyTask, during load on the target cluster/warehouse, tries to read/copy
from the original path/cmpath corresponding to the function binaries. Won't it
cause any access violation or permission issues?
The mechanism is similar to how table files get copied: on creation of a
function we just modify the resource URI paths to include the checksum, to keep
the binary accessible in case the source deletes the function. Currently,
replication requires access to the source HDFS cluster to copy the relevant
data, and most of that is done internally by distcp, which I think is going to
run with super-user privileges to allow the copy of files.
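To illustrate the resource URI rewrite mentioned above (attaching the checksum
as a URI fragment is an assumption for this sketch; the actual encoding in
ReplChangeManager may differ):
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

public final class ResourceUriChecksum {
  private ResourceUriChecksum() {}

  // Attach the source-side checksum to the resource URI so the load side can
  // fall back to the change-management (cmroot) copy if the original file has
  // already been deleted on the source.
  public static URI withChecksum(URI resourceUri, String checksum)
      throws URISyntaxException {
    return new URI(resourceUri.getScheme(), resourceUri.getAuthority(),
        resourceUri.getPath(), resourceUri.getQuery(), checksum);
  }
}
{code}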
> DR for function Binaries on HDFS
> ---------------------------------
>
> Key: HIVE-16591
> URL: https://issues.apache.org/jira/browse/HIVE-16591
> Project: Hive
> Issue Type: Sub-task
> Components: HiveServer2
> Affects Versions: 3.0.0
> Reporter: anishek
> Assignee: anishek
> Attachments: HIVE-16591.1.patch
>
>
> # We have to make sure that during incremental dump we don't allow functions
> to be copied if they have local filesystem "file://" resources. -- This
> depends on how much system-side work we want to do. We are going to
> explicitly provide a caveat for replicating functions: only functions created
> with the "using" clause will be replicated, and since the "using" clause
> prohibits creating functions with local "file://" resources, doing additional
> checks during repl dump might not be required (a sketch of this check follows
> the quoted description below).
> # We have to make sure that during the bootstrap / incremental dump we append
> the namenode host + port if functions are created without the fully
> qualified URI location on HDFS; not sure how this would play out for the S3
> or WASB filesystems.
> # We have to copy the binaries of a function's resource list on CREATE / DROP
> FUNCTION. The change management file system has to keep a copy of the binary
> when DROP FUNCTION is called, to provide the capability of updating the
> binary definition of existing functions along with DR. An example list of
> steps is given in the doc (ReplicateFunctions.pdf) attached to the parent
> issue.
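A minimal sketch of the checks from items 1 and 2 above (class and method
names are illustrative assumptions, not the actual dump-side code in Hive):
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FunctionResourceChecks {
  private FunctionResourceChecks() {}

  // Item 1: refuse to dump a function whose resources live on the local
  // filesystem, since "file://" paths are meaningless on the replica.
  public static void rejectLocalResource(URI resourceUri) {
    if ("file".equalsIgnoreCase(resourceUri.getScheme())) {
      throw new IllegalArgumentException(
          "Function resource on local filesystem cannot be replicated: " + resourceUri);
    }
  }

  // Item 2: qualify a bare or relative path with the namenode host + port
  // taken from fs.defaultFS so the replica can resolve it. How this behaves
  // when the default filesystem is S3 or WASB is the open question noted above.
  public static Path qualify(String resource, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.makeQualified(new Path(resource));
  }
}
{code}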