[ https://issues.apache.org/jira/browse/HIVE-16591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025885#comment-16025885 ]
anishek commented on HIVE-16591:
--------------------------------
[~sankarh]
hive.repl.replica.functions.root.dir :: The property name explicitly contains
"replica" because it is a property that has to be set on the replica
warehouse; all the other "repl" properties are primarily for the
primary/source warehouse. The name signals that the property has to
be set only on the replica warehouse side.
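For concreteness, a minimal sketch of how the load side might resolve this root
directory (the class name and default value here are illustrative assumptions,
not the actual HiveConf wiring):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public final class ReplicaFunctionRoot {
  // Hypothetical default, used only for this sketch.
  private static final String DEFAULT_ROOT = "/user/hive/repl/functions";

  private ReplicaFunctionRoot() {}

  // Read on the replica warehouse only; the primary/source side never consults it.
  public static Path functionsRootDir(Configuration replicaConf) {
    return new Path(
        replicaConf.get("hive.repl.replica.functions.root.dir", DEFAULT_ROOT));
  }
}
{code}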
ReplChangeManager.addFile
True, I don't think we need that condition at all; I will submit a new patch
with that condition removed.
destinationResourceUri:
In the replication subsystem of Hive, checksums for files are associated with a
given cluster. Since the source and destination warehouses can have different
HDFS block sizes, the same file will have different checksums on each cluster.
A checksum could still be used as a differentiating factor in the path on the
replica warehouse, but that would overload the derivation of the checksum with
the question of whether it was computed on the source or the destination HDFS
cluster. Hence the use of a nanosecond timestamp, which achieves the same
differentiation, since multiple create-function events for the same function in
the same database, all processed within the same nanosecond, will not happen.
The idea is not to deduplicate the jar copy; we will not achieve that with the
current setup anyway, as multiple functions using the same function binaries
would all be stored on the destination under different directory names:
{noformat}
{basedir}/{database name}/{function name}/{time in nano}/{jar file name}
{noformat}
There might, however, be another problem that just occurred to me: if there are
100 functions defined with the same jars, the classpath on the replica
warehouse will get 100 additional URI paths, all carrying the same jar
definition and unnecessarily taking up PermGen space. So it might still be
better to associate the overloaded meaning with checksums if that saves the
classpath from being loaded with the same definition multiple times.
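For illustration, a minimal sketch of the destination-path derivation described
above (the class and method names are assumptions for this sketch, not the
actual Hive code):
{code:java}
import org.apache.hadoop.fs.Path;

public final class FunctionCopyPaths {
  private FunctionCopyPaths() {}

  // Builds {basedir}/{database name}/{function name}/{time in nano}/{jar file name}.
  // System.nanoTime() stands in for the nanosecond source; two create-function
  // events for the same function in the same database within one nanosecond
  // are treated as impossible, so the path is effectively unique.
  public static Path destinationResourceUri(Path baseDir, String dbName,
      String functionName, String jarFileName) {
    return new Path(baseDir, dbName + Path.SEPARATOR + functionName
        + Path.SEPARATOR + System.nanoTime() + Path.SEPARATOR + jarFileName);
  }
}
{code}
Swapping System.nanoTime() for the file checksum in this path is exactly the
dedup trade-off discussed above: identical jars would then collapse into one
directory, at the cost of overloading what the checksum means.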
bq. I saw the repl dump doesn't copy the function binaries to the dump dir;
FunctionSerializer just updates the original path with the checksum. But the
ReplCopyTask, during load on the target cluster/warehouse, tries to read/copy
from the original path/cmpath corresponding to the function binaries. Won't it
cause any access violation or permission issues?
The mechanism is similar to how table files get copied: on creation of a
function we just modify the resource URI paths to include the checksum, to keep
the binary accessible in case the source deletes the function. Currently,
replication requires access to the source HDFS cluster to copy the relevant
data, and most of that is done internally by distcp, which I think is going to
run with super-user privileges to allow the copy of files.
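To illustrate the resource URI rewrite mentioned above (attaching the checksum
as a URI fragment is an assumption for this sketch; the actual encoding in
ReplChangeManager may differ):
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

public final class ResourceUriChecksum {
  private ResourceUriChecksum() {}

  // Attach the source-side checksum to the resource URI so the load side can
  // fall back to the change-management (cmroot) copy if the original file has
  // already been deleted on the source.
  public static URI withChecksum(URI resourceUri, String checksum)
      throws URISyntaxException {
    return new URI(resourceUri.getScheme(), resourceUri.getAuthority(),
        resourceUri.getPath(), resourceUri.getQuery(), checksum);
  }
}
{code}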
> DR for function Binaries on HDFS
> ---------------------------------
>
> Key: HIVE-16591
> URL: https://issues.apache.org/jira/browse/HIVE-16591
> Project: Hive
> Issue Type: Sub-task
> Components: HiveServer2
> Affects Versions: 3.0.0
> Reporter: anishek
> Assignee: anishek
> Attachments: HIVE-16591.1.patch
>
>
> # We have to make sure that during incremental dump we don't allow functions
> to be copied if they have local filesystem "file://" resources. -- This
> depends on how much system-side work we want to do. We are going to
> explicitly provide a caveat for replicating functions: only functions created
> with the "using" clause will be replicated, and since the "using" clause
> prohibits creating functions with local "file://" resources, doing additional
> checks during repl dump might not be required (a sketch of this check follows
> the quoted description below).
> # We have to make sure that during the bootstrap / incremental dump we append
> the namenode host + port if functions are created without the fully
> qualified URI location on HDFS; not sure how this would play out for the S3
> or WASB filesystems.
> # We have to copy the binaries of a function's resource list on CREATE / DROP
> FUNCTION. The change management file system has to keep a copy of the binary
> when DROP FUNCTION is called, to provide the capability of updating the
> binary definition of existing functions along with DR. An example list of
> steps is given in the doc (ReplicateFunctions.pdf) attached to the parent
> issue.
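A minimal sketch of the checks from items 1 and 2 above (class and method
names are illustrative assumptions, not the actual dump-side code in Hive):
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FunctionResourceChecks {
  private FunctionResourceChecks() {}

  // Item 1: refuse to dump a function whose resources live on the local
  // filesystem, since "file://" paths are meaningless on the replica.
  public static void rejectLocalResource(URI resourceUri) {
    if ("file".equalsIgnoreCase(resourceUri.getScheme())) {
      throw new IllegalArgumentException(
          "Function resource on local filesystem cannot be replicated: " + resourceUri);
    }
  }

  // Item 2: qualify a bare or relative path with the namenode host + port
  // taken from fs.defaultFS so the replica can resolve it. How this behaves
  // when the default filesystem is S3 or WASB is the open question noted above.
  public static Path qualify(String resource, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.makeQualified(new Path(resource));
  }
}
{code}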