[ 
https://issues.apache.org/jira/browse/HIVE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904640#comment-13904640
 ] 

Brock Noland commented on HIVE-860:
-----------------------------------

bq. Does this work for jars on HDFS that have been added using the ADD JAR 
functionality?

Yes, jars added via that mechanism are also cached.

bq. So when a non-local jar is added by a session, it gets copied locally to 
the session resource directory. But if the local copy of the jar has the same 
file name/md5 hash/mtime as what is already saved in the user's distributed 
cache, then this should work right?

This patch uses SHA-1 plus file size to verify that the files are the same. In 
practice the file size check only ensures the jar is complete, as SHA-1 should 
be unique enough for our purposes.
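
As a rough illustration of the check described above (not the code from the patch), comparing two files by SHA-1 and size could look like the sketch below. The function names `fingerprint` and `same_jar` are hypothetical, and it reads local files only; the real patch works against Hive's resource handling.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return (sha1_hex, size_in_bytes) for the file at path."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large jars are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def same_jar(local_path, cached_path):
    """True when both files have the same SHA-1 digest and the same size."""
    return fingerprint(local_path) == fingerprint(cached_path)
```

File name and mtime play no part in this comparison, so a re-downloaded copy of the same jar would still match the cached one.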


> Persistent distributed cache
> ----------------------------
>
>                 Key: HIVE-860
>                 URL: https://issues.apache.org/jira/browse/HIVE-860
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.12.0
>            Reporter: Zheng Shao
>            Assignee: Brock Noland
>             Fix For: 0.13.0
>
>         Attachments: HIVE-860.patch, HIVE-860.patch, HIVE-860.patch, 
> HIVE-860.patch, HIVE-860.patch
>
>
> DistributedCache is shared across multiple jobs if the HDFS file name is the 
> same.
> We need to make sure Hive puts the same file into the same location every time 
> and does not overwrite it if the file content is the same.
> We can achieve 2 different results:
> A1. Files added with the same name, timestamp, and md5 in the same session 
> will have a single copy in distributed cache.
> A2. Files added with the same name, timestamp, and md5 will have a single 
> copy in distributed cache.
> A2 offers more sharing but raises the question of when Hive should clean 
> it up in HDFS.
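
The "same file, same location, no overwrite" idea in the description can be sketched as follows. This is only an assumption-laden illustration: a local directory stands in for HDFS, the key is simplified to file name plus md5 (the timestamp is left out), and `cache_path` and `publish` are hypothetical names, not Hive APIs.

```python
import hashlib
import os
import shutil


def cache_path(cache_root, jar_path):
    """Build a deterministic destination: <cache_root>/<md5>/<file name>."""
    digest = hashlib.md5()
    with open(jar_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return os.path.join(cache_root, digest.hexdigest(), os.path.basename(jar_path))


def publish(cache_root, jar_path):
    """Copy jar_path into the cache unless a copy is already at its destination.

    Returns (destination, copied), where copied is False when the existing
    copy was reused.
    """
    dest = cache_path(cache_root, jar_path)
    if os.path.exists(dest):
        # Same name and same content hash: reuse the copy, do not overwrite.
        return dest, False
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(jar_path, dest)
    return dest, True
```

Because the destination depends only on name and content, two sessions adding the same jar would resolve to the same path, which corresponds to option A2. The sketch does not address the cleanup question.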



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
