[ 
https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13741536#comment-13741536
 ] 

Andrew Wang commented on HDFS-5096:
-----------------------------------

bq. Usually partitions in Hive are new directories. So every 5 or 10 or 15 mins 
a new directory is added along with new data. Hence, the ability to 
automatically cache new files seems redundant.

I tried to explain this in an earlier comment, but let me elaborate. This 
request comes from Todd and our Impala team. Without auto-caching of new files, 
the abstraction of a "cached partition" is broken.

A Hive user might issue a command like {{ALTER TABLE CACHE PARTITION}} and 
then expect all future queries run against that partition to operate on cached 
data. The issue is that users can keep adding new files to the partition via 
direct HDFS commands, without informing the metastore. This leads to a partition 
that, although it is marked in the metastore as cached, is not entirely cached.

Even for just HDFS, I think users would prefer the abstraction of a cached path 
over "cached blocks of the path at the time the request was issued". I feel 
like the latter might even be more complicated, since we'd need to persist the 
list of blocks across NN restarts.
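
To make the difference concrete, here is a toy sketch in Python. It is not HDFS code: {{ToyNamespace}}, {{snapshot_directive}} and {{path_directive}} are hypothetical names invented for illustration, and the block IDs are made up. It only shows that a path-based directive picks up new files on the next scan, while a block-snapshot directive carries a frozen list that would have to be persisted.

```python
# Toy model only; none of these names exist in HDFS.
class ToyNamespace:
    def __init__(self):
        self.files = {}  # path -> list of block ids

    def add_file(self, path, blocks):
        self.files[path] = list(blocks)

    def blocks_under(self, directory):
        # Collect the blocks of every file below the directory.
        prefix = directory.rstrip("/") + "/"
        return {b for path, blocks in self.files.items()
                if path.startswith(prefix) for b in blocks}


def snapshot_directive(ns, directory):
    # "Cached blocks of the path at the time the request was issued":
    # the block list is frozen here and is the state that must be persisted.
    frozen = frozenset(ns.blocks_under(directory))
    return lambda: frozen


def path_directive(ns, directory):
    # "Cached path": only the path is stored; blocks are resolved on each scan.
    return lambda: ns.blocks_under(directory)


ns = ToyNamespace()
ns.add_file("/warehouse/t/p=1/a", [1, 2])
snap = snapshot_directive(ns, "/warehouse/t/p=1")
path = path_directive(ns, "/warehouse/t/p=1")

# A user drops a new file into the partition without telling the metastore.
ns.add_file("/warehouse/t/p=1/b", [3])
```

After the second {{add_file}}, {{path()}} includes block 3 and {{snap()}} does not, which is the broken "cached partition" case described above.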

I don't think this needs to be addressed in the first phase, since there's the 
workaround of making users re-issue the caching request when new data is added. 
I also agree that quota needs to be checked throughout; auto-caching should 
not violate the user's quota. Bikas' suggestion of "all-or-nothing" could come 
into play here, but for now I'd rather just show errors and leave it to the 
user to clean up their caching requests.
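
As a rough sketch of the "just show errors" behavior, assuming a simple per-user byte quota: the function name {{auto_cache}}, its parameters and the block sizes are all hypothetical, and this is only one possible policy, not a description of how the NN would do it. Blocks that would exceed the quota are skipped and reported, and nothing already cached is rolled back, which is the opposite of all-or-nothing.

```python
def auto_cache(new_blocks, used_bytes, quota_bytes):
    """Toy quota check; new_blocks is a list of (block_id, size_in_bytes).

    Returns (cached block ids, error messages, bytes used afterwards).
    """
    cached, errors = [], []
    for block_id, size in new_blocks:
        if used_bytes + size > quota_bytes:
            # Report the failure and keep going; the user cleans up later.
            errors.append(
                f"block {block_id}: caching {size} bytes would exceed quota")
            continue
        used_bytes += size
        cached.append(block_id)
    return cached, errors, used_bytes


cached, errors, used = auto_cache(
    [(1, 60), (2, 50), (3, 30)], used_bytes=0, quota_bytes=100)
```

An all-or-nothing variant would instead sum the sizes up front and cache none of the new blocks if the total did not fit.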
                
> Automatically cache new data added to a cached path
> ---------------------------------------------------
>
>                 Key: HDFS-5096
>                 URL: https://issues.apache.org/jira/browse/HDFS-5096
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>            Reporter: Andrew Wang
>
> For some applications, it's convenient to specify a path to cache, and have 
> HDFS automatically cache new data added to the path without sending a new 
> caching request or a manual refresh command.
> One example is new data appended to a cached file. It would be nice to 
> re-cache a block at the new appended length, and cache new blocks added to 
> the file.
> Another example is a cached Hive partition directory, where a user can drop 
> new files directly into the partition. It would be nice if these new files 
> were cached.
> In both cases, this automatic caching would happen after the file is closed, 
> i.e. block replica is finalized.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
