[jira] [Comment Edited] (MAPREDUCE-7101) Revisit behavior of JHS scan file behavior

Ewan Higgs (JIRA) Wed, 06 Jun 2018 06:46:21 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503311#comment-16503311
 ]


Ewan Higgs edited comment on MAPREDUCE-7101 at 6/6/18 1:45 PM:
---------------------------------------------------------------

 
{quote}integrate with cloud event sources which are set up to add a new event 
when a file is added to a container.
{quote}
While this might not be "fun", it might be the correct solution. Have the 
various cloud event sources (AWS Lambda, Google Cloud Functions, etc.) been 
wrapped in the same way that S3, WASB, GCS, etc. have been implemented as 
HCFSs? Is that required? Or do we just slurp up a Kafka stream of the events 
and assume someone's made a bridge?
{quote}One thing to consider though: if the scanning includes subdirectories 
then listFiles(path, recursive=true) is orders of magnitude more efficient on 
S3A (and any other connector which can do bulk listings): we want to use that 
for any recursive polling.
{quote}
 

HDFS-13616 (Batch listing of multiple directories) may also be relevant here 
for Hadoop.
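
To illustrate why a flat recursive listing can beat a directory-by-directory walk on an object store, here is a toy in-memory model. It is not S3A code: ToyObjectStore, listFlat, listTreewalk and the pageSize parameter are hypothetical names made up for this sketch, and the only thing it measures is the number of simulated LIST requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy in-memory object store (hypothetical); it only counts simulated LIST requests. */
final class ToyObjectStore {
  private final TreeSet<String> keys = new TreeSet<>();
  int listCalls = 0;

  void put(String key) {
    keys.add(key);
  }

  /** Flat listing: every key under the prefix, one simulated request per page of results. */
  List<String> listFlat(String prefix, int pageSize) {
    List<String> out = new ArrayList<>();
    for (String k : keys.tailSet(prefix, true)) {
      if (!k.startsWith(prefix)) {
        break; // keys are sorted, so the prefix range has ended
      }
      out.add(k);
    }
    listCalls += Math.max(1, (out.size() + pageSize - 1) / pageSize);
    return out;
  }

  /** Treewalk: one simulated request per "directory", then recurse into each child. */
  List<String> listTreewalk(String prefix) {
    listCalls++;
    List<String> out = new ArrayList<>();
    TreeSet<String> childDirs = new TreeSet<>();
    for (String k : keys.tailSet(prefix, true)) {
      if (!k.startsWith(prefix)) {
        break;
      }
      String rest = k.substring(prefix.length());
      int slash = rest.indexOf('/');
      if (slash < 0) {
        out.add(k); // a file directly under this prefix
      } else {
        childDirs.add(prefix + rest.substring(0, slash + 1));
      }
    }
    for (String dir : childDirs) {
      out.addAll(listTreewalk(dir));
    }
    return out;
  }
}
```

In this model the treewalk issues one request per directory in the tree, while the flat listing issues one request per page of results regardless of how deep the tree is. Real connectors have other costs that the model ignores.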



> Revisit behavior of JHS scan file behavior
> ------------------------------------------
>
>                 Key: MAPREDUCE-7101
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7101
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Priority: Critical
>
> Currently, the JHS scans a directory if the modification time of the *directory* has changed: 
> {code} 
>     public synchronized void scanIfNeeded(FileStatus fs) {
>       long newModTime = fs.getModificationTime();
>       if (modTime != newModTime) {
>         <... omitted some logics ...>
>         // reset scanTime before scanning happens
>         scanTime = System.currentTimeMillis();
>         Path p = fs.getPath();
>         try {
>           scanIntermediateDirectory(p);
> {code}
> This logic relies on the assumption that the directory's modification time 
> is updated whenever a file is placed under the directory.
> However, the semantics of a directory's modification time are not consistent 
> across FS implementations. For example, MAPREDUCE-6680 fixed some issues 
> with truncated modification times, and HADOOP-12837 mentions that on S3 the 
> directory's modification time is always 0.
> I think we need to revisit this logic to make it work more robustly 
> on different file systems.



