[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2017-10-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188391#comment-16188391
 ] 

ASF GitHub Bot commented on APEXMALHAR-2274:


vrozov closed pull request #597: APEXMALHAR-2274 merge #490
URL: https://github.com/apache/apex-malhar/pull/597
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> AbstractFileInputOperator gets killed when there are a large number of files.
> -
>
> Key: APEXMALHAR-2274
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2274
> Project: Apache Apex Malhar
> Issue Type: Bug
> Reporter: Munagala V. Ramanath
> Assignee: Matt Zhang
>
> When the monitored directory contains a large number of files, the call 
> to DirectoryScanner.scan() can take a long time because it calls 
> FileSystem.listStatus(), which returns the entire list. Meanwhile, the 
> AppMaster deems the operator hung and restarts it, which results in the 
> same problem again.
> It should use FileSystem.listStatusIterator() [in Hadoop 2.7.X], 
> FileSystem.listFiles() [in 2.6.X], or a similar call that returns
> a remote iterator, to limit the number of files processed in a single call.
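
The idea in the description, handing out a bounded number of files per call instead of materializing the whole listing, can be sketched as follows. This is only an illustration and not the code from the pull request: BatchedDirectoryScanner and nextBatch() are hypothetical names, and java.nio's lazily populated DirectoryStream stands in for a Hadoop RemoteIterator (which exposes the same hasNext()/next() style of iteration) so that the sketch runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: hands out directory entries in bounded batches. */
final class BatchedDirectoryScanner implements AutoCloseable {
  // Assumption: DirectoryStream is a local stand-in for a remote iterator.
  private final DirectoryStream<Path> stream;
  private final Iterator<Path> entries;

  BatchedDirectoryScanner(Path dir) {
    try {
      this.stream = Files.newDirectoryStream(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    this.entries = stream.iterator();
  }

  /** Returns at most maxFiles entries; an empty list means the scan is done. */
  List<Path> nextBatch(int maxFiles) {
    List<Path> batch = new ArrayList<>();
    while (batch.size() < maxFiles && entries.hasNext()) {
      batch.add(entries.next());
    }
    return batch;
  }

  @Override
  public void close() {
    try {
      stream.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A caller would invoke nextBatch() once per scan cycle, so each call does a bounded amount of work regardless of how many files the directory holds.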



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126752#comment-16126752
 ] 

ASF GitHub Bot commented on APEXMALHAR-2274:


tweise closed pull request #490: APEXMALHAR-2274 Handle large number of files 
for AbstractFileInputOpe…
URL: https://github.com/apache/apex-malhar/pull/490
 
 
   
 



[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654940#comment-15654940
 ] 

ASF GitHub Bot commented on APEXMALHAR-2274:


GitHub user mattqzhang opened a pull request:

https://github.com/apache/apex-malhar/pull/490

APEXMALHAR-2274 Handle large number of files for AbstractFileInputOpe…

@PramodSSImmaneni Please review

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mattqzhang/apex-malhar APEXMALHAR-2274

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/apex-malhar/pull/490.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #490


commit cb37761b62761d0e74f1c9b8c7c47491735e5421
Author: Matt Zhang 
Date:   2016-11-10T19:38:51Z

APEXMALHAR-2274 Handle large number of files for AbstractFileInputOperator






[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-08 Thread Munagala V. Ramanath (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647751#comment-15647751
 ] 

Munagala V. Ramanath commented on APEXMALHAR-2274:
--

The benefit of such an interface is not clear at this point; perhaps when the 
need for polymorphic use of such an interface arises, we can refactor.




[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-08 Thread Tushar Gosavi (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647455#comment-15647455
 ] 

Tushar Gosavi commented on APEXMALHAR-2274:
---

We could derive a common interface that both operators use for scanning 
files, since both are conceptually doing the same thing. We could have 
different implementations of the interface: one could return just the paths, 
and another could return the paths as well as the status.
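
A rough sketch of what such a shared interface might look like is given below. All names here (FileScanner, PathOnlyScanner, StatusScanner, PathWithStatus) are hypothetical and are not taken from Malhar, and java.nio is used in place of the Hadoop FileSystem API so that the example is self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical common contract; T is whatever an operator needs per file. */
interface FileScanner<T> {
  List<T> scan(Path dir);
}

/** Lightweight variant: returns paths only, with no per-file attribute lookup. */
final class PathOnlyScanner implements FileScanner<Path> {
  @Override
  public List<Path> scan(Path dir) {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        result.add(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}

/** Hypothetical value type pairing a path with a small subset of its status. */
final class PathWithStatus {
  final Path path;
  final long size;
  final long modifiedMillis;

  PathWithStatus(Path path, long size, long modifiedMillis) {
    this.path = path;
    this.size = size;
    this.modifiedMillis = modifiedMillis;
  }
}

/** Heavier variant: one extra attribute read per file. */
final class StatusScanner implements FileScanner<PathWithStatus> {
  @Override
  public List<PathWithStatus> scan(Path dir) {
    List<PathWithStatus> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
        result.add(new PathWithStatus(p, attrs.size(), attrs.lastModifiedTime().toMillis()));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

The type parameter lets the path-only implementation skip the attribute lookup entirely, which is the performance concern raised earlier in the thread.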



[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-11-02 Thread Matt Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15629893#comment-15629893
 ] 

Matt Zhang commented on APEXMALHAR-2274:


The scanner in FileSplitterInput is more complicated: it retrieves the file 
status and supports regex. In our case we only need a lightweight process to 
get the paths of all the files in the directory, so from a performance 
standpoint it's better to use a dedicated lightweight process.



[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-10-26 Thread Tushar Gosavi (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607573#comment-15607573
 ] 

Tushar Gosavi commented on APEXMALHAR-2274:
---

You can take a look at FileSplitterInput, which runs the scanner in a separate 
thread and passes the data to the main operator thread using a thread-safe 
queue. You can reuse the same approach; maybe we can combine the scanner for 
both operators.



[jira] [Commented] (APEXMALHAR-2274) AbstractFileInputOperator gets killed when there are a large number of files.

2016-10-25 Thread Matt Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/APEXMALHAR-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606612#comment-15606612
 ] 

Matt Zhang commented on APEXMALHAR-2274:


- FileSystem.listStatusIterator() is not available until Hadoop 2.7.

- FileSystem.listLocatedStatus() uses an array to store the files, and that 
array comes from the listStatus() call.

- FileSystem.listFiles() is available in 2.6, but it uses a stack to store the 
RemoteIterator, which refers to the array from listLocatedStatus().

- FileContext.listStatus() also refers to the array from fs.listStatus().

To resolve this we need to use the reconciler pattern: spawn a worker thread 
during setup() of AbstractInputOperator and have it read data into a 
thread-safe queue. The input operator then reads from this queue 
asynchronously.
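
The worker-thread-plus-queue arrangement described above might look roughly like the sketch below. It is not the code from the eventual pull request: AsyncDirectoryScanner, start(), pollBatch() and isFinished() are hypothetical names, the queue capacity of 1024 is an arbitrary assumption, and java.nio stands in for the Hadoop FileSystem API so that the example runs on its own.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: a background thread scans, the caller polls a queue. */
final class AsyncDirectoryScanner {
  // Bounded so a huge directory cannot exhaust memory (capacity is an assumption).
  private final BlockingQueue<Path> discovered = new LinkedBlockingQueue<>(1024);
  private final Path dir;
  private volatile boolean finished;
  private volatile IOException failure;
  private Thread worker;

  AsyncDirectoryScanner(Path dir) {
    this.dir = dir;
  }

  /** Starts the scan and returns immediately; intended for the operator's setup phase. */
  void start() {
    worker = new Thread(this::scanOnce, "dir-scanner");
    worker.setDaemon(true);
    worker.start();
  }

  private void scanOnce() {
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        discovered.put(p); // blocks while the queue is full
      }
    } catch (IOException e) {
      failure = e; // surfaced to the caller through getFailure()
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      finished = true;
    }
  }

  /** Non-blocking: returns up to maxFiles paths found so far, possibly none. */
  List<Path> pollBatch(int maxFiles) {
    List<Path> batch = new ArrayList<>(maxFiles);
    discovered.drainTo(batch, maxFiles);
    return batch;
  }

  /** True once the worker has stopped and every discovered path was handed out. */
  boolean isFinished() {
    return finished && discovered.isEmpty();
  }

  IOException getFailure() {
    return failure;
  }

  void stop() {
    if (worker != null) {
      worker.interrupt();
    }
  }
}
```

Because pollBatch() never blocks, the operator thread returns promptly on every call even while a long listing is still in progress, which is what would keep it from being considered hung.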
