[ 
https://issues.apache.org/jira/browse/NIFI-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro D'Armiento updated NIFI-6462:
----------------------------------------
    Description: 
h2. Current Situation

ListHDFS is designed to be (only) the entry point of a data integration 
pipeline, and therefore can only be triggered on a cron or timer schedule.
h2. Improvement Proposal

ListHDFS should be usable as part of a pipeline even when it is not the 
entry point. To achieve this:
 * It has to be triggerable by an incoming flowfile
 * The trigger flowfile should be able to carry the listing directory as an 
attribute
 * Some logic, such as "skip the last file in the listing directory", should 
be made optional
 * Since the processor will work with 1:N semantics (1 input trigger flowfile, 
N output flowfiles), it would be nice to support fragmentation attributes (for 
example, for subsequent merge operations)
 ** It would also be useful to support different fragmentation strategies, in 
order to support multiple use cases. For example, it should be possible to 
select:
 *** A "one for all" fragmentation strategy, which creates a single 
fragmentation group. All files therefore have the same 
fragment.identifier and the same fragment.count, equal to the total number N 
of listed files, and fragment.index ∈ [0, N).
 *** A "per subdir" fragmentation strategy, which creates a separate 
fragmentation group for each scanned subdirectory of the given path. 
The flowfiles of the i-th subdirectory therefore share a 
fragment.identifier, have fragment.count equal to the number Ni of files in 
that directory, and have fragment.index ∈ [0, Ni).

  was:
h2. Current Situation

ListHDFS is designed to be (only) the entry point of a data integration 
pipeline, and therefore can only be triggered on a cron or timer schedule.
h2. Improvement Proposal

ListHDFS should be usable as part of a pipeline even when it is not the 
entry point. To achieve this:
 * It has to be triggerable by an incoming flowfile
 * The trigger flowfile should be able to carry the listing directory as an 
attribute
 * Some logic, such as "skip the last file in the listing directory", should 
be made optional
 * Since the processor will work with 1:N semantics (1 input trigger flowfile, 
N output flowfiles), it would be nice to support fragmentation attributes (for 
example, for subsequent merge operations)
 * It would also be useful to support different fragmentation strategies, in 
order to support multiple use cases. For example, it should be possible to 
select:
 * A "one for all" fragmentation strategy, which creates a single 
fragmentation group. All files therefore have the same 
fragment.identifier and the same fragment.count, equal to the total number N 
of listed files, and fragment.index ∈ [0, N).
 * A "per subdir" fragmentation strategy, which creates a separate 
fragmentation group for each scanned subdirectory of the given path. 
The flowfiles of the i-th subdirectory therefore share a 
fragment.identifier, have fragment.count equal to the number Ni of files in 
that directory, and have fragment.index ∈ [0, Ni).


> ListHDFS should be triggerable
> ------------------------------
>
>                 Key: NIFI-6462
>                 URL: https://issues.apache.org/jira/browse/NIFI-6462
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.9.2
>            Reporter: Alessandro D'Armiento
>            Priority: Minor
>
> h2. Current Situation
> ListHDFS is designed to be (only) the entry point of a data integration 
> pipeline, and therefore can only be triggered on a cron or timer schedule.
> h2. Improvement Proposal
> ListHDFS should be usable as part of a pipeline even when it is not the 
> entry point. To achieve this:
>  * It has to be triggerable by an incoming flowfile
>  * The trigger flowfile should be able to carry the listing directory as an 
> attribute
>  * Some logic, such as "skip the last file in the listing directory", 
> should be made optional
>  * Since the processor will work with 1:N semantics (1 input trigger 
> flowfile, N output flowfiles), it would be nice to support fragmentation 
> attributes (for example, for subsequent merge operations)
>  ** It would also be useful to support different fragmentation strategies, in 
> order to support multiple use cases. For example, it should be possible to 
> select:
>  *** A "one for all" fragmentation strategy, which creates a single 
> fragmentation group. All files therefore have the same 
> fragment.identifier and the same fragment.count, equal to the total number N 
> of listed files, and fragment.index ∈ [0, N).
>  *** A "per subdir" fragmentation strategy, which creates a separate 
> fragmentation group for each scanned subdirectory of the given path. 
> The flowfiles of the i-th subdirectory therefore share a 
> fragment.identifier, have fragment.count equal to the number Ni of files in 
> that directory, and have fragment.index ∈ [0, Ni).
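
The two fragmentation strategies described above can be sketched as follows. This is only an illustration of the intended attribute values, not NiFi code: the function name assign_fragments, the strategy labels "one_for_all" and "per_subdir", and the choice to group files by their parent directory are all assumptions made for the example. NiFi stores attributes as strings; plain integers are used here for readability.

```python
import uuid
from collections import defaultdict
from posixpath import dirname


def assign_fragments(paths, strategy="one_for_all"):
    """Return one attribute dict per listed file path.

    Hypothetical helper: the strategy labels are illustrative names for
    the "one for all" and "per subdir" strategies in the proposal.
    """
    if strategy not in ("one_for_all", "per_subdir"):
        raise ValueError(f"unknown strategy: {strategy}")

    # Group the listed files: a single group for "one_for_all",
    # one group per parent directory for "per_subdir" (an assumption).
    groups = defaultdict(list)
    for path in paths:
        key = "" if strategy == "one_for_all" else dirname(path)
        groups[key].append(path)

    attributes = []
    for members in groups.values():
        # Every flowfile in a group shares the same identifier and count.
        identifier = str(uuid.uuid4())
        for index, path in enumerate(members):
            attributes.append({
                "path": path,
                "fragment.identifier": identifier,
                "fragment.count": len(members),  # N, or Ni per subdirectory
                "fragment.index": index,         # in [0, N) or [0, Ni)
            })
    return attributes


listing = ["/data/a/1.csv", "/data/a/2.csv", "/data/b/3.csv"]
for attrs in assign_fragments(listing, "per_subdir"):
    print(attrs)
```

With "one_for_all", the three files above would form one group with fragment.count 3. With "per_subdir", /data/a would form a group of 2 and /data/b a group of 1, each with its own fragment.identifier, so a downstream merge could reassemble each directory independently.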



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
