Alan Jackoway created NIFI-2705:
-----------------------------------
Summary: ListHDFS Cannot Be Re-run
Key: NIFI-2705
URL: https://issues.apache.org/jira/browse/NIFI-2705
Project: Apache NiFi
Issue Type: Bug
Components: Core Framework, Documentation & Website
Affects Versions: 1.0.0
Reporter: Alan Jackoway
I have a use case where, every day, I want to go through a directory in HDFS
and process the files that are more than a month old.
I was trying to do this with a flow like ListHDFS -> RouteOnAttribute
(hdfs.lastModified) -> FetchHDFS -> Processing.
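The age check in the RouteOnAttribute step can be sketched roughly as follows. This is only an illustration of the intended condition, not NiFi configuration: the function name `is_older_than` is hypothetical, `hdfs.lastModified` is assumed to hold epoch milliseconds, and "a month" is approximated as 30 days.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000


def is_older_than(last_modified_ms: int, now_ms: int, days: int = 30) -> bool:
    """Return True if the file was last modified more than `days` days ago.

    Hypothetical helper: assumes both timestamps are epoch milliseconds.
    """
    return now_ms - last_modified_ms > days * MILLIS_PER_DAY


# Illustrative values only: "now" is day 100, one file is 45 days old,
# the other is 5 days old.
now_ms = 100 * MILLIS_PER_DAY
old_file_ms = now_ms - 45 * MILLIS_PER_DAY
recent_file_ms = now_ms - 5 * MILLIS_PER_DAY

route_old = is_older_than(old_file_ms, now_ms)
route_recent = is_older_than(recent_file_ms, now_ms)
```

Only files for which the check is true would continue on to FetchHDFS.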
However, after the flow ran once, the old files were no longer listed. I turned
on debug logging and got this:
{noformat}
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] Found a total of 3 files in HDFS
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] Of the 3 files found in HDFS, 0 are listable
2016-08-30 06:15:17,473 DEBUG [Timer-Driven Process Thread-9] o.apache.nifi.processors.hadoop.ListHDFS ListHDFS[id=d80a1ceb-0156-1000-595d-978dcf53ecb6] There is no data to list. Yielding.
{noformat}
It turns out that ListHDFS maintains state called {{latestTimestampListed}}
that prevents it from re-listing files unless you change the directory being
listed. At a minimum, that should be mentioned in the ListHDFS documentation.
A better fix would be to make this behavior configurable, more like GetHDFS.
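The effect described above can be reproduced with a simplified model. This is a sketch under stated assumptions, not the actual ListHDFS implementation: the class `TimestampLister` and its method are hypothetical, and the rule "only list files modified after the stored timestamp" is my reading of the behavior seen in the debug log.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileEntry:
    path: str
    last_modified: int  # epoch milliseconds (illustrative values below)


@dataclass
class TimestampLister:
    """Hypothetical stand-in for a lister that remembers the newest timestamp."""

    latest_timestamp_listed: int = -1

    def list_new(self, files: List[FileEntry]) -> List[FileEntry]:
        # Assumption: only files newer than the stored timestamp are listable.
        listable = [
            f for f in files if f.last_modified > self.latest_timestamp_listed
        ]
        if listable:
            # Remember the newest timestamp so these files are skipped next time.
            self.latest_timestamp_listed = max(f.last_modified for f in listable)
        return listable


files = [
    FileEntry("/data/a", 1000),
    FileEntry("/data/b", 2000),
    FileEntry("/data/c", 3000),
]

lister = TimestampLister()
first_run = lister.list_new(files)   # all three files are listable
second_run = lister.list_new(files)  # same directory, nothing new to list
```

In this model the second run finds the same 3 files but lists none of them, which matches the "Of the 3 files found in HDFS, 0 are listable" message in the log.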
In my case I think I can switch to GetHDFS without causing trouble, but the
behavior of ListHDFS was surprising to me and, as far as I can tell, is not
documented anywhere.