Anup,

The ListHDFS/FetchHDFS processors would allow you to pull new data from HDFS
without removing it.

But it sounds like what you want here is to also pull from local disk without
removing the files. The GetFile processor does not currently keep any state
about what it has pulled in. Adding that would likely be a fairly easy
modification to GetFile if it is reading from a local filesystem. If it is
reading from a network-mounted filesystem like NFS, it gets much more complex,
as the state would have to be shared across the cluster, as with ListHDFS.
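
To give a rough idea of the kind of state GetFile would need to keep in the
local-filesystem case, here is a minimal Python sketch. This is not NiFi code:
the function name list_new_files and the JSON state file are made up for
illustration. It remembers the newest modification time it has seen and only
returns files newer than that, leaving the originals in place.

```python
import json
import os
from pathlib import Path


def list_new_files(directory, state_path):
    """Return files modified since the last call, without deleting them.

    Hypothetical helper, not part of NiFi. The state lives in one JSON file
    on local disk, which is why this approach only works on a single node.
    """
    state_file = Path(state_path)
    last_seen = 0
    if state_file.exists():
        last_seen = json.loads(state_file.read_text())["last_mtime_ns"]

    new_files = []
    newest = last_seen
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        mtime = entry.stat().st_mtime_ns
        if mtime > last_seen:
            new_files.append(entry.path)
            newest = max(newest, mtime)

    # Persist the high-water mark so the next call skips what we already saw.
    state_file.write_text(json.dumps({"last_mtime_ns": newest}))
    return sorted(new_files)
```

A file that shows up later with exactly the same modification time as the
stored high-water mark would be missed by this sketch, and on a cluster every
node would need to see the same state file. That shared-state part is what
makes the network-mounted case harder.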

A few possible solutions that I could offer in the meantime (I realize none
is great, but they should work):

1. If you can move the data, you could use GetFile and then immediately route 
to PutFile. PutFile would then put the data to a different directory.

2. Similar to #1, you could use GetFile -> UpdateAttribute -> PutFile and put
the data back in the same directory, using UpdateAttribute to change the
filename, perhaps to "${filename}.pulled", and then configure GetFile to
ignore files that end with ".pulled".

3. Use GetFile and configure it with a "Maximum File Age" of, say, 10 minutes,
and run it only every 5 minutes. Then use DetectDuplicate and throw away any
duplicates. The downside here is that you would potentially pull in the same
data a couple of times, which is not very efficient. If there is a huge amount
of data coming in, this may be less than ideal. But if the data is coming in
slowly, like 10 MB/sec, then maybe this is fine.
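
For options 2 and 3, the two pieces of logic can be sketched in plain Python
(again, not NiFi code). The first part is a regular expression that matches
any name not ending in ".pulled"; I am assuming here that GetFile's file
filter is a regular expression applied to the whole file name, so please check
that against your version. The second part shows what the overlap in option 3
amounts to: with a 10 minute maximum age and a 5 minute schedule, each file is
seen roughly twice, and the second copy is dropped. The names should_pick_up
and drop_duplicates are made up, and keying on a hash of the content is an
assumption about how you would configure DetectDuplicate.

```python
import hashlib
import re

# Assumed filter for option 2: negative lookahead rejects names ending in ".pulled".
NOT_PULLED = re.compile(r"^(?!.*\.pulled$).*$")


def should_pick_up(filename):
    """True if the file has not been renamed with the ".pulled" suffix yet."""
    return NOT_PULLED.fullmatch(filename) is not None


def drop_duplicates(batches):
    """Option 3 in miniature: overlapping polls, duplicates dropped by content hash.

    `batches` is a list of polls; each poll is a list of (name, content_bytes).
    """
    seen = set()
    unique = []
    for batch in batches:
        for name, content in batch:
            key = hashlib.sha256(content).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(name)
    return unique
```

Note that the cost of option 3 is visible here: every file is read and hashed
on each poll that sees it, even though only the first copy is kept.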

Does any of this help?

Thanks
-Mark

----------------------------------------
> Date: Thu, 21 May 2015 20:01:30 -0700
> From: [email protected]
> To: [email protected]
> Subject: Re: Fetch change list
>
> Hi Mark,
> I downloaded the latest version and I see that the FetchHDFS processor
> could be used for my delta files that have arrived in HDFS. But how do I
> maintain a *sync* from a local file system to my HDFS? I cannot move files
> from the local filesystem. They need to be copied.
>
> I'm facing issues with queueing trying to maintain a sync.
>
> Any thoughts on how I could tackle this issue?
>
> Regards,
> anup
>
>
>
> --
> View this message in context: 
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1615.html
> Sent from the Apache NiFi (incubating) Developer List mailing list archive at 
> Nabble.com.
                                          
