dev,

I was having a chat with Pierre about PR#379 and we thought it would be
worth sharing this with the wider group:


I recently noticed that we have merged a number of PRs around
scale-out/cloud-based object stores into master.

Would it make sense to consider adopting a pattern where
Put/Get/ListHDFS are used in tandem with implementations of the
hadoop.filesystem interfaces instead of creating new processors, except
where a particular deficiency/incompatibility in the hadoop.filesystem
implementation exists?
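
To illustrate the idea (a rough sketch in plain Hadoop client code, not
NiFi internals; the URIs and the core-site.xml path are made up), the
Hadoop FileSystem API already picks the concrete implementation from the
URI scheme plus the fs.<scheme>.impl mapping in the configuration, which
is why a single set of *HDFS processors could cover all of these stores:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemeResolutionSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // core-site.xml (or whatever resources the processor is pointed at)
            // supplies the scheme-to-implementation mappings,
            // e.g. fs.alluxio.impl, fs.gs.impl, fs.adl.impl.
            conf.addResource(new Path("/path/to/core-site.xml"));

            // Same API call regardless of the backing store; only the URI scheme changes.
            FileSystem alluxio = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);
            FileSystem gcs = FileSystem.get(URI.create("gs://some-bucket/"), conf);

            // Listing works the same way against either store.
            for (FileStatus status : alluxio.listStatus(new Path("/data"))) {
                System.out.println(status.getPath());
            }
        }
    }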

Candidates for removal / non-merge would be:

- Alluxio (PR#379)
- WASB (PR#626)
- Azure* (PR#399)
- *GCP (recently merged as PR#1482)
- *S3 (although this one has been in the codebase for a while, so it would have to be deprecated rather than removed)

The pattern would be pretty much the same as the one documented and
successfully deployed here:

https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html

Which means that in the case of Alluxio, one would use the properties
documented here:

https://www.alluxio.com/docs/community/1.3/en/Running-Hadoop-MapReduce-on-Alluxio.html

While with Google Cloud Storage we would use the properties documented here:

https://cloud.google.com/hadoop/google-cloud-storage-connector
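
Concretely, the settings from those two pages boil down to a handful of
entries in the Hadoop configuration that the *HDFS processors already
consume via their Hadoop Configuration Resources property (plus the
relevant client/connector jars on the processor's classpath). A rough
sketch of what those entries look like, expressed programmatically; the
property names are taken from the linked documents, while the project id,
keyfile path and exact class names would need to be checked against the
connector versions in use:

    import org.apache.hadoop.conf.Configuration;

    public class ObjectStoreConfSketch {

        // Alluxio: map the alluxio:// scheme to Alluxio's Hadoop-compatible client.
        static Configuration alluxioConf() {
            Configuration conf = new Configuration();
            conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
            return conf;
        }

        // Google Cloud Storage: map the gs:// scheme to the GCS connector and
        // point it at a project and service account credentials.
        static Configuration gcsConf() {
            Configuration conf = new Configuration();
            conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            conf.set("fs.gs.project.id", "my-project");                   // hypothetical project id
            conf.set("google.cloud.auth.service.account.enable", "true");
            conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json"); // hypothetical path
            return conf;
        }
    }

In practice these would simply live in the core-site.xml (or an extra
*-site.xml) handed to the processor, rather than being set in code.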

I noticed that dedicated processors could have the ability to handle
properties that are particular to a filesystem; however, I would like to
believe the same issue affects Hadoop users as well, and it is therefore
reasonable to expect the Hadoop-compatible implementations to have ways of
exposing those properties too.

In the case those properties are exposed, we could perhaps simply adjust
the *HDFS processors to accept dynamic properties and pass them to the
underlying module, therefore providing a way to make use of particular
settings of the underlying storage platform.
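
In other words the change could be as small as the hypothetical helper
below (the names are mine, not the existing processor API): whatever
dynamic properties the user adds on the processor get copied verbatim into
the Hadoop Configuration before the FileSystem is created.

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    public class DynamicPropertySketch {

        // Hypothetical helper: copy user-supplied dynamic properties straight into
        // the Hadoop Configuration, so store-specific settings (Alluxio, GCS, ADLS, ...)
        // can be tuned per processor without adding dedicated properties for each store.
        static Configuration applyDynamicProperties(Configuration conf,
                                                    Map<String, String> dynamicProperties) {
            for (Map.Entry<String, String> entry : dynamicProperties.entrySet()) {
                conf.set(entry.getKey(), entry.getValue());
            }
            return conf;
        }
    }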

Any opinions would be welcome.

PS: sent again with the proper subject label.
