This is a very interesting area of integration investigation, Marc. Thank you for sharing your work!

I looked into this a little after conversations with folks in security applications, and I wonder whether you investigated approaches to tracking and reporting or handling packet loss and error rates here. The interest was in reasoning about loss rates and the completeness of received data, which I think a simple merge > put > diode > get > unpack chain would not manage.

I was looking at Longhair <https://github.com/catid/longhair>, and similar Reed-Solomon approaches, as a method of breaking down arbitrary files and transmitting them for reconstitution over diodes that may be lossy in field scenarios. I also looked a little into transmitting manifests for downstream reconciliation, but this turned out to be a more complex operation than would suit a pure NiFi implementation, so I started down the path of Kafka/Flink as a streaming-reconciliation service, but quickly realised I was creating a monster without commercial interest :)

Both approaches are more practical for fewer, larger files than for millions of tiny messages, and if you had very reliable diode transmission the overhead of ECC/reconciliation may not be worthwhile.

Other implementations I had seen (like ZeroMQ radio/dish or blindFTP) seemed to talk about provable delivery as a potential requirement, but I only found the more simplistic 'my network is reliable and any packet loss is negligible anyway' approaches. I suspect the implementations of these more robust approaches are reserved for commercial offerings...

Anyway, I appreciate that you may not be able to share more details on this, but you reminded me of how much I enjoyed the investigation when I looked at it, so I thought I'd say thanks for that.

On Tue, Aug 3, 2021 at 2:55 AM Phil H <[email protected]> wrote:

> Adam, that's true, although if your data size is larger than network
> MTU there can be some disconnect there.
>
> Connection per flow file is pretty slow for sustained high traffic
> flows though (can't recall the establishment times off the top of my
> head, but they are non-trivial).
>
> On Tue, Aug 3, 2021 at 8:39 AM Adam Taft <[email protected]> wrote:
> >
> > Just spitballing a little here. If you set the configuration of the
> > PutTCP processor property "Connection per Flowfile" to 'true' and you
> > leave the "Outgoing Message Delimiter" as blank (none), then I don't
> > think you have the delimiter problem that you both are describing. I
> > could be wrong though?
> >
> > I would consider it a bug if you couldn't send a "raw"
> > connection-oriented object over PutTCP. With that processor, the goal
> > would be to: a) open a socket, b) dump whatever binary you have
> > prepared over it, c) close the socket to signal completion of
> > transfer. If PutTCP doesn't work this way (byte-for-byte), it should
> > probably be flagged as a bug (its original intention was exactly this
> > use case).
> >
> > That being said, I still think custom FlowFile serialization might be
> > something that is outside of the concern of the transport. I
> > personally think serializing/deserializing is a different concern
> > from transport. Arguably, sometimes the semantics of the transport
> > protocol require you to prepare the message itself in a protocol
> > accommodating way (HTTP being an obvious example of this, or packet
> > ordering in Marc's UDP example). But a new JSON flowfile
> > serialization seems like it could be a separate processor, not
> > commingled into an existing one.
> >
> > MergeContent / UnpackContent work in tandem and have a "FlowFile
> > Stream v3" format that can serialize/deserialize multiple flowfiles
> > together into a single byte stream. This allows transport over any
> > protocol, including file-based, socket-based, etc.
> >
> > Marc: Your mention of performance is, of course, appropriate for the
> > scale that you're talking about (Gbps).
> > Maybe there are some performance improvements that could be garnered
> > from your work applicable to the "standard" processors I mentioned.
> > And I definitely didn't mean to imply you were doing "anything
> > wrong". Just legitimately curious as to your thought process and
> > design approach.
> >
> > OK, I'll step off a little, because I might be probing too hard here.
> > But I was legitimately curious about the intention of the proposed
> > processor as it relates to the mentioned Diode device.
> >
> > Thanks,
> >
> > Adam
> >
> >
> > On Mon, Aug 2, 2021 at 4:15 PM Phil H <[email protected]> wrote:
> >
> > > Hi Marc,
> > >
> > > Thanks for the additional info. Just so you know you’re not the
> > > only one, I’ve also had to re-implement a ListenTCP alternative to
> > > get around the byte delimiter issue for binary and multiline text
> > > data.
> > >
> > > Phil
> > >
> > >
> > > On Tue, Aug 3, 2021 at 6:59 AM Marc <[email protected]> wrote:
> > > >
> > > > Hi Adam,
> > > >
> > > > More or less it is a 'merge', PutTCP, ListenTCP and unpack. I
> > > > hope that I am not wrong, but the NiFi ListenTCP processor uses a
> > > > delimiter (\n as default?). If you are transferring binary data
> > > > the processor splits the flow into 'pieces'. And the attributes
> > > > are not transferred to the destination.
> > > >
> > > > But your idea describes what the processor is doing.
> > > >
> > > > 1. It converts the attributes to a json string
> > > > 2. It transfers the json string and the payload (there is a
> > > >    header that tells the destination how long the json header and
> > > >    how long the payload is)
> > > > 3. The Listener gets the flow and decodes the header (to get the
> > > >    size of the json header and the payload)
> > > > 4. It writes the payload to a flow
> > > > 5. It converts the json string and sets the attributes to the flow
> > > >
> > > > If you do not want to transfer attributes you can configure a
> > > > different decoder.
> > > > In this case you can just 'netcat' a binary file to NiFi.
> > > >
> > > > The UDP version is far more complex. There must be a counter to
> > > > tell the destination what part of the flow file was received
> > > > (even in a diode environment packets are not always received in
> > > > the right order!). And you must be fast, very fast. It is a
> > > > multithreaded architecture because one thread cannot receive,
> > > > decode, and write a gigabit per second. I used the disruptor
> > > > library. Receive a packet in one thread, decode it in another
> > > > thread. A third thread gets the packet and writes the content in
> > > > the right order to a flow.
> > > >
> > > > I am still learning (and I am not a professional software
> > > > developer). If I did something wrong or overlooked something
> > > > please tell me.
> > > >
> > > > Marc
> > > >
> > > > > On 02.08.2021 at 22:01, Adam Taft <[email protected]> wrote:
> > > > >
> > > > > Marc,
> > > > >
> > > > > How would this differ from a more generic use of the existing
> > > > > processors, PutTCP/ListenTCP and PutUDP/ListenUDP? I'm not sure
> > > > > what value is being added above these existing processors, but
> > > > > I'm sure I'm missing something.
> > > > >
> > > > > There's already an ability to serialize flowfiles via
> > > > > MergeContent. And there's the deserialize side in
> > > > > UnpackContent. So a dataflow that looks like the following
> > > > > would seem a reasonable approach to the problem:
> > > > >
> > > > > MergeContent -> PutTCP -> {diode} -> ListenTCP -> UnpackContent
> > > > >
> > > > > I'm actually very interested in this topic, having a project
> > > > > that has a use case for a "diode". So I'm legitimately asking
> > > > > here, not trying to derail your work.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > > Adam
> > > > >
> > > > > On Sun, Aug 1, 2021 at 12:26 PM Marc <[email protected]> wrote:
> > > > >
> > > > >> Greetings,
> > > > >>
> > > > >> There are companies and organizations that strictly separate
> > > > >> their networks for security reasons. Such companies often use
> > > > >> diodes to achieve this. But of course they still have to
> > > > >> exchange data between the networks (e.g. transfer data from
> > > > >> 'low' to 'high'). There are at least two kinds of diodes. Some
> > > > >> hardware-based ones only use one fiber optic to send data (UDP
> > > > >> based). Others use TCP, but prevent sending in the reverse
> > > > >> direction.
> > > > >>
> > > > >> NiFi is an amazing tool that allows data to be transferred
> > > > >> between two separate networks in a very flexible but also
> > > > >> secure way. I have implemented two processors. The first one
> > > > >> 'merges' the attributes and the content of a flowfile and sends
> > > > >> it to the destination. The second one listens on a TCP port,
> > > > >> splits attributes and content and creates a new flowfile
> > > > >> containing all attributes of the origin flow. You can send the
> > > > >> flow without attributes as well. In this case you can easily
> > > > >> netcat a binary file to NiFi.
> > > > >>
> > > > >> These two processors are useful if you do NOT have a
> > > > >> bidirectional communication between two NiFi instances and
> > > > >> therefore the site-2-site mechanism or http(s) cannot be used.
> > > > >>
> > > > >> We have been using these processors for a longer period of time
> > > > >> (exactly the version for 1.13.2) and would like to share these
> > > > >> processors with others. So the question to you all is: Is
> > > > >> someone interested in these processors or is this use case too
> > > > >> special?
> > > > >>
> > > > >> The current source code can be found on GitHub.
> > > > >> (https://github.com/nerdfunk-net/diode/)
> > > > >>
> > > > >> I have also implemented a UDP based version of the processor.
> > > > >> Due to the nature of UDP, this is more complex and these
> > > > >> processors are now being tested.
> > > > >>
> > > > >> Best regards
> > > > >> Marc
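
For readers following the thread, the five steps Marc lists (attributes to JSON, a header carrying both lengths, decode on the listener side) can be sketched roughly as follows. This is a minimal illustration only, not the code from the diode repository: the header layout (two unsigned 32-bit big-endian lengths) and the names `encode_frame` and `decode_frame` are assumptions.

```python
import json
import struct
from typing import Dict, Tuple

# Assumed header layout (hypothetical, not taken from the diode project):
# two unsigned 32-bit big-endian integers, JSON length then payload length.
HEADER = struct.Struct(">II")


def encode_frame(attributes: Dict[str, str], payload: bytes) -> bytes:
    """Serialize attributes to JSON and prepend both lengths to the frame."""
    attr_bytes = json.dumps(attributes).encode("utf-8")
    return HEADER.pack(len(attr_bytes), len(payload)) + attr_bytes + payload


def decode_frame(frame: bytes) -> Tuple[Dict[str, str], bytes]:
    """Read the header, then slice out the JSON attributes and the payload."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    attr_len, payload_len = HEADER.unpack_from(frame, 0)
    attr_end = HEADER.size + attr_len
    payload_end = attr_end + payload_len
    if len(frame) < payload_end:
        raise ValueError("truncated frame")
    attributes = json.loads(frame[HEADER.size:attr_end].decode("utf-8"))
    return attributes, frame[attr_end:payload_end]
```

Because the lengths are explicit, the payload may contain newlines or any other byte without being split, which is the delimiter problem discussed above.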

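
On the question of tracking loss over a UDP diode: a per-datagram counter, as Marc describes, also lets the receiver report which parts never arrived. Below is a minimal sketch under stated assumptions: the header of (sequence number, total count) and the names `split_into_datagrams` and `Reassembler` are invented for illustration, and no error correction such as Reed-Solomon is attempted.

```python
import struct
from typing import Dict, List, Optional

# Assumed datagram header (hypothetical): sequence number and total chunk
# count, both unsigned 32-bit big-endian.
CHUNK_HEADER = struct.Struct(">II")


def split_into_datagrams(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into chunks and prefix each with (sequence, total)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # an empty file still needs one datagram
    total = len(chunks)
    return [CHUNK_HEADER.pack(seq, total) + c for seq, c in enumerate(chunks)]


class Reassembler:
    """Collects datagrams in any order and reports gaps."""

    def __init__(self) -> None:
        self.total: Optional[int] = None
        self.chunks: Dict[int, bytes] = {}

    def receive(self, datagram: bytes) -> None:
        seq, total = CHUNK_HEADER.unpack_from(datagram, 0)
        self.total = total
        self.chunks[seq] = datagram[CHUNK_HEADER.size:]

    def missing(self) -> List[int]:
        """Sequence numbers not yet seen (empty if nothing arrived at all)."""
        if self.total is None:
            return []
        return [s for s in range(self.total) if s not in self.chunks]

    def loss_rate(self) -> float:
        """Fraction of expected datagrams that have not arrived."""
        if not self.total:
            return 0.0
        return len(self.missing()) / self.total

    def assemble(self) -> bytes:
        """Join chunks in sequence order; refuse if anything is missing."""
        if self.total is None or self.missing():
            raise ValueError("incomplete transfer")
        return b"".join(self.chunks[s] for s in range(self.total))
```

If nothing at all arrives, the receiver cannot know a transfer was attempted, which is where a separately transmitted manifest would come in.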