Hi,

I recently needed to match OS fingerprints from all the Internet Census 2012 data collection ( http://internetcensus2012.bitbucket.org/paper.html ). To do that, I found fingermatch - a tool that reads an Nmap fingerprint from standard input and (after my modifications) prints a single line of output. I then needed a tool to merge columns from the input with that output, so that if I entered:
<some ip> <timestamp> <fingerprint>

the <fingerprint> column went to fingermatch's standard input, my supervisor script read its output and printed something like this:

<some ip> <timestamp> <fingermatch output>

I quickly realized that fingermatch's performance wasn't satisfactory for me, so I enhanced my supervisor script with multithreading support. I wasn't aware of GNU parallel back then, so I reimplemented some of its functionality myself. Now that I have another massive task to perform (bulk rDNS querying of some of the hosts), I wanted to use GNU parallel for it, but even after reading the tool's man page (except for the examples, so far), I find it hard to replicate the following pattern:

1. Read lots of newline-delimited input
2. Spawn N processes
3. Feed all the idle processes (i.e. those not in the middle of a read operation) with the input, line by line
4. Perform a blocking read on the processes in order to read the output line from them
5. Print the line in a synchronized manner, so that stdout from the programs doesn't overlap (AFAIR, GNU parallel already does that)
6. Should any of the processes die, respawn it

How much of this functionality can currently be achieved with GNU parallel? Please note the shift from many short-lived worker processes that terminate after one piece of input to a few long-lived processes whose state is monitored (whether we're reading from them or not).

Yours,
Jacek Wielemborek
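[Editor's note: the six-step pattern above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the author's actual supervisor script: the stand-in worker merely upper-cases each input line (fingermatch would take its place), N is fixed at 2, and the inputs are made-up strings.]

```python
import queue
import subprocess
import sys
import threading

# Stand-in worker (assumption: like fingermatch, it reads one line from
# stdin and answers with exactly one line on stdout).
WORKER_CMD = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line: break\n"
    "    print(line.strip().upper(), flush=True)",
]

tasks = queue.Queue()          # step 1: newline-delimited input, line per item
results = []
results_lock = threading.Lock()  # step 5: keep output lines from interleaving

def supervise():
    """Own one long-lived worker: feed it lines, block on its answer,
    and respawn it if it dies (steps 3, 4 and 6)."""
    proc = None
    while True:
        try:
            line = tasks.get_nowait()
        except queue.Empty:
            break
        while True:
            if proc is None or proc.poll() is not None:
                proc = subprocess.Popen(
                    WORKER_CMD, stdin=subprocess.PIPE,
                    stdout=subprocess.PIPE, text=True)
            try:
                proc.stdin.write(line + "\n")
                proc.stdin.flush()
                answer = proc.stdout.readline()  # step 4: blocking read
            except (BrokenPipeError, OSError):
                proc = None   # worker died while we wrote; respawn and retry
                continue
            if answer:
                break
            proc = None       # EOF: worker died mid-read; respawn and retry
        with results_lock:
            results.append(answer.rstrip("\n"))
    if proc is not None:
        proc.stdin.close()
        proc.wait()

for item in ["alpha", "beta", "gamma", "delta"]:
    tasks.put(item)

threads = [threading.Thread(target=supervise) for _ in range(2)]  # step 2: N = 2
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # sorted because completion order across workers varies
```

The key difference from plain `parallel worker ::: lines` is visible in `supervise()`: each thread keeps a single worker process alive across many input lines and only restarts it on failure, instead of paying a process spawn per line.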
