Hello, I've found some very interesting behaviour when subjecting various awk implementations to some very specific circumstances.
I'm basically looking for a sanity check here to confirm if I'm just wildly
flailing, or if I am indeed onto something here.
Here's my situation:
When parsing some RIR data in parallel using awk with xargs, I seem to have
found a way to reliably lose and/or mangle output with parallel xargs. My
google-fu seems to be failing me. I understand that xargs does not buffer
output and that lines may arrive out of order, but in this case I am reliably
and reproducibly losing data and receiving mangled output. But wait, it gets
stranger.
I don't want to lose you guys here with a long-winded explanation, so I'm going
to show you a diff that shows reproducibly mangled output when using xargs in
parallel mode:
--- /tmp/bad.txt Wed Apr 14 21:06:51 2021
+++ /tmp/good.txt Wed Apr 14 21:06:41 2021
@@ -1,5 +1,3 @@
-267386
-A264890
AS262399
AS262400
AS262401
@@ -1774,6 +1772,7 @@
AS264887
AS264888
AS264889
+AS264890
AS264891
AS264892
AS264893
@@ -3552,6 +3551,7 @@
AS267383
AS267384
AS267385
+AS267386
AS267387
AS267388
AS267389
@@ -4220,6 +4220,7 @@
AS268318
AS268319
AS268320
+AS268320
AS268321
AS268321
AS268323
@@ -7785,6 +7786,7 @@
AS270633
AS270633
AS270634
+AS270634
AS270635
AS270635
AS270636
@@ -10277,5 +10279,3 @@
AS46210
AS46280
AS46280
-ASAS268320
-ASS270634
The only thing that changed between these runs was me using either xargs -P 1
or -P 2.
To allow folks to follow along with me at home, I've included the two files
(gzipped for politeness) I used to trigger this behaviour.
Once you've extracted the attached text files into your working directory,
here's a snippet that should reproduce my issue:
$ printf 'BR\nCA\n' > cc.txt
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt
What does this one-liner do? It's supposed to slurp the country codes
specified in cc.txt into an array, and then check the first field of each row
of the RIR data against that array. If the first field matches a country code
in the array and the second field indicates that this row is an ASN record,
then we print the third field prefixed with 'AS'. If you grep the output of the
above command for the strings "ASAS", "ASS" or "A2", you should see some
mangled ASNs. If you change "-P 2" to "-P 1", this mangling will not occur.
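For anyone who would rather see the logic in isolation, here is the same awk
program spread over several lines and run against a few made-up rows. The
sample rows and the scratch directory are assumptions for illustration only;
the field layout (country code, record type, value) is inferred from the
one-liner, not copied from the attached files.

```sh
#!/bin/sh
# Hypothetical scratch directory and sample rows; the real inputs are the
# attached 1.txt and 2.txt.
tmp=$(mktemp -d)
printf 'BR\nCA\n' > "$tmp/cc.txt"
printf '%s\n' 'BR|asn|262399' 'US|asn|3356' 'CA|asn|46210' 'BR|ipv4|200.0.0.0' \
    > "$tmp/sample.txt"

# Same program as the one-liner, with one rule per line.
result=$(awk -F '|' '
    # First file (cc.txt): remember each country code, then move on.
    NR == FNR { A[$1] = 1; next }
    # Later files: print the third field of ASN rows for a listed country.
    $1 in A && $2 == "asn" { printf("AS%s\n", $3) }
' "$tmp/cc.txt" "$tmp/sample.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With these four sample rows, only the BR and CA rows of type "asn" should come
out, one ASN per line.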
Here's where things get very weird. While parsing this data (as part of a
larger dataset comprising an aggregation of all the registrar delegation
statistics) I've been using this snippet for a while to quickly fetch ASN
records. It is not until I have BOTH the BR and CA country codes in the array
that I can trigger this bug. I can have any number of country codes in the
array, but if Brazil AND Canada happen to be specified in the array, then I get
mangled output, but ONLY if executed with parallel xargs. This reproducibly
happens when using awk, gawk or mawk. To further melt your brain, this
behaviour has NOT been observed when using goawk, a POSIX-compliant awk
implementation written in Go.
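One way to narrow this down would be to take the shared pipe out of the
picture. The sketch below assumes (and this is only an assumption, nothing
above establishes it) that the mangling comes from both awk processes writing
into the same output stream. Each awk gets its own output file, and the results
are combined afterwards. The prog.awk file name, the ".out" suffix and the
sample rows are all made up for illustration.

```sh
#!/bin/sh
# Hypothetical scratch directory with tiny stand-ins for 1.txt and 2.txt.
tmp=$(mktemp -d)
printf 'BR\nCA\n' > "$tmp/cc.txt"
printf '%s\n' 'BR|asn|262399' 'BR|asn|262400' > "$tmp/1.txt"
printf '%s\n' 'CA|asn|46210' 'US|asn|3356' > "$tmp/2.txt"

# Same awk program as the one-liner, kept in a file to avoid nested quoting.
cat > "$tmp/prog.awk" <<'EOF'
NR == FNR { A[$1] = 1; next }
$1 in A && $2 == "asn" { printf("AS%s\n", $3) }
EOF

# Each parallel awk writes to its own <input>.out file instead of a shared pipe.
(
    cd "$tmp" || exit 1
    find . -type f -name "[12].txt" -print0 |
        xargs -0 -n 1 -P 2 -- \
            sh -c 'awk -F "|" -f prog.awk cc.txt "$1" > "$1.out"' sh
)

# Combine the per-file results only after every awk has finished.
combined=$(cat "$tmp/1.txt.out" "$tmp/2.txt.out" | sort)
printf '%s\n' "$combined"
rm -rf "$tmp"
```

If the -P 2 and -P 1 hashes agree when run this way against the real files,
that would point at the shared output stream rather than at awk's parsing.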
Just to prove my point, here's me comparing the output hashes between various
awk implementations with my above one-liner:
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- awk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
2a20f44ce6a23d5c49b05b9f2689ef93
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- awk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- mawk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
2a20f44ce6a23d5c49b05b9f2689ef93
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- mawk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
---
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 2 -- \
    ~/go/bin/goawk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
$ find . -type f -name "[12].txt" -print0 | xargs -0 -n 1 -P 1 -- \
    ~/go/bin/goawk -F '|' \
    'NR==FNR { A[$1]=1 ; next } $1 in A && $2 == "asn" { printf("AS%s\n", $3) }' \
    cc.txt | sort | md5
9ab3dbfbff5746f059cdb35221ff73b1
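The hashes only say that the outputs differ, not which lines are damaged. A
quick way to list the damaged lines is to print everything that is not exactly
"AS" followed by digits. This is a minimal sketch: find_mangled is a made-up
helper name, and the sample input is a handful of lines copied from the diff
above rather than the full output.

```sh
#!/bin/sh
# Hypothetical helper: pass through only lines that are NOT a well-formed ASN.
find_mangled() {
    grep -Ev '^AS[0-9]+$'
}

# Sample lines taken from the diff: three mangled, two intact.
mangled=$(printf '%s\n' '267386' 'A264890' 'AS262399' 'ASAS268320' 'AS46280' |
    find_mangled)
printf '%s\n' "$mangled"
```

To use it on the real data, pipe the output of the xargs command into
find_mangled instead of into sort and md5. It only catches malformed lines; it
will not notice a well-formed ASN that is missing or duplicated.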
I've racked my brain and the internet for hours, I've tested and toiled, and
I'm left thoroughly perplexed. I now humbly ask the fine folks here in OpenBSD
Land for guidance, insight or suggestions.
As always, is this a bug, or am I holding it wrong?
Regards,
Jordan
Attachments:
1.txt.gz (application/gzip)
2.txt.gz (application/gzip)

