On 6/30/21 9:52 PM, Boyd Kelly wrote: > Hello, > > Not sure if I am doing this correctly, but I seem to have a problem sorting > when the first column of my data contains a uuid often with embedded > commas. Does sort deal with this properly? > > My file contains this: (where the first and last fields may contain > embedded commas. > > "}~,)4",nɛgɛso,yq,,vélo,,,,,0630 > "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens" > ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase" > > The following is not correctly sorted by the second field > $ sort -t "," -k 2 test2.csv > "}~,)4",nɛgɛso,yq,,vélo,,,,,0630 > "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens" > ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
When using the comma ',' character as separator, then the data in the literally goes until the first comma: "}~ "wjv( ARm}q0bjb# The --debug option shows that as well: $ sort --debug -t "," -k 2 test2.csv sort: text ordering performed using simple byte comparison "}~,)4",nɛgɛso,yq,,vélo,,,,,0630 _______________________________ ___________________________________ "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens" ________________________________________________ ______________________________________________________ ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase" ____________________________________________________ _______________________________________________________________ In that sense, it is not foreseen that the delimiter character is part of the fields. Obviously, the data comes from a tool which tries to work around this by enclosing such data with a " as string delimiter, but that is not what sort(1) is expecting. And to my knowledge, there's no way to tell sort that it should treat the delimiter character somewhere in there special. Instead, you should use a delimiter character which does or can not occur in the original data; e.g. a tab '\t', a semicolon ';', a hash '#', a '@' or maybe even a non-printable character. Alternatively, you could also enforce that all column values get quoted by "....": "}~,)4","nɛgɛso","yq","","vél","","","","","0630" "wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens" "ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630, phrase" and then use the decorate-sort-undecorate (DSU) approach by changing "," (quote-comma-quote) to a single and safe delimiter, sort by that delimiter, and then change the delimiter back to ",". $ sed 's/","/\t/g' < test2.csv-all-columns-quoted \ | sort -t $'\t' -k2,2 \ | sed 's/\t/","/g' "ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630, phrase" "wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens" "}~,)4","nɛgɛso","yq","","vél","","","","","0630" Still, this is only working under certain assumptions. E.g. how does it look when the original data has a "," inside? Or what if a field contains a newline character? An CSV export is not really optimal for this. If you have the data e.g. in LibreOffice Calc, the best bet is maybe to use the CSV export with "fixed column width" (which is not really comma-separated anymore), and then sort by specific offset-numbers }~,)4 nɛgɛso yq vélo 630 wjv(,4q_S}badenmuso cj soeur 0630, fam, gen ARm}q0bjb#a be baara kɛbk il travai 0630, phrase Then running sort with the DSU idiom: $ tr '\n' '\0' < test-with-fixed-column-width.csv \ | sort -z -k 1.11,1.23 \ | tr '\0' '\n' Of course, this still only works when none of the fields contain a newline (which is a ambiguity in the input, not in sort). Have a nice day, Berny