Re: Sorting csv with embedded commas problem

Bernhard Voelker Thu, 01 Jul 2021 15:06:03 -0700

On 6/30/21 9:52 PM, Boyd Kelly wrote:
> Hello,
> 
> Not sure if I am doing this correctly, but I seem to have a problem sorting
> when the first column of my data contains a uuid often with embedded
> commas.  Does sort deal with this properly?
> 
> My file contains this: (where the first and last fields may contain
> embedded commas.
> 
> "}~,)4",nɛgɛso,yq,,vélo,,,,,0630
> "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
> ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
> 
> The following is not correctly sorted by the second field
> $  sort -t "," -k 2 test2.csv
> "}~,)4",nɛgɛso,yq,,vélo,,,,,0630
> "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
> ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"


When using the comma ',' character as separator, then the data in the
literally goes until the first comma:

  "}~
  "wjv(
  ARm}q0bjb#

The --debug option shows that as well:

  $ sort --debug -t "," -k 2 test2.csv
  sort: text ordering performed using simple byte comparison
  "}~,)4",nɛgɛso,yq,,vélo,,,,,0630
      _______________________________
  ___________________________________
  "wjv(,4q_S}",badenmuso,cj,,soeur,,,,,"0630, fam, gens"
        ________________________________________________
  ______________________________________________________
  ARm}q0bjb#,a be baara kɛ.,bk,,il travaille.,,,,,"0630, phrase"
             ____________________________________________________
  _______________________________________________________________

In that sense, it is not foreseen that the delimiter character is part of
the fields.  Obviously, the data comes from a tool which tries to work around
this by enclosing such data with a " as string delimiter, but that is not what
sort(1) is expecting.  And to my knowledge, there's no way to tell sort that
it should treat the delimiter character somewhere in there special.

Instead, you should use a delimiter character which does or can not occur in
the original data; e.g. a tab '\t', a semicolon ';', a hash '#', a '@' or
maybe even a non-printable character.

Alternatively, you could also enforce that all column values get quoted by 
"....":

  "}~,)4","nɛgɛso","yq","","vél","","","","","0630"
  "wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens"
  "ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630, 
phrase"

and then use the decorate-sort-undecorate (DSU) approach by changing "," 
(quote-comma-quote)
to a single and safe delimiter, sort by that delimiter, and then change the 
delimiter back
to ",".

  $ sed 's/","/\t/g' < test2.csv-all-columns-quoted \
      | sort  -t $'\t' -k2,2 \
      | sed 's/\t/","/g'
  "ARm}q0bjb#","a be baara kɛ.","bk","","il travaille.","","","","","0630, 
phrase"
  "wjv(,4q_S}","badenmuso","cj","","soeur","","","","","0630, fam, gens"
  "}~,)4","nɛgɛso","yq","","vél","","","","","0630"

Still, this is only working under certain assumptions.  E.g. how does it look 
when the
original data has a "," inside?  Or what if a field contains a newline 
character?
An CSV export is not really optimal for this.

If you have the data e.g. in LibreOffice Calc, the best bet is maybe to use the 
CSV export
with "fixed column width" (which is not really comma-separated anymore), and 
then sort by
specific offset-numbers

}~,)4     nɛgɛso       yq            vélo                                       
                     630
wjv(,4q_S}badenmuso    cj            soeur                                      
          0630, fam, gen
ARm}q0bjb#a be baara kɛbk            il travai                                  
          0630, phrase

Then running sort with the DSU idiom:

  $ tr '\n' '\0' < test-with-fixed-column-width.csv \
      | sort -z -k 1.11,1.23 \
      | tr '\0' '\n'

Of course, this still only works when none of the fields contain a newline 
(which is a ambiguity
in the input, not in sort).

Have a nice day,
Berny

Re: Sorting csv with embedded commas problem

Reply via email to