At last night's meeting someone asked about ways to transpose the rows and columns of a CSV file that was too large to fit into memory and potentially had thousands of columns and hundreds of thousands of rows.

If your data has reasonably uniform cell widths (say, no cell holding more than 50 bytes), you could use a sparse file with a fixed-width record structure. You would make a single pass through your source data file, parsing the CSV only once, and make random (as in random access) seeks into the target file, using a simple formula to determine each cell's offset.
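A minimal sketch of that single pass in Python (the 50-byte cell limit and the up-front row count are assumptions you'd tune for your data; seeking past end-of-file before writing is what leaves the holes that make the file sparse):

```python
import csv

# Sketch only: assumes every cell fits in `cell` bytes and that the
# number of rows in the source is known (a cheap line-counting pre-scan
# can supply it if not). Cell (r, c) of the source lands at cell (c, r)
# of the target, whose rows are nrows cells wide.
def transpose_to_fixed(src_path, dst_path, nrows, cell=50):
    with open(src_path, newline="") as src, open(dst_path, "wb") as dst:
        for r, row in enumerate(csv.reader(src)):
            for c, value in enumerate(row):
                dst.seek((c * nrows + r) * cell)  # the offset formula
                dst.write(value.encode().ljust(cell, b"\0")[:cell])
```

On a filesystem that supports sparse files, those seeks past end-of-file cost no disk space until a block is actually written.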

(If a sparse file isn't doable, but you have lots of disk space, you can instead create the target file by padding it with enough nulls to hold the number of cells you are expecting. Using dd to copy blocks from /dev/zero would do the trick.)
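If you'd rather do that padding from the same script, here's a Python sketch of the dd trick (total size is just whatever your row count, column count, and cell width multiply out to):

```python
# Pre-fill the target with literal null blocks, the same effect as
# dd-ing blocks from /dev/zero, so that every later random write lands
# on space that is already allocated.
def preallocate(path, total_bytes, block=1 << 20):
    zeros = b"\0" * block
    with open(path, "wb") as f:
        full, rest = divmod(total_bytes, block)
        for _ in range(full):
            f.write(zeros)
        f.write(zeros[:rest])
```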

Conceptually this isn't all that different from using a database, as someone suggested, but should have much less overhead.

Of course, if the desired end result is another CSV file, you'll have to write another chunk of code to convert the fixed-width record file back to CSV.

Whether millions of random seeks and a post-conversion step really provide enough of a performance gain over making multiple passes through the source file is debatable. If this is a one-time conversion effort, the multi-pass approach might be the way to go.

A variation on this idea, which eliminates copying the data to an intermediate file and tolerates widely varying cell widths, is to make one pass through the source file and generate an index for it - say, the byte offset of each row, and then the relative offset and length of each cell within it. If you're lucky, the index fits into memory; otherwise you'd again use a sparse file. Then you'd walk through the index, make random seeks into the source file, and write the cells to your target CSV file in the desired order. Not quite a single pass over the source file, but at least you'd incur the overhead of CSV parsing only once.

 -Tom
_______________________________________________
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
