Hi, I have a hard problem. I have 10114 column vectors, each in a separate file. Each file has 2 columns: the row id and a numeric value. The row ids are identical across files and in sorted order, and all the column vectors have the same number of rows (over 5 million). I need to combine them into a single table. The row ids are very long strings; the column names are about 20 characters long.
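Since every file shares the same row ids in the same sorted order, the combine step can in principle be a streaming zip rather than a join. A minimal sketch in plain Python (the tab separator and the two-column "row_id, value" layout are assumptions about your files; `merge_columns` is a hypothetical helper name):

```python
# Sketch: stream-merge pre-sorted column-vector files into one wide table
# without any join. Assumes every file has identical row ids in identical
# order, two tab-separated columns per line: row_id, value.
import csv

def merge_columns(paths, out_path, delimiter="\t"):
    """Zip N sorted two-column files into one wide row per row id."""
    files = [open(p, newline="") for p in paths]
    readers = [csv.reader(f, delimiter=delimiter) for f in files]
    try:
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out, delimiter=delimiter)
            for rows in zip(*readers):
                row_id = rows[0][0]
                # Sanity check: row ids must line up across all files.
                assert all(r[0] == row_id for r in rows)
                writer.writerow([row_id] + [r[1] for r in rows])
    finally:
        for f in files:
            f.close()
```

This is essentially what unix paste does, but scriptable and checkable, so it is reproducible across reruns; it streams row by row, so memory stays constant regardless of row count.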
My current implementation uses join. This takes a long time on a cluster with 2 workers totaling 192 vcpus and 2.8 TB of memory, and it often crashes. By crashes I mean totally dead, start over. Checkpoints do not seem to help; it still crashes and needs to be restarted from scratch. What is really surprising is that the final file is only 213 GB!

The way I originally got the file was to copy all the column vectors to a single big-iron machine and use unix cut and paste. That took about 44 minutes to run once I had all the data moved around, but it was very tedious and error prone: I had to move a lot of data around, and it is not a particularly reproducible process. I will need to rerun this three more times on different data sets of about the same size.

I noticed that Spark has a union() function that implements row bind. Any idea how it is implemented? Is it just map-reduce under the covers? My thought was:

1. Load each column vector.
2. Maybe replace the really long row id strings with integers.
3. Convert the column vectors into row vectors using pivot (i.e. matrix transpose).
4. Union all the row vectors into a single table.
5. Pivot the table back so I have the correct column vectors.

I could replace the row ids and column names with integers if needed, and restore them later.

Maybe I would be better off using many small machines? I assume memory is the limiting resource, not cpu; I notice that memory usage reaches 100%. I added several TBs of local ssd, but I am not convinced that Spark is using the local disk. Will this approach perform better than join?

* The rows before the final pivot will be very, very wide (over 5 million columns).
* There will only be 10114 rows before the pivot.

I assume the pivots will shuffle all the data. I assume pivoting the individual column vectors is trivial; the full-table pivot will be expensive, but it will only need to be done once.

Comments and suggestions appreciated,
Andy