Not a max - all values are needed. pivot() if anything is much closer, but see the rest of this thread.
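The long-to-wide reshape being debated below can be sketched in plain Python, standing in for Spark's groupBy("rowid").pivot(...) on the unioned long-format data. The column names ("fileA", "fileB") and values are invented for illustration; this is a toy sketch of the idea, not Spark code.

```python
# Long format: one (rowid, source_column, value) record per input cell,
# as produced by unioning the per-file column vectors (toy data).
long_rows = [
    ("r1", "fileA", "a"), ("r2", "fileA", "b"), ("r3", "fileA", "c"),
    ("r1", "fileB", "d"), ("r2", "fileB", "e"), ("r3", "fileB", "f"),
]

def pivot(rows):
    """Collect every value for each rowid into one wide record."""
    wide = {}
    for rowid, col, value in rows:
        wide.setdefault(rowid, {})[col] = value
    return wide

wide = pivot(long_rows)
# wide["r1"] == {"fileA": "a", "fileB": "d"}
```

Note that because all values per rowid must survive the reshape, the aggregation has to carry every column's value through, not reduce them to one.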
On Thu, Apr 21, 2022 at 1:19 AM Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Seems like an interesting problem to solve!
>
> If I have understood it correctly, you have 10114 files each with the
> structure
>
> rowid, colA
> r1, a
> r2, b
> r3, c
> .......5 million rows
>
> if you union them, you will have
>
> rowid, colA, colB
> r1, a, null
> r2, b, null
> r3, c, null
> r1, null, d
> r2, null, e
> r3, null, f
>
> Will a window partition by rowid and max on column values not work?
>
> Cheers,
> Sonal
> https://github.com/zinggAI/zingg
>
> On Thu, Apr 21, 2022 at 6:50 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Oh, Spark directly supports upserts (with the right data destination) and
>> yeah you could do this as 10000+ updates to a table without any pivoting,
>> etc. It'd still end up being 10K+ single joins along the way but
>> individual steps are simpler. It might actually be pretty efficient
>> I/O-wise, as columnar formats would not rewrite any other data on a
>> write like this.
>>
>> On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu>
>> wrote:
>>
>>> Hi Sean,
>>>
>>> My "insert" solution is a hack that might work given we can easily spin
>>> up a single VM with a crazy amount of memory. I would prefer to see a
>>> distributed solution. It is just a matter of time before someone wants
>>> to create an even bigger table using cbind.
>>>
>>> I understand you probably already know a lot about traditional RDBMSs.
>>> Much of my post is background for others.
>>>
>>> I used to do some classic relational database work before tools like
>>> Hadoop, Spark, and NoSQL became available.
>>>
>>> The standard operations on a single table in a relational database are:
>>>
>>> Insert "row". This is similar to Spark union. Typically primary keys in
>>> RDBMS tables are indexed to enable quick lookup, so insert is probably
>>> not 1-for-1 with union. The row may not simply be appended to the end
>>> of the table.
>>> Update a "row"
>>>
>>> Delete a "row"
>>>
>>> Select "rows where"
>>>
>>> RDBMS servers enable row- and table-level locking. Data must always be
>>> in a consistent state. You must commit or abort your changes for them
>>> to persist and to release locks on the data. Locks are required because
>>> you have a single resource and many users requesting service
>>> simultaneously. This is very different from Spark.
>>>
>>> Storage and memory used to be really expensive, so often people tried
>>> to create "1st normal form" schemas, i.e. no duplicate data, to reduce
>>> hardware cost. A 1st normal design requires you to use joins to get the
>>> data table you want. Joins are expensive. Often a design duplicated
>>> some data to improve performance by minimizing the number of joins
>>> required. Duplicate data makes maintaining consistency harder. There
>>> are other advantages to normalized data design and, as we are all aware,
>>> in the bigdata world lots of disadvantages. The DBMS ran on a single
>>> big machine. Join was not implemented as distributed map/reduce.
>>>
>>> So my idea is to use a traditional RDBMS server: my final table will
>>> have 5 million rows and 10,114 columns.
>>>
>>> 1. Read the column vector from each of the 10,114 data files
>>> 2. Insert the column vector as a row in the table
>>>    (each file has a single element on each line; all I need to do is
>>>    replace \n with ,)
>>> 3. Now I have a table with 10,114 rows and 5 million columns
>>> 4. The row id (primary key) is the original file name
>>> 5. The columns are the row ids in the original column vectors
>>> 6. Now all I need to do is pivot this single table to get what I want.
>>>    This is the only join or map/reduce-like operation
>>> 7. A table with 5 million rows and 10,114 columns
>>>
>>> My final table is about 220 GB. I know at Google I have quota for up to
>>> 2 mega-mem machines.
>>> Each one has something like 1.4 TB of memory.
>>>
>>> Kind regards
>>>
>>> Andy
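The seven steps Andy describes amount to a transpose: each file becomes one row of an intermediate table keyed by file name, and the final table flips that back. A toy plain-Python sketch of the idea (file names and contents are invented; the real data is 10,114 files of 5 million values each):

```python
# Toy sketch of the insert-then-pivot plan. Each input "file" becomes one
# row of an intermediate table keyed by file name; the final table is the
# transpose. File names and contents here are invented for illustration.
files = {
    "sample1.txt": "10\n20\n30",  # one value per line, as in the thread
    "sample2.txt": "40\n50\n60",
}

# Steps 1-2: read each column vector and store it as a row
# (conceptually: one file -> one comma-joined row in the RDBMS table).
intermediate = {name: text.split("\n") for name, text in files.items()}

# Step 6: pivot/transpose, so the original row ids become rows again and
# each file name becomes a column of the final table.
n_rows = len(next(iter(intermediate.values())))
final = [
    {name: vec[i] for name, vec in intermediate.items()}
    for i in range(n_rows)
]
# final[0] == {"sample1.txt": "10", "sample2.txt": "40"}
```

At the real scale this transpose is the single expensive step, which is the "only join or map/reduce like operation" the plan relies on.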