Not a max - all the values are needed. pivot(), if anything, is much closer,
but see the rest of this thread.
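
Roughly what I have in mind, as an untested sketch - it assumes the 10,114
inputs have already been unioned into a long frame called "long_df" with
columns (rowid, file_id, value), where file_id is a literal column added to
each input before the union:

    from pyspark.sql import functions as F

    # one output column per input file; first() just picks the single
    # non-null value present for each (rowid, file_id) pair
    # note: with ~10,114 distinct file_ids you may need to pass the values
    # explicitly to pivot() or raise spark.sql.pivotMaxValues
    wide = (long_df
            .groupBy("rowid")
            .pivot("file_id")
            .agg(F.first("value")))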

On Thu, Apr 21, 2022 at 1:19 AM Sonal Goyal <sonalgoy...@gmail.com> wrote:

> Seems like an interesting problem to solve!
>
> If I have understood it correctly, you have 10,114 files, each with the
> structure
>
> rowid, colA
> r1, a
> r2, b
> r3, c
> .......5 million rows
>
> If you union them, you will have:
> rowid, colA, colB
> r1, a, null
> r2, b, null
> r3, c, null
> r1, null, d
> r2, null, e
> r3, null, f
>
> Will a window partition by rowid and max on the column values not work?
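>
> Something like this, as a rough untested sketch (assuming the unioned
> frame is called "unioned" and the value columns are colA, colB, ...):
>
>     from pyspark.sql import Window, functions as F
>
>     w = Window.partitionBy("rowid")
>     value_cols = [c for c in unioned.columns if c != "rowid"]
>     # max() ignores nulls, so each row keeps the one non-null value per
>     # column; dropDuplicates collapses each partition to a single row
>     collapsed = (unioned
>                  .select("rowid",
>                          *[F.max(c).over(w).alias(c) for c in value_cols])
>                  .dropDuplicates(["rowid"]))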
>
> Cheers,
> Sonal
> https://github.com/zinggAI/zingg
>
>
>
> On Thu, Apr 21, 2022 at 6:50 AM Sean Owen <sro...@gmail.com> wrote:
>
>> Oh, Spark directly supports upserts (with the right data destination), and
>> yeah, you could do this as 10,000+ updates to a table without any pivoting,
>> etc. It'd still end up being 10K+ single joins along the way, but the
>> individual steps are simpler. It might actually be pretty efficient
>> I/O-wise, as columnar formats would not rewrite any other data on a write
>> like this.
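>>
>> As a rough, untested sketch of what that could look like with Delta Lake
>> as the destination (the paths, column names, and remaining_paths list
>> below are all hypothetical, and it assumes an existing SparkSession named
>> spark with Delta available):
>>
>>     from delta.tables import DeltaTable
>>
>>     # let merge add the new column from each source file to the target
>>     spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled",
>>                    "true")
>>
>>     # seed the target with the first file: columns rowid, col_0
>>     first = (spark.read.csv("quant_0000.csv", header=True)
>>                   .withColumnRenamed("colA", "col_0"))
>>     first.write.format("delta").save("/tmp/wide_table")
>>
>>     target = DeltaTable.forPath(spark, "/tmp/wide_table")
>>     for i, path in enumerate(remaining_paths, start=1):  # ~10,113 more
>>         col_df = (spark.read.csv(path, header=True)       # rowid, colA
>>                        .withColumnRenamed("colA", f"col_{i}"))
>>         (target.alias("t")
>>                .merge(col_df.alias("s"), "t.rowid = s.rowid")
>>                .whenMatchedUpdateAll()   # adds col_{i}, keeps other cols
>>                .execute())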
>>
>> On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu>
>> wrote:
>>
>>> Hi Sean
>>>
>>>
>>>
>>> My “insert” solution is a hack that might work, given that we can easily
>>> spin up a single VM with a crazy amount of memory. I would prefer to see a
>>> distributed solution. It is just a matter of time before someone wants to
>>> create an even bigger table using cbind.
>>>
>>>
>>>
>>> I understand you probably already know a lot about traditional RDBMSs.
>>> Much of this post is background for others.
>>>
>>>
>>>
>>> I used to do some classic relational database work before tools like
>>> Hadoop, Spark, and NoSQL became available.
>>>
>>>
>>>
>>> The standard operations on a single table in a relational database are:
>>>
>>>
>>>
>>> Insert “row”. This is similar to Spark union. Typically, primary keys in
>>> RDBMS tables are indexed to enable quick lookups, so insert is probably
>>> not one-for-one with union; the row may not simply be appended to the end
>>> of the table.
>>>
>>>
>>>
>>> Update a “row”
>>>
>>> Delete a “row”
>>>
>>> Select “rows where”
>>>
>>>
>>>
>>> RDBMS servers enable row- and table-level locking. Data must always be
>>> in a consistent state: you must commit or abort your changes for them to
>>> persist and to release locks on the data. Locks are required because you
>>> have a single resource and many users requesting service simultaneously.
>>> This is very different from Spark.
>>>
>>>
>>>
>>> Storage and memory used to be really expensive, so people often tried to
>>> create normalized (“first normal form”) schemas, i.e. with no duplicate
>>> data, to reduce hardware cost. Normalized designs require you to use
>>> joins to get the data table you want, and joins are expensive, so designs
>>> often duplicated some data to improve performance by minimizing the
>>> number of joins required; duplicate data makes maintaining consistency
>>> harder. There are other advantages to normalized data design and, as we
>>> are all aware in the big-data world, lots of disadvantages. The DBMS ran
>>> on a single big machine; join was not implemented as distributed
>>> map/reduce.
>>>
>>>
>>>
>>> So my idea is to use a traditional RDBMS server; my final table will
>>> have 5 million rows and 10,114 columns.
>>>
>>>    1. Read the column vector from each of the 10,114 data files.
>>>    2. Insert each column vector as a row in the table.
>>>       1. Each file has a single element on each line, so all I need to
>>>       do is replace \n with , (see the sketch after this list).
>>>    3. Now I have a table with 10,115 rows and 5 million columns.
>>>    4. The row id (primary key) is the original file name.
>>>    5. The columns are the row ids from the original column vectors.
>>>    6. Now all I need to do is pivot this single table to get what I
>>>    want. This is the only join or map/reduce-like operation.
>>>    7. The result is a table with 5 million rows and 10,114 columns.
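>>>
>>> A rough, untested sketch of step 2 in plain Python (the directory and
>>> file names below are made up):
>>>
>>>     import pathlib
>>>
>>>     src_dir = pathlib.Path("counts")   # the 10,114 single-column files
>>>     with open("transposed.csv", "w") as out:
>>>         for path in sorted(src_dir.glob("*.txt")):
>>>             values = path.read_text().splitlines()  # ~5 million lines
>>>             # one row per input file: the file name becomes the row id,
>>>             # the original row ids become the column positions
>>>             out.write(path.stem + "," + ",".join(values) + "\n")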
>>>
>>>
>>>
>>>
>>>
>>> My final table is about 220 GB. I know that at Google I have quota for
>>> up to 2 mega-mem machines; each one has something like 1.4 TB of memory.
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>>
