Hi, up to what size are broadcast sets a good idea?
I have a large dataset (~5 GB) and I'm only interested in the lines whose ID appears in a separate file. That file has ~10 k entries. I could either join the dataset with the ID list, or broadcast the ID list and do the filtering in a mapper.
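To make the two options concrete, here is a rough, framework-free sketch in plain Python (not actual job code). The names `records`, `id_list`, `filter_with_broadcast` and `filter_with_join` are made up for illustration, and the tiny in-memory lists only stand in for the real inputs.

```python
# Hypothetical stand-ins: `records` plays the ~5 GB dataset,
# `id_list` plays the ~10 k-entry ID file.
records = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
id_list = [2, 3, 99]


def filter_with_broadcast(records, id_list):
    """Broadcast option: every worker gets the full ID list and filters locally."""
    wanted = set(id_list)  # the small side is held in memory on each worker
    return [rec for rec in records if rec[0] in wanted]


def filter_with_join(records, id_list):
    """Join option: match the dataset against the ID list on the ID field."""
    # Group the large side by key; in a distributed join this grouping is
    # typically where the large input would be repartitioned.
    by_id = {}
    for rec in records:
        by_id.setdefault(rec[0], []).append(rec)
    # dict.fromkeys drops duplicate IDs while keeping their order.
    return [rec for wanted_id in dict.fromkeys(id_list)
            for rec in by_id.get(wanted_id, [])]
```

Both functions keep the same records. The difference I care about is which input has to be moved around or held in memory.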
Which would be the better solution given the data sizes described above? Is there a good rule of thumb for when to switch from one approach to the other?

Cheers,
Martin