> Is it always the case that one title is a substring of another?

Not always. A single title can appear as values like D.O.C, doctor_{areacode}, or doc_{dep,areacode}.
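
Since the variants look fairly regular, one option might be to reduce every title to a canonical key first and then do an ordinary equi-join and group by on that key, rather than a true fuzzy join. Below is a minimal plain-Python sketch of the idea, not Spark code. The ALIASES table and the function names are hypothetical, and it assumes the part after the first underscore (area code, department) can be dropped when matching. In Spark the same normalization could presumably be applied through a UDF or the built-in string functions to add a key column to both tables.

```python
import re
from collections import Counter

# Hypothetical alias table: maps a cleaned-up short form to its canonical title.
# In practice this would have to be built from the real data.
ALIASES = {"doc": "doctor"}


def canonical_title(raw: str) -> str:
    """Reduce a raw title value to a canonical join key."""
    # Assumption: anything after the first underscore is a qualifier
    # (area code, department) that is irrelevant for matching.
    base = raw.strip().lower().split("_", 1)[0]
    # Remove punctuation and digits, so "d.o.c" becomes "doc".
    base = re.sub(r"[^a-z]", "", base)
    return ALIASES.get(base, base)


def join_counts(titles1, titles2):
    """Inner-join two title columns on the canonical key and count rows per key.

    Returns {key: (rows_in_table1, rows_in_table2)} for keys present in both.
    """
    left = Counter(canonical_title(t) for t in titles1)
    right = Counter(canonical_title(t) for t in titles2)
    return {key: (left[key], right[key]) for key in left.keys() & right.keys()}
```

For example, join_counts(["doctor", "D.O.C", "nurse"], ["doc_cardio,415", "doctor_415"]) groups all four doctor variants under the key "doctor" and drops "nurse", which has no match in the second table. Titles that do not follow a pattern like this would still need some similarity-based matching.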

On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet <wael....@gmail.com> wrote:

> I think you need some sort of fuzzy join?
> Is it always the case that one title is a substring of another?
>
> On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh <suniti.si...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have two tables with the same schema but different data. I have to join
>> the tables on one column and then do a group by on the same column.
>>
>> The data in that column might or might not match exactly between the two
>> tables. (Example: the column name is "title", with Table1.title = "doctor"
>> and Table2.title = "doc". "doctor" and "doc" are actually the same title.)
>>
>> From a performance point of view, with data volume in the TB range, I am
>> not sure I can achieve this with a SQL statement. What would be the best
>> approach to solving this problem? Should I look at the MLlib APIs?
>>
>> Spark gurus, any pointers?
>>
>> Thanks,
>> Suniti
>
> --
>
> *Regards,*
> Wail Alkowaileet