I would stay away from transactional tables until they are fully baked. I do not see why you need to update rather than keep inserting with a timestamp and derive the latest value on the fly while joining.

But I guess it has become a religious question now :) and I am not unbiased.
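By way of illustration, a minimal spark-shell sketch of that insert-only pattern, assuming a hypothetical append-only table accounts.catalog_log with columns hashtag, transactiondescription and ts, where every change is a new timestamped row (the table name and columns are assumptions, not from the thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// hypothetical append-only log: one row per change, stamped with ts
val catalogLog = sqlContext.table("accounts.catalog_log")

// derive the latest version of each hashtag at query time instead of via UPDATE
val latestPerKey = Window.partitionBy("hashtag").orderBy(desc("ts"))
val catalogCurrent = catalogLog
  .withColumn("rn", row_number().over(latestPerKey))
  .where($"rn" === 1)
  .drop("rn")

The same "current" view can be produced on the Hive side with row_number() over (partition by hashtag order by ts desc), so both engines read the same picture of the log.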
On 3 Aug 2016 08:51, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> There are many ways of addressing this issue.
>
> Using Hbase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to cost (someone else has to know about Hbase, Phoenix etc.). BTW, I would rather work directly on the Hbase table; it is faster.
>
> There may be, say, 100 new entries into this catalog table with multiple updates (not a single DML) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
>
> If that needs to be done once a day or on demand, an alternative would be to insert overwrite from the transactional Hive table, with its deltas, into a text table in Hive and present that one to Spark. This allows Spark to see the data.
>
> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
>
> Sounds like an issue for Spark to resolve later.
>
> Another alternative: one can leave the transactional table in an RDBMS for this purpose and load it into a DF through the JDBC interface. It works fine and is pretty fast.
>
> Again, these are all workarounds. I discussed this in the Hive forum. There should be a way to manually compact a transactional table in Hive (not possible now), and, second point, if Hive can see the data in a Hive table, why not Spark?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> +1
>>
>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> If you need to use single inserts, updates, deletes, selects, why not use Hbase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of value:
>>
>> 1. transactiondescription === account_table.transactiondescription
>> 2. hashtag, a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>
>> Once the process of populating the catalog table is complete, we just need to create a new DF based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, as going through pattern matching in real time is intensive in any application and database.
>>
>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>
>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc :)
>>
>> HTH
>>
>> P.S. Also there is an issue with Spark not being able to read data through Hive transactional tables that have not been compacted yet; Spark just crashes. If these tables need to be updated regularly, say the catalog table, and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing the analytics.
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
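A rough sketch of the JDBC workaround in that P.S., for a small catalog table kept in an RDBMS; the driver URL, credentials and source table below are placeholders rather than details from the thread:

// hypothetical JDBC source holding the small, frequently-updated catalog table;
// the matching JDBC driver jar must be on the classpath
val catalogDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // placeholder URL
  .option("dbtable", "scratchpad.catalog")                 // placeholder schema.table
  .option("user", "scratchpad")
  .option("password", "xxxx")
  .load()

// expose it to Spark SQL alongside the ORC tables
catalogDF.registerTempTable("catalog")

Since the table is tiny, the read is cheap and can simply be repeated whenever the catalog changes.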
>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. that you can strip off to get the base retailer entity?
>>>
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies <http://www.nubetech.co>
>>> Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
>>> Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks.
>>>>
>>>> I believe there is some catalog of companies that I can get, store in a table and match the company name to the transactiondescription column.
>>>>
>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>
>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>
>>>>> Well, if you still want to use a window function for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>
>>>>> Maybe you can use a regex for deriving this new column. The implementation needs to depend on your data in "transactiondescription", and a regex gives you the most powerful way to handle it.
>>>>>
>>>>> This is really not a Spark question, but about how you process your logic based on the data given.
>>>>>
>>>>> Yong
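A sketch of that regex idea in spark-shell, using the ll_18740868 DataFrame from the original post below. The pattern is only illustrative - it keeps whatever precedes a trailing "CD nnnn" card marker - and would need tuning against the real descriptions:

import org.apache.spark.sql.functions._

// derive a crude "catalog" (retailer) column from transactiondescription;
// rows that do not match this illustrative pattern come back as an empty string
val withCatalog = ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("catalog", regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\s+\\d+", 1))

// use the derived column as the grouping key
withCatalog.groupBy("catalog").agg(sum($"debitamount").as("Spent")).orderBy(desc("Spent")).show(20)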
>>>>> ------------------------------
>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>> To: user @spark
>>>>> Subject: Extracting key word from a textual column
>>>>>
>>>>> Hi,
>>>>>
>>>>> Need some ideas.
>>>>>
>>>>> Summary:
>>>>>
>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) at a given retailer, so I have a better idea of where I am wasting the money.
>>>>>
>>>>> Approach:
>>>>>
>>>>> I downloaded my bank statements for a given account in CSV format, from inception till the end of July, read the data and stored it in an ORC table.
>>>>>
>>>>> I am interested in all bills that I paid using a debit card (transactiontype = "DEB"), i.e. those that come out of the account directly. Transactiontype is a three-character lookup code that I download as well.
>>>>>
>>>>> scala> ll_18740868.printSchema
>>>>> root
>>>>>  |-- transactiondate: date (nullable = true)
>>>>>  |-- transactiontype: string (nullable = true)
>>>>>  |-- sortcode: string (nullable = true)
>>>>>  |-- accountnumber: string (nullable = true)
>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>  |-- debitamount: double (nullable = true)
>>>>>  |-- creditamount: double (nullable = true)
>>>>>  |-- balance: double (nullable = true)
>>>>>
>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>
>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money at retailer XYZ and the amount:
>>>>>
>>>>> SELECT *
>>>>> FROM (
>>>>>        select transactiondate, transactiondescription, debitamount
>>>>>             , rank() over (order by transactiondate desc) AS rank
>>>>>        from accounts.ll_18740868
>>>>>        where transactiondescription like '%XYZ%'
>>>>>      ) tmp
>>>>> where rank <= 1
>>>>>
>>>>> And its equivalent using windowing in FP:
>>>>>
>>>>> import org.apache.spark.sql.expressions.Window
>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank" === 1).show
>>>>>
>>>>> +---------------+----------------------+----+
>>>>> |transactiondate|transactiondescription|rank|
>>>>> +---------------+----------------------+----+
>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>> +---------------+----------------------+----+
>>>>>
>>>>> So far so good. But if I want to find all I spent at each retailer, then it gets trickier, as a retailer appears like below in the column transactiondescription:
>>>>>
>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>> +----------------------+
>>>>> |transactiondescription|
>>>>> +----------------------+
>>>>> |  SAINSBURYS SMKT C...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> |  SAINSBURY'S SMKT ...|
>>>>> |  SAINSBURYS S/MKT ...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> +----------------------+
>>>>>
>>>>> If I look at them I know they all belong to SAINSBURYS (the food retailer).
>>>>> I have done some crude grouping and it works, somehow:
>>>>>
>>>>> // define UDF here to handle the substring
>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>> var cutoff = "CD"   // currently used in the statement
>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>> ll_18740868.where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"), sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>>>>
>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> http://talebzadehmich.wordpress.com
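For completeness, a sketch of the catalog/reference-table approach described at the top of the thread, assuming the catalog ends up in a table accounts.catalog with the two columns mentioned there (transactiondescription and hashtag); the table name itself is an assumption:

import org.apache.spark.sql.functions._

// hypothetical catalog table: one row per known transactiondescription,
// mapped to a hashtag such as "SAINSBURYS"
val catalog = sqlContext.table("accounts.catalog")

val debits = ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)

// equi-join on transactiondescription, then total spend per hashtag;
// broadcast() is reasonable here because the catalog table is small
val spentByTag = debits
  .join(broadcast(catalog), "transactiondescription")
  .groupBy("hashtag")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))

spentByTag.show(20)

Descriptions with no entry yet in the catalog simply drop out of the inner join; a left outer join instead would show what still needs tagging.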