+1
> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> If you need to do single inserts, updates, deletes and selects, why not use HBase
> with Phoenix? I see it as complementary to the Hive / warehouse offering.
>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive, ORC and transactional. That
>> table has two columns of value:
>>
>> transactiondescription === account_table.transactiondescription
>> hashtag: a String column created by a semi-automated process that derives it
>> from account_table.transactiondescription
>>
>> Once the process of populating the catalog table is complete, we just need
>> to create a new DataFrame based on a join between the catalog table and
>> account_table. The join uses the hashtag in the catalog table to look up the
>> debit column in account_table for a given hashtag. That is pretty fast, since
>> real-time pattern matching is expensive in any application or database.
>>
>> So one can build up the catalog table over time as a reference table. I am
>> sure such tables exist in the commercial world.
>>
>> Anyway, after getting the results out I know how I am wasting my money on
>> different things, especially on clothing etc :)
>>
>> HTH
>>
>> P.S. There is also an issue with Spark not being able to read data from
>> Hive transactional tables that have not been compacted yet; Spark just
>> crashes. If such tables need to be updated regularly, say the catalog table,
>> and they are pretty small, one might maintain them in an RDBMS and read them
>> once through JDBC into a DataFrame in Spark before doing the analytics.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different
>>> representations of an entity - SAINSBURY in this case - and matching them
>>> all together. How dirty is your data in the description - are there stop
>>> words like SACAT/SMKT etc. you can strip off to get the base retailer
>>> entity?
>>>
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies
>>> Reifier at Strata Hadoop World
>>> Reifier at Spark Summit 2015
>>>
>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>> Thanks.
>>>>
>>>> I believe there is some catalog of companies that I can get, store in a
>>>> table and match the company name to the transactiondescription column.
>>>>
>>>> That catalog should have sectors in it. For example, company XYZ is under
>>>> Grocers etc., which will make searching and grouping much easier.
>>>>
>>>> I believe Spark can do it, though I am generally interested in
>>>> alternative ideas.
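The catalog / reference-table idea discussed above can be prototyped outside Spark as a plain lookup: a small map from a hashtag (a substring known to identify the retailer) to a sector, scanned against each transaction description. A minimal sketch — the hashtags and sectors below are invented for illustration, not real catalog entries:

```scala
// Minimal sketch of the catalog-table idea: a reference map from a
// hashtag (substring identifying the retailer) to a sector.
// Hashtags and sectors here are invented for illustration.
object CatalogLookup {
  val catalog: Map[String, String] = Map(
    "SAINSBURY" -> "Grocers",
    "XYZ"       -> "Clothing"
  )

  // Find the first catalog hashtag contained in the description,
  // returning (hashtag, sector) if any matches.
  def classify(description: String): Option[(String, String)] =
    catalog.collectFirst {
      case (tag, sector) if description.toUpperCase.contains(tag) =>
        (tag, sector)
    }
}
```

In Spark the same lookup would become a broadcast join between the small catalog table (read once, e.g. over JDBC, as suggested in the P.S.) and the accounts table.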
>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> Well, if you still want to use window functions for your logic, then you
>>>>> need to derive a new column, like "catalog", and use it as part of the
>>>>> grouping logic.
>>>>>
>>>>> Maybe you can use a regex to derive this new column. The implementation
>>>>> depends on your data in "transactiondescription", and a regex gives you
>>>>> the most powerful way to handle it.
>>>>>
>>>>> This is really not a Spark question, but one of how you process your
>>>>> logic based on the data given.
>>>>>
>>>>> Yong
>>>>>
>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>> To: user @spark
>>>>> Subject: Extracting key word from a textual column
>>>>>
>>>>> Hi,
>>>>>
>>>>> Need some ideas.
>>>>>
>>>>> Summary:
>>>>>
>>>>> I am working on a tool to slice and dice the amount of money I have
>>>>> spent so far (meaning the whole data sample) at a given retailer, so I
>>>>> have a better idea of where I am wasting the money.
>>>>>
>>>>> Approach:
>>>>>
>>>>> I downloaded my bank statements for a given account in CSV format, from
>>>>> inception till the end of July, read the data and stored it in an ORC
>>>>> table.
>>>>>
>>>>> I am interested in all bills that I paid using a debit card
>>>>> (transactiontype = "DEB") that come out of the account directly.
>>>>> Transactiontype is a three-character lookup code that I download as
>>>>> well.
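Yong's regex suggestion can be prototyped on plain strings before wiring it into a DataFrame (e.g. via `regexp_extract` or a UDF). The pattern below — take everything before the " CD &lt;digits&gt;" card marker — is only a guess at the data's shape from the samples in this thread; the real transactiondescription format would need to be inspected first:

```scala
// Sketch of deriving a "catalog"/retailer column with a regex.
// Assumes descriptions look like "RETAILER NAME CD 1234"; the
// pattern is a guess and would need tuning to the real data.
object RetailerRegex {
  private val beforeCard = """^(.*?)\s+CD\s+\d+.*$""".r

  def retailerOf(description: String): String =
    description.trim match {
      case beforeCard(name) => name.trim
      case other            => other // no card marker: keep as-is
    }
}
```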
>>>>> scala> ll_18740868.printSchema
>>>>> root
>>>>>  |-- transactiondate: date (nullable = true)
>>>>>  |-- transactiontype: string (nullable = true)
>>>>>  |-- sortcode: string (nullable = true)
>>>>>  |-- accountnumber: string (nullable = true)
>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>  |-- debitamount: double (nullable = true)
>>>>>  |-- creditamount: double (nullable = true)
>>>>>  |-- balance: double (nullable = true)
>>>>>
>>>>> The important fields are transactiondate, transactiontype,
>>>>> transactiondescription and debitamount.
>>>>>
>>>>> Using analytic windowing functions I can do all sorts of things. For
>>>>> example, this one gives me the last time I spent money at retailer XYZ
>>>>> and the amount:
>>>>>
>>>>> SELECT *
>>>>> FROM (
>>>>>   select transactiondate, transactiondescription, debitamount
>>>>>        , rank() over (order by transactiondate desc) AS rank
>>>>>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>> ) tmp
>>>>> where rank <= 1
>>>>>
>>>>> And its equivalent using windowing in the functional API:
>>>>>
>>>>> import org.apache.spark.sql.expressions.Window
>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ"))
>>>>>   .select($"transactiondate", $"transactiondescription",
>>>>>           rank().over(wSpec).as("rank"))
>>>>>   .filter($"rank" === 1).show
>>>>>
>>>>> +---------------+----------------------+----+
>>>>> |transactiondate|transactiondescription|rank|
>>>>> +---------------+----------------------+----+
>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>> +---------------+----------------------+----+
>>>>>
>>>>> So far so good.
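The rank-over-window query above simply picks the most recent row per group. Outside Spark, the same "latest transaction per retailer" logic reduces to a `maxBy` per key, which is a cheap way to sanity-check the window query's results. The case class and sample rows below are invented for illustration:

```scala
// Sketch of the rank()-over-window logic on plain collections:
// latest transaction per retailer = maxBy(transactiondate) per key.
// The case class and sample data are invented for illustration.
case class Txn(transactiondate: String,
               transactiondescription: String,
               debitamount: Double)

object LatestTxn {
  // ISO dates (yyyy-MM-dd) sort correctly as strings.
  def latestPerRetailer(txns: Seq[Txn],
                        keyOf: String => String): Map[String, Txn] =
    txns.groupBy(t => keyOf(t.transactiondescription))
        .map { case (k, ts) => k -> ts.maxBy(_.transactiondate) }
}
```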
>>>>> But if I want to find all I spent at each retailer, it gets trickier,
>>>>> as a retailer appears like this in the transactiondescription column:
>>>>>
>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY"))
>>>>>   .select($"transactiondescription").show(5)
>>>>> +----------------------+
>>>>> |transactiondescription|
>>>>> +----------------------+
>>>>> |  SAINSBURYS SMKT C...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> |  SAINSBURY'S SMKT ...|
>>>>> |  SAINSBURYS S/MKT ...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> +----------------------+
>>>>>
>>>>> Looking at them, I know they all belong to SAINSBURYS (a food retailer).
>>>>> I have done some crude grouping and it works, somehow:
>>>>>
>>>>> // define a UDF here to handle the substring
>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
>>>>>   s.substring(start, end) }
>>>>> var cutoff = "CD" // currently used in the statement
>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription",
>>>>>   lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>> ll_18740868.where($"transactiontype" === "DEB" &&
>>>>>                   ($"transactiondescription").isNotNull)
>>>>>   .select(SubstrUDF($"transactiondescription", lit(0),
>>>>>             instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>>           sum($"debitamount").over(wSpec2).as("Spent"))
>>>>>   .distinct.orderBy($"Spent").collect.foreach(println)
>>>>>
>>>>> However, I really need to extract the "keyword" retailer name from the
>>>>> transactiondescription column, and I need some ideas about the best way
>>>>> of doing it. Is this possible in Spark?
>>>>>
>>>>> Thanks
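For the keyword-extraction question, one crude but workable approach — in the spirit of Sonal's stop-word suggestion — is to strip known noise tokens (SACAT, SMKT, S/MKT, the trailing "CD &lt;digits&gt;" card suffix) and collapse spelling variants to a single key. A sketch; the noise-token list and the SAINSBURY normalization are assumptions drawn only from the samples shown in this thread:

```scala
// Crude keyword extraction: strip known noise tokens and the card
// suffix, then collapse SAINSBURY spelling variants to one key.
// The noise-token list is an assumption based on the samples above.
object KeywordExtract {
  private val noise = Set("SACAT", "SMKT", "S/MKT")

  def retailerKey(description: String): String = {
    // drop the trailing " CD 1234" card marker, if present
    val noCard = description.toUpperCase.replaceAll("""\s+CD\s+\d+.*$""", "")
    // remove noise tokens such as SACAT / SMKT / S/MKT
    val tokens = noCard.split("""\s+""").filterNot(t => t.isEmpty || noise(t))
    tokens.mkString(" ")
      .replaceAll("""SAINSBURY'?S?""", "SAINSBURYS") // unify variants
  }
}
```

Wrapped in a UDF (or expressed with `regexp_replace`), the same function would give the derived grouping column the thread is after; a maintained catalog table remains the more robust long-term answer.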