Phoenix is becoming another standard query interface for HBase. I do not agree that using HBase directly necessarily leads to faster performance; it always depends on how you use it. While Phoenix is another component in the stack, it can make sense to use it; this has to be evaluated on a case-by-case basis. If you only want to use Hive, you can do the manual compaction as described in another answer in this thread.
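For reference, a minimal sketch of that manual compaction request in HiveQL, run from the Hive shell or beeline (the transactional catalog table name is a placeholder). Note that ALTER TABLE ... COMPACT only queues the request; the metastore's background compactor threads then perform it.

-- request a major compaction of the ACID catalog table (hypothetical name)
ALTER TABLE accounts.catalog COMPACT 'major';
-- the compaction runs asynchronously; follow its progress with
SHOW COMPACTIONS;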
In any case, I recommend that most people use a Hadoop distribution in which all these components are properly integrated and for which you can get support.

> On 03 Aug 2016, at 00:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> There are many ways of addressing this issue.
> Using HBase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to the cost (someone else has to know about HBase, Phoenix etc.). BTW, I would rather work directly on the HBase table; it is faster.
> There may be, say, 100 new entries into this catalog table, with multiple updates (not a single DML) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
> If that needs to be done once a day or on demand, an alternative would be to INSERT OVERWRITE a text table in Hive from the transactional table (with its deltas) and present that one to Spark. This allows Spark to see the data.
> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
> Sounds like an issue for Spark to resolve later.
> Another alternative: one can leave the transactional table in an RDBMS for this purpose and load it into a DataFrame through the JDBC interface. It works fine and is pretty fast.
> Again, these are all workarounds. I discussed this in the Hive forum. There should be a way to "manually compact a transactional table in Hive" (not possible now), and a second point: if Hive can see the data in the Hive table, why not Spark?
> HTH
> Dr Mich Talebzadeh
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>> +1
>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> If you need to do single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> Hi,
>>>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of interest:
>>>> transactiondescription === account_table.transactiondescription
>>>> hashtag, a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>>> Once the process of populating the catalog table is complete, we just need to create a new DataFrame based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, as pattern matching in real time is pretty intensive in any application or database.
>>>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc. :)
>>>> HTH
>>>> P.S. There is also an issue with Spark not being able to read data from Hive transactional tables that have not been compacted yet; Spark just crashes. If these tables need to be updated regularly (say, the catalog table) and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing the analytics.
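A minimal sketch of the JDBC workaround mentioned above (keeping the small catalog table in an RDBMS and reading it straight into a Spark DataFrame); the URL, driver, table name and credentials are hypothetical placeholders:

// spark-shell style; read the small catalog table over JDBC into a DataFrame
// (connection details below are placeholders, not taken from the thread)
val catalogDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // hypothetical
  .option("driver", "oracle.jdbc.OracleDriver")            // hypothetical
  .option("dbtable", "scratchpad.catalog")                 // hypothetical
  .option("user", "scratchpad")
  .option("password", "xxx")
  .load()

catalogDF.registerTempTable("catalog")   // now visible to Spark SQL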
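And a sketch of the other workaround from the same message: INSERT OVERWRITE the current contents of the transactional table into a plain text table that Spark can read. The staging and catalog table names are hypothetical; only the two columns named in the thread are assumed.

-- run in Hive/beeline; accounts.catalog and accounts.catalog_text are hypothetical names
CREATE TABLE IF NOT EXISTS accounts.catalog_text (
  transactiondescription STRING,
  hashtag                STRING
) STORED AS TEXTFILE;

INSERT OVERWRITE TABLE accounts.catalog_text
SELECT transactiondescription, hashtag FROM accounts.catalog;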
>>>> Dr Mich Talebzadeh
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>>> Hi Mich,
>>>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. you can strip off to get the base retailer entity?
>>>>> Best Regards,
>>>>> Sonal
>>>>> Founder, Nube Technologies
>>>>> Reifier at Strata Hadoop World
>>>>> Reifier at Spark Summit 2015
>>>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> Thanks.
>>>>>> I believe there is some catalog of companies that I can get, store in a table and match the company name to the transactiondescription column.
>>>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>>> Dr Mich Talebzadeh
>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> http://talebzadehmich.wordpress.com
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>> Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>>> Maybe you can use a regex for deriving this new column. The implementation depends on your data in "transactiondescription", and a regex gives you the most powerful way to handle it.
>>>>>>> This is really not a Spark question, but a question of how you process your logic based on the data given.
>>>>>>> Yong
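A minimal sketch of what Yong describes, assuming the retailer can be pulled out of transactiondescription with a regular expression; the pattern and the "catalog" column name are only illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Derive a "catalog" column with a regex (here: everything before the " CD "
// token, trimmed; the pattern is only illustrative) and make it the window
// partition key. Rows where the pattern does not match fall into an
// empty-string group.
val withCatalog = ll_18740868.withColumn("catalog",
  trim(regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\s", 1)))

val wCat = Window.partitionBy("catalog")

withCatalog
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .select($"catalog", sum($"debitamount").over(wCat).as("Spent"))
  .distinct
  .orderBy($"Spent")
  .show

This mirrors the crude substring grouping further down the thread, only with regexp_extract in place of the substring UDF.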
>>>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>>>> To: user @spark
>>>>>>> Subject: Extracting key word from a textual column
>>>>>>> Hi,
>>>>>>> Need some ideas.
>>>>>>> Summary:
>>>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) on a given retailer, so I have a better idea of where I am wasting the money.
>>>>>>> Approach:
>>>>>>> Downloaded my bank statements for a given account in CSV format, from inception till the end of July. Read the data and stored it in an ORC table.
>>>>>>> I am interested in all bills that I paid using the debit card (transactiontype = "DEB") that come out of the account directly. Transactiontype is a three-character lookup code that I download as well.
>>>>>>> scala> ll_18740868.printSchema
>>>>>>> root
>>>>>>>  |-- transactiondate: date (nullable = true)
>>>>>>>  |-- transactiontype: string (nullable = true)
>>>>>>>  |-- sortcode: string (nullable = true)
>>>>>>>  |-- accountnumber: string (nullable = true)
>>>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>>>  |-- debitamount: double (nullable = true)
>>>>>>>  |-- creditamount: double (nullable = true)
>>>>>>>  |-- balance: double (nullable = true)
>>>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money on retailer XYZ and the amount:
>>>>>>> SELECT *
>>>>>>> FROM (
>>>>>>>       select transactiondate, transactiondescription, debitamount
>>>>>>>            , rank() over (order by transactiondate desc) AS rank
>>>>>>>       from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>      ) tmp
>>>>>>> where rank <= 1
>>>>>>> And its equivalent using windowing in the functional API:
>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank" === 1).show
>>>>>>> +---------------+----------------------+----+
>>>>>>> |transactiondate|transactiondescription|rank|
>>>>>>> +---------------+----------------------+----+
>>>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>>>> +---------------+----------------------+----+
>>>>>>> So far so good. But if I want to find all I spent on each retailer, it gets trickier, as a retailer appears like this in the transactiondescription column:
>>>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>>>> +----------------------+
>>>>>>> |transactiondescription|
>>>>>>> +----------------------+
>>>>>>> |  SAINSBURYS SMKT C...|
>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>> |  SAINSBURY'S SMKT ...|
>>>>>>> |  SAINSBURYS S/MKT ...|
>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>> +----------------------+
>>>>>>> Looking at them, I know they all belong to SAINSBURY'S (the food retailer).
>>>>>>> I have done some crude grouping and it works, more or less:
>>>>>>> // define a UDF here to handle the substring
>>>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>>>> var cutoff = "CD"  // currently used in the statement
>>>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>>>> ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull).select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"), sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>>> Thanks
>>>>>>> Dr Mich Talebzadeh
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
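One hedged way to approach that final question with Spark's built-in string functions rather than a UDF: strip the trailing card suffix and the store-format stop words that Sonal mentions (SACAT, SMKT, S/MKT, the possessive 'S) with regexp_replace, so what remains is the base retailer keyword. The patterns and stop-word list below are only illustrative and would need tuning against the real descriptions.

import org.apache.spark.sql.functions._

// Sketch: normalise transactiondescription down to a base retailer keyword.
// Patterns and stop words are illustrative, not a definitive list.
val retailer =
  trim(
    regexp_replace(
      regexp_replace(
        regexp_replace(upper($"transactiondescription"), "'", ""),  // SAINSBURY'S -> SAINSBURYS
        "\\s+CD\\s+\\d+.*$", ""),                                    // drop the trailing "CD 1234 ..." card suffix
      "\\b(SACAT|SMKT|S/MKT)\\b", ""))                               // drop store-format stop words

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .groupBy(retailer.as("Retailer"))
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))
  .show(20, false)

A plain groupBy/agg is used here instead of the window-plus-distinct trick, since only one row per retailer is wanted; either form should give the same totals.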