Phoenix is becoming another standard query interface for HBase. I do not agree that using HBase directly necessarily leads to faster performance; it always depends on how you use it. While Phoenix is another component in the stack, it can make sense to use it; this has to be evaluated on a case-by-case basis. If you only want to use Hive, you can do the manual compaction as described in another answer in this thread.
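For reference, a minimal sketch of that manual compaction request in HiveQL, run from the Hive shell or beeline (the transactional catalog table name is a placeholder). Note that ALTER TABLE ... COMPACT only queues the request; the metastore's background compactor threads then perform it.

-- request a major compaction of the ACID catalog table (hypothetical name)
ALTER TABLE accounts.catalog COMPACT 'major';
-- the compaction runs asynchronously; follow its progress with
SHOW COMPACTIONS;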
In any case, I recommend that most people use a Hadoop distribution in which all these components are properly integrated and for which you can get support.

> On 03 Aug 2016, at 00:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> There are many ways of addressing this issue.
> Using HBase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to the cost (someone else has to know about HBase, Phoenix etc.). BTW, I would rather work directly on the HBase table; it is faster.
> There may be, say, 100 new entries into this catalog table, with multiple updates (not a single DML) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
> If that needs to be done once a day or on demand, an alternative would be to INSERT OVERWRITE a text table in Hive from the transactional table (with its deltas) and present that one to Spark. This allows Spark to see the data.
> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
> Sounds like an issue for Spark to resolve later.
> Another alternative: one can leave the transactional table in an RDBMS for this purpose and load it into a DataFrame through the JDBC interface. It works fine and is pretty fast.
> Again, these are all workarounds. I discussed this in the Hive forum. There should be a way to "manually compact a transactional table in Hive" (not possible now), and a second point: if Hive can see the data in the Hive table, why not Spark?
> HTH
> Dr Mich Talebzadeh
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>> +1
>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> If you need to do single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> Hi,
>>>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of interest:
>>>> transactiondescription === account_table.transactiondescription
>>>> hashtag, a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>>> Once the process of populating the catalog table is complete, we just need to create a new DataFrame based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, as pattern matching in real time is pretty intensive in any application or database.
>>>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc. :)
>>>> HTH
>>>> P.S. There is also an issue with Spark not being able to read data from Hive transactional tables that have not been compacted yet; Spark just crashes. If these tables need to be updated regularly (say, the catalog table) and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing the analytics.
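A minimal sketch of the JDBC workaround mentioned above (keeping the small catalog table in an RDBMS and reading it straight into a Spark DataFrame); the URL, driver, table name and credentials are hypothetical placeholders:

// spark-shell style; read the small catalog table over JDBC into a DataFrame
// (connection details below are placeholders, not taken from the thread)
val catalogDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // hypothetical
  .option("driver", "oracle.jdbc.OracleDriver")            // hypothetical
  .option("dbtable", "scratchpad.catalog")                 // hypothetical
  .option("user", "scratchpad")
  .option("password", "xxx")
  .load()

catalogDF.registerTempTable("catalog")   // now visible to Spark SQL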
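And a sketch of the other workaround from the same message: INSERT OVERWRITE the current contents of the transactional table into a plain text table that Spark can read. The staging and catalog table names are hypothetical; only the two columns named in the thread are assumed.

-- run in Hive/beeline; accounts.catalog and accounts.catalog_text are hypothetical names
CREATE TABLE IF NOT EXISTS accounts.catalog_text (
  transactiondescription STRING,
  hashtag                STRING
) STORED AS TEXTFILE;

INSERT OVERWRITE TABLE accounts.catalog_text
SELECT transactiondescription, hashtag FROM accounts.catalog;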
>>>> Dr Mich Talebzadeh
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> http://talebzadehmich.wordpress.com
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>>> Hi Mich,
>>>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. you can strip off to get the base retailer entity?
>>>>> Best Regards,
>>>>> Sonal
>>>>> Founder, Nube Technologies
>>>>> Reifier at Strata Hadoop World
>>>>> Reifier at Spark Summit 2015
>>>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>> Thanks.
>>>>>> I believe there is some catalog of companies that I can get, store in a table and match the company name to the transactiondescription column.
>>>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>>> Dr Mich Talebzadeh
>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> http://talebzadehmich.wordpress.com
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>> Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>>> Maybe you can use a regex for deriving this new column. The implementation depends on your data in "transactiondescription", and a regex gives you the most powerful way to handle it.
>>>>>>> This is really not a Spark question, but a question of how you process your logic based on the data given.
>>>>>>> Yong
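A minimal sketch of what Yong describes, assuming the retailer can be pulled out of transactiondescription with a regular expression; the pattern and the "catalog" column name are only illustrative:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Derive a "catalog" column with a regex (here: everything before the " CD "
// token, trimmed; the pattern is only illustrative) and make it the window
// partition key. Rows where the pattern does not match fall into an
// empty-string group.
val withCatalog = ll_18740868.withColumn("catalog",
  trim(regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\s", 1)))

val wCat = Window.partitionBy("catalog")

withCatalog
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .select($"catalog", sum($"debitamount").over(wCat).as("Spent"))
  .distinct
  .orderBy($"Spent")
  .show

This mirrors the crude substring grouping further down the thread, only with regexp_extract in place of the substring UDF.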
>>>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>>>> To: user @spark
>>>>>>> Subject: Extracting key word from a textual column
>>>>>>> Hi,
>>>>>>> Need some ideas.
>>>>>>> Summary:
>>>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) on a given retailer, so I have a better idea of where I am wasting the money.
>>>>>>> Approach:
>>>>>>> Downloaded my bank statements for a given account in CSV format, from inception till the end of July. Read the data and stored it in an ORC table.
>>>>>>> I am interested in all bills that I paid using the debit card (transactiontype = "DEB") that come out of the account directly. Transactiontype is a three-character lookup code that I download as well.
>>>>>>> scala> ll_18740868.printSchema
>>>>>>> root
>>>>>>>  |-- transactiondate: date (nullable = true)
>>>>>>>  |-- transactiontype: string (nullable = true)
>>>>>>>  |-- sortcode: string (nullable = true)
>>>>>>>  |-- accountnumber: string (nullable = true)
>>>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>>>  |-- debitamount: double (nullable = true)
>>>>>>>  |-- creditamount: double (nullable = true)
>>>>>>>  |-- balance: double (nullable = true)
>>>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money on retailer XYZ and the amount:
>>>>>>> SELECT *
>>>>>>> FROM (
>>>>>>>       select transactiondate, transactiondescription, debitamount
>>>>>>>            , rank() over (order by transactiondate desc) AS rank
>>>>>>>       from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>      ) tmp
>>>>>>> where rank <= 1
>>>>>>> And its equivalent using windowing in the functional API:
>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank" === 1).show
>>>>>>> +---------------+----------------------+----+
>>>>>>> |transactiondate|transactiondescription|rank|
>>>>>>> +---------------+----------------------+----+
>>>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>>>> +---------------+----------------------+----+
>>>>>>> So far so good. But if I want to find all I spent on each retailer, it gets trickier, as a retailer appears like this in the transactiondescription column:
>>>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>>>> +----------------------+
>>>>>>> |transactiondescription|
>>>>>>> +----------------------+
>>>>>>> |  SAINSBURYS SMKT C...|
>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>> |  SAINSBURY'S SMKT ...|
>>>>>>> |  SAINSBURYS S/MKT ...|
>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>> +----------------------+
>>>>>>> Looking at them, I know they all belong to SAINSBURY'S (the food retailer).
>>>>>>> I have done some crude grouping and it works, more or less:
>>>>>>> // define a UDF here to handle the substring
>>>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>>>> var cutoff = "CD"  // currently used in the statement
>>>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>>>> ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull).select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"), sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>>> Thanks
>>>>>>> Dr Mich Talebzadeh
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
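One hedged way to approach that final question with Spark's built-in string functions rather than a UDF: strip the trailing card suffix and the store-format stop words that Sonal mentions (SACAT, SMKT, S/MKT, the possessive 'S) with regexp_replace, so what remains is the base retailer keyword. The patterns and stop-word list below are only illustrative and would need tuning against the real descriptions.

import org.apache.spark.sql.functions._

// Sketch: normalise transactiondescription down to a base retailer keyword.
// Patterns and stop words are illustrative, not a definitive list.
val retailer =
  trim(
    regexp_replace(
      regexp_replace(
        regexp_replace(upper($"transactiondescription"), "'", ""),  // SAINSBURY'S -> SAINSBURYS
        "\\s+CD\\s+\\d+.*$", ""),                                    // drop the trailing "CD 1234 ..." card suffix
      "\\b(SACAT|SMKT|S/MKT)\\b", ""))                               // drop store-format stop words

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .groupBy(retailer.as("Retailer"))
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))
  .show(20, false)

A plain groupBy/agg is used here instead of the window-plus-distinct trick, since only one row per retailer is wanted; either form should give the same totals.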