I agree with you.

> On 03 Aug 2016, at 01:20, ayan guha <guha.a...@gmail.com> wrote:
> 
> I would stay away from transactional tables until they are fully baked. I do 
> not see why you need updates rather than simply inserting new rows with a 
> timestamp and deriving the latest value on the fly at join time.
> 
> But I guess it has become a religious question now :) and I am not unbiased.
> 
>> On 3 Aug 2016 08:51, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>> There are many ways of addressing this issue.
>> 
>> Using HBase with Phoenix adds another layer to the stack that is not 
>> necessary for a handful of tables, and it will add to the cost: someone else 
>> has to know about HBase, Phoenix, etc. (BTW, I would rather work directly on 
>> the HBase table; it is faster.)
>> 
>> There may be, say, 100 new entries into this catalog table, with multiple 
>> updates (not a single DML statement) to get the hashtag right. Sometimes it 
>> is an iterative process, which results in many deltas.
>> 
>> If that needs to be done once a day or on demand, an alternative would be to 
>> INSERT OVERWRITE the deltas from the transactional Hive table into a plain 
>> text table in Hive and present that one to Spark. This allows Spark to see 
>> the data.
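>> 
>> A rough sketch of that workaround (the table names here are only 
>> illustrative): the flattening is done on the Hive side, and Spark then reads 
>> the plain copy.
>> 
>> // After flattening the transactional table in Hive, e.g. with
>> //   INSERT OVERWRITE TABLE accounts.catalog_plain
>> //   SELECT * FROM accounts.catalog_trans;
>> // Spark can read the non-transactional snapshot directly:
>> val catalogDF = sqlContext.table("accounts.catalog_plain")
>> catalogDF.count   // sanity check that the snapshot is visible to Spark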
>> 
>> Remember, if I use Hive to do the analytics/windowing there is no issue. The 
>> issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
>> 
>> Sounds like an issue for Spark to resolve later.
>> 
>> As another alternative, one can leave the transactional table in an RDBMS 
>> for this purpose and load it into a DataFrame through the JDBC interface. It 
>> works fine and is pretty fast.
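>> 
>> A minimal sketch of that JDBC route (the URL, table name and credentials 
>> below are placeholders for whatever RDBMS holds the table):
>> 
>> val catalogDF = sqlContext.read
>>   .format("jdbc")
>>   .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // placeholder URL
>>   .option("dbtable", "scratchpad.catalog")                 // placeholder table
>>   .option("user", "scratchpad")
>>   .option("password", "xxxx")
>>   .load()
>> catalogDF.registerTempTable("catalog")   // now queryable from Spark SQL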
>> 
>> Again, these are all workarounds. I discussed this in the Hive forum. There 
>> should be a way to manually compact a transactional table in Hive (not 
>> possible now), and the second point is: if Hive can see the data in the Hive 
>> table, why can't Spark?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>>> +1
>>> 
>>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> 
>>>> If you need single inserts, updates, deletes and selects, why not use 
>>>> HBase with Phoenix? I see it as complementary to the Hive / warehouse 
>>>> offering.
>>>> 
>>>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I decided to create a catalog table in Hive, stored as ORC and 
>>>>> transactional. That table has two columns of value:
>>>>> 
>>>>> transactiondescription === account_table.transactiondescription
>>>>> hashtag: a String column created from a semi-automated process of deriving 
>>>>> it from account_table.transactiondescription
>>>>> Once the catalog table has been populated, we just need to create a new 
>>>>> DF based on a join between the catalog table and the account_table. The 
>>>>> join will use the hashtag in the catalog table to loop over the debit 
>>>>> column in account_table for a given hashtag. That is pretty fast, since 
>>>>> doing the pattern matching itself in real time is quite intensive in any 
>>>>> application or database.
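>>>>> 
>>>>> A minimal sketch of that join (the table/column names follow this thread; 
>>>>> the exact matching condition is an assumption):
>>>>> 
>>>>> import org.apache.spark.sql.functions._
>>>>> val catalog  = sqlContext.table("accounts.catalog")   // transactiondescription, hashtag
>>>>> val accounts = sqlContext.table("accounts.ll_18740868")
>>>>> val spendByTag = accounts
>>>>>   .join(catalog, Seq("transactiondescription"))       // equi-join on the raw description
>>>>>   .where($"transactiontype" === "DEB")
>>>>>   .groupBy($"hashtag")
>>>>>   .agg(sum($"debitamount").as("Spent"))
>>>>>   .orderBy($"Spent".desc)
>>>>> spendByTag.show(20)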
>>>>> 
>>>>> So one can build up the catalog table over time as a reference table. I 
>>>>> am sure such tables exist in the commercial world.
>>>>> 
>>>>> Anyway, after getting the results out I know how I am wasting my money on 
>>>>> different things, especially on clothing etc. :)
>>>>> 
>>>>> 
>>>>> HTH
>>>>> 
>>>>> P.S. There is also an issue with Spark not being able to read data from 
>>>>> Hive transactional tables that have not been compacted yet; Spark just 
>>>>> crashes. If such tables need to be updated regularly (say, the catalog 
>>>>> table) and they are pretty small, one might maintain them in an RDBMS and 
>>>>> read them once through JDBC into a DataFrame in Spark before doing the 
>>>>> analytics.
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> 
>>>>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>>>>> Hi Mich,
>>>>>> 
>>>>>> It seems like an entity resolution problem: looking at different 
>>>>>> representations of an entity (SAINSBURY in this case) and matching them 
>>>>>> all together. How dirty is your data in the description column? Are there 
>>>>>> stop words like SACAT/SMKT etc. that you can strip off to get the base 
>>>>>> retailer entity?
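>>>>>> 
>>>>>> A rough illustration of that stripping idea (the stop-word list and the 
>>>>>> regex are only assumptions, not a tested rule set):
>>>>>> 
>>>>>> import org.apache.spark.sql.functions._
>>>>>> // remove a few known noise tokens plus digits/apostrophes and keep the
>>>>>> // rest as a candidate base retailer key
>>>>>> val noise = "\\b(SACAT|SMKT|S/MKT|CD)\\b|['0-9]"
>>>>>> val withBase = ll_18740868.withColumn("retailer_base",
>>>>>>   trim(regexp_replace(upper($"transactiondescription"), noise, "")))
>>>>>> withBase.select($"transactiondescription", $"retailer_base").show(5, false)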
>>>>>> 
>>>>>> Best Regards,
>>>>>> Sonal
>>>>>> Founder, Nube Technologies 
>>>>>> Reifier at Strata Hadoop World
>>>>>> Reifier at Spark Summit 2015
>>>>>> 
>>>>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> I believe there is some catalog of companies that I can get and store 
>>>>>>> in a table, and match the company name to the transactiondescription 
>>>>>>> column.
>>>>>>> 
>>>>>>> That catalog should have sectors in it. For example, company XYZ is 
>>>>>>> under Grocers etc., which will make search and grouping much easier.
>>>>>>> 
>>>>>>> I believe Spark can do it, though I am generally interested in 
>>>>>>> alternative ideas.
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> 
>>>>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>>>>> Well, if you still want to use window functions for your logic, then 
>>>>>>>> you need to derive a new column, like "catalog", and use it as part of 
>>>>>>>> the grouping logic.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Maybe you can use a regex to derive this new column. The 
>>>>>>>> implementation depends on your data in "transactiondescription", and a 
>>>>>>>> regex gives you the most powerful way to handle it.
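>>>>>>>> 
>>>>>>>> One possible shape of that idea, purely as a sketch (the pattern below 
>>>>>>>> is only an example, keyed off the " CD " marker visible in the sample 
>>>>>>>> data, not a tested rule):
>>>>>>>> 
>>>>>>>> import org.apache.spark.sql.functions._
>>>>>>>> // everything before the "CD" marker becomes the derived "catalog" column
>>>>>>>> val withCatalog = ll_18740868.withColumn("catalog",
>>>>>>>>   trim(regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\s+", 1)))
>>>>>>>> withCatalog.where($"transactiontype" === "DEB")
>>>>>>>>            .groupBy($"catalog")
>>>>>>>>            .agg(sum($"debitamount").as("Spent"))
>>>>>>>>            .orderBy($"Spent".desc)
>>>>>>>>            .show()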
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This is really not a Spark question, but a question of how you process 
>>>>>>>> your logic based on the data given.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yong
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  
>>>>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>>>>> To: user @spark
>>>>>>>> Subject: Extracting key word from a textual column
>>>>>>>>  
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Need some ideas.
>>>>>>>> 
>>>>>>>> Summary:
>>>>>>>> 
>>>>>>>> I am working on a tool to slice and dice the amount of money I have 
>>>>>>>> spent so far (meaning over the whole data sample) at a given retailer, 
>>>>>>>> so I have a better idea of where I am wasting the money.
>>>>>>>> 
>>>>>>>> Approach
>>>>>>>> 
>>>>>>>> I downloaded my bank statements for a given account in CSV format, 
>>>>>>>> from inception until the end of July, read the data, and stored it in 
>>>>>>>> an ORC table.
>>>>>>>> 
>>>>>>>> I am interested in all bills that I paid using a debit card 
>>>>>>>> (transactiontype = "DEB"), which come out of the account directly. 
>>>>>>>> Transactiontype is a three-character lookup code that I downloaded as 
>>>>>>>> well.
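>>>>>>>> 
>>>>>>>> A rough sketch of that load path (the file path and table name are 
>>>>>>>> illustrative; it assumes Spark 2.0's built-in csv source, or the 
>>>>>>>> spark-csv package on 1.x):
>>>>>>>> 
>>>>>>>> val stmt = sqlContext.read
>>>>>>>>   .format("csv")              // "com.databricks.spark.csv" on Spark 1.x
>>>>>>>>   .option("header", "true")
>>>>>>>>   .option("inferSchema", "true")
>>>>>>>>   .load("hdfs:///data/bank/statements.csv")
>>>>>>>> stmt.write.format("orc").saveAsTable("accounts.ll_18740868")
>>>>>>>> // debit-card payments only
>>>>>>>> val debits = sqlContext.table("accounts.ll_18740868")
>>>>>>>>   .where($"transactiontype" === "DEB")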
>>>>>>>> 
>>>>>>>> scala> ll_18740868.printSchema
>>>>>>>> root
>>>>>>>>  |-- transactiondate: date (nullable = true)
>>>>>>>>  |-- transactiontype: string (nullable = true)
>>>>>>>>  |-- sortcode: string (nullable = true)
>>>>>>>>  |-- accountnumber: string (nullable = true)
>>>>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>>>>  |-- debitamount: double (nullable = true)
>>>>>>>>  |-- creditamount: double (nullable = true)
>>>>>>>>  |-- balance: double (nullable = true)
>>>>>>>> 
>>>>>>>> The important fields are transactiondate, transactiontype, 
>>>>>>>> transactiondescription and debitamount
>>>>>>>> 
>>>>>>>> So using analytics/windowing I can do all sorts of things. For 
>>>>>>>> example, this one gives me the last time I spent money at retailer XYZ 
>>>>>>>> and the amount:
>>>>>>>> 
>>>>>>>> SELECT *
>>>>>>>> FROM (
>>>>>>>>       select transactiondate, transactiondescription, debitamount
>>>>>>>>       , rank() over (order by transactiondate desc) AS rank
>>>>>>>>       from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>>>>>      ) tmp
>>>>>>>> where rank <= 1
>>>>>>>> 
>>>>>>>> And its equivalent using Windowing in FP
>>>>>>>> 
>>>>>>>> import org.apache.spark.sql.expressions.Window
>>>>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ"))
>>>>>>>>            .select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank"))
>>>>>>>>            .filter($"rank" === 1)
>>>>>>>>            .show
>>>>>>>> 
>>>>>>>> 
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |transactiondate|transactiondescription|rank|
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> |     2015-12-15|  XYZ LTD CD 4636 |   1|
>>>>>>>> +---------------+----------------------+----+
>>>>>>>> 
>>>>>>>> So far so good. But if I want to find everything I spent at each 
>>>>>>>> retailer, it gets trickier, because a retailer appears like this in the 
>>>>>>>> transactiondescription column:
>>>>>>>> 
>>>>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>>>>> +----------------------+
>>>>>>>> |transactiondescription|
>>>>>>>> +----------------------+
>>>>>>>> |  SAINSBURYS SMKT C...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> |  SAINSBURY'S SMKT ...|
>>>>>>>> |  SAINSBURYS S/MKT ...|
>>>>>>>> |  SACAT SAINSBURYS ...|
>>>>>>>> +----------------------+
>>>>>>>> 
>>>>>>>> If I look at them I know they all belong to SAINSBURYS (a food 
>>>>>>>> retailer). I have done some crude grouping and it works, somewhat:
>>>>>>>> 
>>>>>>>> // define UDF here to handle substring
>>>>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>>>>> var cutoff = "CD"  // currently used in the statement
>>>>>>>> val wSpec2 = Window.partitionBy(
>>>>>>>>   SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>>>>> ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
>>>>>>>>            .select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>>>>>                    sum($"debitamount").over(wSpec2).as("Spent"))
>>>>>>>>            .distinct
>>>>>>>>            .orderBy($"Spent")
>>>>>>>>            .collect
>>>>>>>>            .foreach(println)
>>>>>>>> 
>>>>>>>> However, I really need to extract the "keyword" retailer name from the 
>>>>>>>> transactiondescription column, and I need some ideas about the best way 
>>>>>>>> of doing it. Is this possible in Spark?
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Dr Mich Talebzadeh
