+1
> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> If you need to do single inserts, updates, deletes and selects, why not use HBase
> with Phoenix? I see it as complementary to the Hive / warehouse offering.
>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive, ORC and transactional. That
>> table has two columns of value:
>>
>> transactiondescription === account_table.transactiondescription
>> hashtag: a String column created by a semi-automated process that derives it
>> from account_table.transactiondescription
>>
>> Once the process of populating the catalog table is complete, we just need
>> to create a new DataFrame based on a join between the catalog table and
>> account_table. The join uses the hashtag in the catalog table to look up the
>> debit column in account_table for a given hashtag. That is pretty fast, since
>> real-time pattern matching is expensive in any application or database.
>>
>> So one can build up the catalog table over time as a reference table. I am
>> sure such tables exist in the commercial world.
>>
>> Anyway, after getting the results out I know how I am wasting my money on
>> different things, especially on clothing etc :)
>>
>> HTH
>>
>> P.S. There is also an issue with Spark not being able to read data from
>> Hive transactional tables that have not been compacted yet; Spark just
>> crashes. If such tables need to be updated regularly, say the catalog table,
>> and they are pretty small, one might maintain them in an RDBMS and read them
>> once through JDBC into a DataFrame in Spark before doing the analytics.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different
>>> representations of an entity - SAINSBURY in this case - and matching them
>>> all together. How dirty is your data in the description - are there stop
>>> words like SACAT/SMKT etc. you can strip off to get the base retailer
>>> entity?
>>>
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies
>>> Reifier at Strata Hadoop World
>>> Reifier at Spark Summit 2015
>>>
>>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>> Thanks.
>>>>
>>>> I believe there is some catalog of companies that I can get, store in a
>>>> table and match the company name to the transactiondescription column.
>>>>
>>>> That catalog should have sectors in it. For example, company XYZ is under
>>>> Grocers etc., which will make searching and grouping much easier.
>>>>
>>>> I believe Spark can do it, though I am generally interested in
>>>> alternative ideas.
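The catalog / reference-table idea discussed above can be prototyped outside Spark as a plain lookup: a small map from a hashtag (a substring known to identify the retailer) to a sector, scanned against each transaction description. A minimal sketch — the hashtags and sectors below are invented for illustration, not real catalog entries:

```scala
// Minimal sketch of the catalog-table idea: a reference map from a
// hashtag (substring identifying the retailer) to a sector.
// Hashtags and sectors here are invented for illustration.
object CatalogLookup {
  val catalog: Map[String, String] = Map(
    "SAINSBURY" -> "Grocers",
    "XYZ"       -> "Clothing"
  )

  // Find the first catalog hashtag contained in the description,
  // returning (hashtag, sector) if any matches.
  def classify(description: String): Option[(String, String)] =
    catalog.collectFirst {
      case (tag, sector) if description.toUpperCase.contains(tag) =>
        (tag, sector)
    }
}
```

In Spark the same lookup would become a broadcast join between the small catalog table (read once, e.g. over JDBC, as suggested in the P.S.) and the accounts table.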
>>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> Well, if you still want to use window functions for your logic, then you
>>>>> need to derive a new column, like "catalog", and use it as part of the
>>>>> grouping logic.
>>>>>
>>>>> Maybe you can use a regex to derive this new column. The implementation
>>>>> depends on your data in "transactiondescription", and a regex gives you
>>>>> the most powerful way to handle it.
>>>>>
>>>>> This is really not a Spark question, but one of how you process your
>>>>> logic based on the data given.
>>>>>
>>>>> Yong
>>>>>
>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>> To: user @spark
>>>>> Subject: Extracting key word from a textual column
>>>>>
>>>>> Hi,
>>>>>
>>>>> Need some ideas.
>>>>>
>>>>> Summary:
>>>>>
>>>>> I am working on a tool to slice and dice the amount of money I have
>>>>> spent so far (meaning the whole data sample) at a given retailer, so I
>>>>> have a better idea of where I am wasting the money.
>>>>>
>>>>> Approach:
>>>>>
>>>>> I downloaded my bank statements for a given account in CSV format, from
>>>>> inception till the end of July, read the data and stored it in an ORC
>>>>> table.
>>>>>
>>>>> I am interested in all bills that I paid using a debit card
>>>>> (transactiontype = "DEB") that come out of the account directly.
>>>>> Transactiontype is a three-character lookup code that I download as
>>>>> well.
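Yong's regex suggestion can be prototyped on plain strings before wiring it into a DataFrame (e.g. via `regexp_extract` or a UDF). The pattern below — take everything before the " CD &lt;digits&gt;" card marker — is only a guess at the data's shape from the samples in this thread; the real transactiondescription format would need to be inspected first:

```scala
// Sketch of deriving a "catalog"/retailer column with a regex.
// Assumes descriptions look like "RETAILER NAME CD 1234"; the
// pattern is a guess and would need tuning to the real data.
object RetailerRegex {
  private val beforeCard = """^(.*?)\s+CD\s+\d+.*$""".r

  def retailerOf(description: String): String =
    description.trim match {
      case beforeCard(name) => name.trim
      case other            => other // no card marker: keep as-is
    }
}
```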
>>>>> scala> ll_18740868.printSchema
>>>>> root
>>>>>  |-- transactiondate: date (nullable = true)
>>>>>  |-- transactiontype: string (nullable = true)
>>>>>  |-- sortcode: string (nullable = true)
>>>>>  |-- accountnumber: string (nullable = true)
>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>  |-- debitamount: double (nullable = true)
>>>>>  |-- creditamount: double (nullable = true)
>>>>>  |-- balance: double (nullable = true)
>>>>>
>>>>> The important fields are transactiondate, transactiontype,
>>>>> transactiondescription and debitamount.
>>>>>
>>>>> Using analytic windowing functions I can do all sorts of things. For
>>>>> example, this one gives me the last time I spent money at retailer XYZ
>>>>> and the amount:
>>>>>
>>>>> SELECT *
>>>>> FROM (
>>>>>   select transactiondate, transactiondescription, debitamount
>>>>>        , rank() over (order by transactiondate desc) AS rank
>>>>>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>>>> ) tmp
>>>>> where rank <= 1
>>>>>
>>>>> And its equivalent using windowing in the functional API:
>>>>>
>>>>> import org.apache.spark.sql.expressions.Window
>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ"))
>>>>>   .select($"transactiondate", $"transactiondescription",
>>>>>           rank().over(wSpec).as("rank"))
>>>>>   .filter($"rank" === 1).show
>>>>>
>>>>> +---------------+----------------------+----+
>>>>> |transactiondate|transactiondescription|rank|
>>>>> +---------------+----------------------+----+
>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>> +---------------+----------------------+----+
>>>>>
>>>>> So far so good.
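The rank-over-window query above simply picks the most recent row per group. Outside Spark, the same "latest transaction per retailer" logic reduces to a `maxBy` per key, which is a cheap way to sanity-check the window query's results. The case class and sample rows below are invented for illustration:

```scala
// Sketch of the rank()-over-window logic on plain collections:
// latest transaction per retailer = maxBy(transactiondate) per key.
// The case class and sample data are invented for illustration.
case class Txn(transactiondate: String,
               transactiondescription: String,
               debitamount: Double)

object LatestTxn {
  // ISO dates (yyyy-MM-dd) sort correctly as strings.
  def latestPerRetailer(txns: Seq[Txn],
                        keyOf: String => String): Map[String, Txn] =
    txns.groupBy(t => keyOf(t.transactiondescription))
        .map { case (k, ts) => k -> ts.maxBy(_.transactiondate) }
}
```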
>>>>> But if I want to find all I spent at each retailer, it gets trickier,
>>>>> as a retailer appears like this in the transactiondescription column:
>>>>>
>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY"))
>>>>>   .select($"transactiondescription").show(5)
>>>>> +----------------------+
>>>>> |transactiondescription|
>>>>> +----------------------+
>>>>> |  SAINSBURYS SMKT C...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> |  SAINSBURY'S SMKT ...|
>>>>> |  SAINSBURYS S/MKT ...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> +----------------------+
>>>>>
>>>>> Looking at them, I know they all belong to SAINSBURYS (a food retailer).
>>>>> I have done some crude grouping and it works, somehow:
>>>>>
>>>>> // define a UDF here to handle the substring
>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
>>>>>   s.substring(start, end) }
>>>>> var cutoff = "CD" // currently used in the statement
>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription",
>>>>>   lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>> ll_18740868.where($"transactiontype" === "DEB" &&
>>>>>                   ($"transactiondescription").isNotNull)
>>>>>   .select(SubstrUDF($"transactiondescription", lit(0),
>>>>>             instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>>           sum($"debitamount").over(wSpec2).as("Spent"))
>>>>>   .distinct.orderBy($"Spent").collect.foreach(println)
>>>>>
>>>>> However, I really need to extract the "keyword" retailer name from the
>>>>> transactiondescription column, and I need some ideas about the best way
>>>>> of doing it. Is this possible in Spark?
>>>>>
>>>>> Thanks
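For the keyword-extraction question, one crude but workable approach — in the spirit of Sonal's stop-word suggestion — is to strip known noise tokens (SACAT, SMKT, S/MKT, the trailing "CD &lt;digits&gt;" card suffix) and collapse spelling variants to a single key. A sketch; the noise-token list and the SAINSBURY normalization are assumptions drawn only from the samples shown in this thread:

```scala
// Crude keyword extraction: strip known noise tokens and the card
// suffix, then collapse SAINSBURY spelling variants to one key.
// The noise-token list is an assumption based on the samples above.
object KeywordExtract {
  private val noise = Set("SACAT", "SMKT", "S/MKT")

  def retailerKey(description: String): String = {
    // drop the trailing " CD 1234" card marker, if present
    val noCard = description.toUpperCase.replaceAll("""\s+CD\s+\d+.*$""", "")
    // remove noise tokens such as SACAT / SMKT / S/MKT
    val tokens = noCard.split("""\s+""").filterNot(t => t.isEmpty || noise(t))
    tokens.mkString(" ")
      .replaceAll("""SAINSBURY'?S?""", "SAINSBURYS") // unify variants
  }
}
```

Wrapped in a UDF (or expressed with `regexp_replace`), the same function would give the derived grouping column the thread is after; a maintained catalog table remains the more robust long-term answer.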