Re: Extracting key word from a textual column

2016-08-03 Thread Mich Talebzadeh
Guys,

We are going off on a tangent here. The question was: what is the easiest way
of finding key words in a string column such as transactiondescription?

I am aware of functions like regexp, instr, patindex etc.

How is this done in general, not necessarily in Spark?

For example, a naive question.

When I type "regexp in Spark" into Google it comes back with many URLs.
Granted, some of them may be repeat URLs. How does it do that in general? Is
it through hashtag/keyword search, or by building an index of tags in its
massive library for later use?

I gather that if the same keyword is hit many times, its ranking gets higher, etc.
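
As a concrete illustration of "how this is done in general": one crude but
effective technique is simple token frequency over the column - split the
descriptions into tokens, drop obvious noise, and count. A minimal sketch in
the Spark shell, reusing the ll_18740868 DataFrame from later in this thread
(the noise-word list is an assumption, not something derived from the data):

import org.apache.spark.sql.functions._

// Tokenise transactiondescription and count token frequencies; the most
// frequent surviving tokens are usually the retailer key words.
val tokens = ll_18740868
  .select(explode(split(upper($"transactiondescription"), "[^A-Z]+")).as("token"))
  .filter(length($"token") > 2)                     // drop short fragments such as "CD"
  .filter(!$"token".isin("LTD", "SMKT", "SACAT"))   // assumed noise words; extend as needed

tokens.groupBy("token").count().orderBy(desc("count")).show(20)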

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 August 2016 at 06:32, Jörn Franke  wrote:

> I agree with you.
>
> On 03 Aug 2016, at 01:20, ayan guha  wrote:
>
> I would stay away from transaction tables until they are fully baked. I do
> not see why you need to update vs keep inserting with timestamp and while
> joining derive latest value on the fly.
>
> But I guess it has become a religious question now :) and I am not
> unbiased.
> On 3 Aug 2016 08:51, "Mich Talebzadeh"  wrote:
>
>> There are many ways of addressing this issue.
>>
>> Using HBase with Phoenix adds another layer to the stack which is not
>> necessary for a handful of tables and will add to cost (someone else has to
>> know about HBase, Phoenix etc.). BTW, I would rather work directly on the
>> HBase table; it is faster.
>>
>> There may be, say, 100 new entries into this catalog table, with multiple
>> updates (not a single DML statement) to get the hashtag right. Sometimes it
>> is an iterative process which results in many deltas.
>>
>> If that only needs to be done once a day or on demand, an alternative would
>> be to INSERT OVERWRITE the deltas from the transactional Hive table into a
>> text table in Hive and present that one to Spark. This allows Spark to see
>> the data.
>>
>> Remember, if I use Hive to do the analytics/windowing, there is no issue.
>> The issue is with Spark: neither Spark SQL nor the Spark shell can use that
>> table.
>>
>> Sounds like an issue for Spark to resolve later.
>>
>> Another alternative: one can leave the transactional table in an RDBMS for
>> this purpose and load it into a DF through the JDBC interface. It works
>> fine and is pretty fast.
>>
>> Again, these are all workarounds. I discussed this in the Hive forum. There
>> should be a way to "manually compact a transactional table in Hive" (not
>> possible now), and the second point is: if Hive can see the data in the
>> Hive table, why not Spark?
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 23:10, Ted Yu  wrote:
>>
>>> +1
>>>
>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
>>>
>>> If you need to use single inserts, updates, deletes, select why not use
>>> hbase with Phoenix? I see it as complementary to the hive / warehouse
>>> offering
>>>
>>> On 02 Aug 2016, at 22:34, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I decided to create a catalog table in Hive ORC and transactional. That
>>> table has two columns of value
>>>
>>>
>>>1. transactiondescription === account_table.transactiondescription
>>>2. hashtag String column created from a semi automated process of
>>>deriving it from account_table.transactiondescription
>>>
>>> Once the catalog table has been populated, we just need to create a new DF
>>> based on a join between the catalog table and the account_table. The join
>>> will use the hashtag in the catalog table to loop over the debit column in
>>> account_table for a given hashtag. That is pretty fast, as going through
>>> pattern matching in real time is pretty intensive in any application or
>>> database.
>>>
>>> So one can build up the catalog table over time as a reference table. I
>>> am sure such 

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke
I agree with you.

> On 03 Aug 2016, at 01:20, ayan guha  wrote:
> 
> I would stay away from transaction tables until they are fully baked. I do 
> not see why you need to update vs keep inserting with timestamp and while 
> joining derive latest value on the fly.
> 
> But I guess it has become a religious question now :) and I am not unbiased.
> 
>> On 3 Aug 2016 08:51, "Mich Talebzadeh"  wrote:
>> There are many ways of addressing this issue.
>> 
>> Using HBase with Phoenix adds another layer to the stack which is not
>> necessary for a handful of tables and will add to cost (someone else has to
>> know about HBase, Phoenix etc.). BTW, I would rather work directly on the
>> HBase table; it is faster.
>>
>> There may be, say, 100 new entries into this catalog table, with multiple
>> updates (not a single DML statement) to get the hashtag right. Sometimes it
>> is an iterative process which results in many deltas.
>>
>> If that only needs to be done once a day or on demand, an alternative would
>> be to INSERT OVERWRITE the deltas from the transactional Hive table into a
>> text table in Hive and present that one to Spark. This allows Spark to see
>> the data.
>>
>> Remember, if I use Hive to do the analytics/windowing, there is no issue.
>> The issue is with Spark: neither Spark SQL nor the Spark shell can use that
>> table.
>>
>> Sounds like an issue for Spark to resolve later.
>>
>> Another alternative: one can leave the transactional table in an RDBMS for
>> this purpose and load it into a DF through the JDBC interface. It works
>> fine and is pretty fast.
>>
>> Again, these are all workarounds. I discussed this in the Hive forum. There
>> should be a way to "manually compact a transactional table in Hive" (not
>> possible now), and the second point is: if Hive can see the data in the
>> Hive table, why not Spark?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 23:10, Ted Yu  wrote:
>>> +1
>>> 
 On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
 
 If you need to use single inserts, updates, deletes, select why not use 
 hbase with Phoenix? I see it as complementary to the hive / warehouse 
 offering 
 
> On 02 Aug 2016, at 22:34, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I decided to create a catalog table in Hive ORC and transactional. That 
> table has two columns of value
> 
> transactiondescription === account_table.transactiondescription
> hashtag String column created from a semi automated process of deriving 
> it from account_table.transactiondescription
> Once the catalog table has been populated, we just need to create a new DF
> based on a join between the catalog table and the account_table. The join
> will use the hashtag in the catalog table to loop over the debit column in
> account_table for a given hashtag. That is pretty fast, as going through
> pattern matching in real time is pretty intensive in any application or
> database.
> 
> So one can build up the catalog table over time as a reference table. I am
> sure such tables exist in the commercial world.
> 
> Anyway after getting results out I know how I am wasting my money on 
> different things, especially on clothing  etc :)
> 
> 
> HTH
> 
> P.S. Also there is an issue with Spark not being able to read data 
> through Hive transactional tables that have not been compacted yet. Spark 
> just crashes. If these tables need to be updated regularly say catalog 
> table and they are pretty small, one might maintain them in an RDBMS and 
> read them once through JDBC into a DataFrame in Spark before doing 
> analytics.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any 
> loss, damage or destruction of data or any other property which may arise 
> from relying on this email's technical content is explicitly disclaimed. 
> The author will in no case be liable for any monetary damages arising 
> from such loss, damage or destruction.
>  
> 
>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>> Hi Mich,
>> 
>> It 

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke
Phoenix will become another standard query interface for HBase. I do not agree
that using HBase directly will lead to faster performance; it always depends on
how you use it. While it is another component, it can make sense to use it.
This has to be evaluated on a case-by-case basis.
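
For illustration, a minimal sketch of reading an HBase table through Phoenix
into a Spark DataFrame with the phoenix-spark integration (the table name and
ZooKeeper quorum below are placeholders, not anything from this thread):

// Sketch only: assumes the phoenix-spark jar is on the classpath and a
// Phoenix table CATALOG exists over the HBase table.
val catalogDF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "CATALOG", "zkUrl" -> "zkhost:2181"))
  .load()

catalogDF.registerTempTable("catalog")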
If you only want to use Hive, you can do the manual compaction as described in
another answer to this thread.
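
For reference, a sketch of what that manual compaction looks like when issued
through the Hive JDBC driver from Scala (host, credentials and table name are
placeholders; the same statement can simply be run from beeline):

import java.sql.DriverManager

// Ask Hive to schedule a major compaction of the ACID table; the compactor
// rewrites the delta files so that other readers can see the data.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hivehost:10000/accounts", "hduser", "")
val stmt = conn.createStatement()
stmt.execute("ALTER TABLE accounts.catalog COMPACT 'major'")
stmt.close()
conn.close()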

In any case, I recommend that most people use a Hadoop distribution where all
these components are properly integrated and you get support.

> On 03 Aug 2016, at 00:51, Mich Talebzadeh  wrote:
> 
> There are many ways of addressing this issue.
> 
> Using HBase with Phoenix adds another layer to the stack which is not
> necessary for a handful of tables and will add to cost (someone else has to
> know about HBase, Phoenix etc.). BTW, I would rather work directly on the
> HBase table; it is faster.
> 
> There may be, say, 100 new entries into this catalog table, with multiple
> updates (not a single DML statement) to get the hashtag right. Sometimes it
> is an iterative process which results in many deltas.
> 
> If that only needs to be done once a day or on demand, an alternative would
> be to INSERT OVERWRITE the deltas from the transactional Hive table into a
> text table in Hive and present that one to Spark. This allows Spark to see
> the data.
> 
> Remember, if I use Hive to do the analytics/windowing, there is no issue.
> The issue is with Spark: neither Spark SQL nor the Spark shell can use that
> table.
> 
> Sounds like an issue for Spark to resolve later.
> 
> Another alternative: one can leave the transactional table in an RDBMS for
> this purpose and load it into a DF through the JDBC interface. It works
> fine and is pretty fast.
> 
> Again, these are all workarounds. I discussed this in the Hive forum. There
> should be a way to "manually compact a transactional table in Hive" (not
> possible now), and the second point is: if Hive can see the data in the
> Hive table, why not Spark?
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 2 August 2016 at 23:10, Ted Yu  wrote:
>> +1
>> 
>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
>>> 
>>> If you need to use single inserts, updates, deletes, select why not use 
>>> hbase with Phoenix? I see it as complementary to the hive / warehouse 
>>> offering 
>>> 
 On 02 Aug 2016, at 22:34, Mich Talebzadeh  
 wrote:
 
 Hi,
 
 I decided to create a catalog table in Hive ORC and transactional. That 
 table has two columns of value
 
 transactiondescription === account_table.transactiondescription
 hashtag String column created from a semi automated process of deriving it 
 from account_table.transactiondescription
 Once the catalog table has been populated, we just need to create a new DF
 based on a join between the catalog table and the account_table. The join
 will use the hashtag in the catalog table to loop over the debit column in
 account_table for a given hashtag. That is pretty fast, as going through
 pattern matching in real time is pretty intensive in any application or
 database.
 
 So one can build up the catalog table over time as a reference table. I am
 sure such tables exist in the commercial world.
 
 Anyway after getting results out I know how I am wasting my money on 
 different things, especially on clothing  etc :)
 
 
 HTH
 
 P.S. Also there is an issue with Spark not being able to read data through 
 Hive transactional tables that have not been compacted yet. Spark just 
 crashes. If these tables need to be updated regularly say catalog table 
 and they are pretty small, one might maintain them in an RDBMS and read 
 them once through JDBC into a DataFrame in Spark before doing analytics.
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
 
 Disclaimer: Use it at your own risk. Any and all responsibility for any 
 loss, damage or destruction of data or any other property which may arise 
 from relying on this email's technical content is explicitly disclaimed. 
 The author will in no case be liable for any monetary damages arising from 
 such loss, damage or destruction.
  
 
> On 2 August 2016 at 17:56, Sonal Goyal 

Re: Extracting key word from a textual column

2016-08-02 Thread ayan guha
I would stay away from transactional tables until they are fully baked. I do
not see why you need to update rather than keep inserting with a timestamp
and deriving the latest value on the fly while joining.

But I guess it has become a religious question now :) and I am not
unbiased.
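
For what it is worth, a minimal sketch of the "derive the latest value on the
fly" idea with a window function (catalogDF and the load_ts column are
assumptions, not anything from this thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep appending rows with a load timestamp, then pick the newest row per key at read time.
val w = Window.partitionBy($"transactiondescription").orderBy(desc("load_ts"))

val latestCatalog = catalogDF
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")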
On 3 Aug 2016 08:51, "Mich Talebzadeh"  wrote:

> There are many ways of addressing this issue.
>
> Using HBase with Phoenix adds another layer to the stack which is not
> necessary for a handful of tables and will add to cost (someone else has to
> know about HBase, Phoenix etc.). BTW, I would rather work directly on the
> HBase table; it is faster.
>
> There may be, say, 100 new entries into this catalog table, with multiple
> updates (not a single DML statement) to get the hashtag right. Sometimes it
> is an iterative process which results in many deltas.
>
> If that only needs to be done once a day or on demand, an alternative would
> be to INSERT OVERWRITE the deltas from the transactional Hive table into a
> text table in Hive and present that one to Spark. This allows Spark to see
> the data.
>
> Remember, if I use Hive to do the analytics/windowing, there is no issue.
> The issue is with Spark: neither Spark SQL nor the Spark shell can use that
> table.
>
> Sounds like an issue for Spark to resolve later.
>
> Another alternative: one can leave the transactional table in an RDBMS for
> this purpose and load it into a DF through the JDBC interface. It works
> fine and is pretty fast.
>
> Again, these are all workarounds. I discussed this in the Hive forum. There
> should be a way to "manually compact a transactional table in Hive" (not
> possible now), and the second point is: if Hive can see the data in the
> Hive table, why not Spark?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 23:10, Ted Yu  wrote:
>
>> +1
>>
>> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
>>
>> If you need to use single inserts, updates, deletes, select why not use
>> hbase with Phoenix? I see it as complementary to the hive / warehouse
>> offering
>>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive ORC and transactional. That
>> table has two columns of value
>>
>>
>>1. transactiondescription === account_table.transactiondescription
>>2. hashtag String column created from a semi automated process of
>>deriving it from account_table.transactiondescription
>>
>> Once the catalog table has been populated, we just need to create a new DF
>> based on a join between the catalog table and the account_table. The join
>> will use the hashtag in the catalog table to loop over the debit column in
>> account_table for a given hashtag. That is pretty fast, as going through
>> pattern matching in real time is pretty intensive in any application or
>> database.
>>
>> So one can build up the catalog table over time as a reference table. I
>> am sure such tables exist in the commercial world.
>>
>> Anyway after getting results out I know how I am wasting my money on
>> different things, especially on clothing  etc :)
>>
>>
>> HTH
>>
>> P.S. Also there is an issue with Spark not being able to read data
>> through Hive transactional tables that have not been compacted yet. Spark
>> just crashes. If these tables need to be updated regularly say catalog
>> table and they are pretty small, one might maintain them in an RDBMS and
>> read them once through JDBC into a DataFrame in Spark before doing
>> analytics.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>>
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different
>>> representations of an entity - SAINSBURY in this case and matching them all
>>> together. How dirty is your 

Re: Extracting key word from a textual column

2016-08-02 Thread Ted Yu
+1

> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
> 
> If you need to use single inserts, updates, deletes, select why not use hbase 
> with Phoenix? I see it as complementary to the hive / warehouse offering 
> 
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh  wrote:
>> 
>> Hi,
>> 
>> I decided to create a catalog table in Hive ORC and transactional. That 
>> table has two columns of value
>> 
>> transactiondescription === account_table.transactiondescription
>> hashtag String column created from a semi automated process of deriving it 
>> from account_table.transactiondescription
>> Once the catalog table has been populated, we just need to create a new DF
>> based on a join between the catalog table and the account_table. The join
>> will use the hashtag in the catalog table to loop over the debit column in
>> account_table for a given hashtag. That is pretty fast, as going through
>> pattern matching in real time is pretty intensive in any application or
>> database.
>> 
>> So one can build up the catalog table over time as a reference table. I am
>> sure such tables exist in the commercial world.
>> 
>> Anyway after getting results out I know how I am wasting my money on 
>> different things, especially on clothing  etc :)
>> 
>> 
>> HTH
>> 
>> P.S. Also there is an issue with Spark not being able to read data through 
>> Hive transactional tables that have not been compacted yet. Spark just 
>> crashes. If these tables need to be updated regularly say catalog table and 
>> they are pretty small, one might maintain them in an RDBMS and read them 
>> once through JDBC into a DataFrame in Spark before doing analytics.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>>> Hi Mich,
>>> 
>>> It seems like an entity resolution problem - looking at different 
>>> representations of an entity - SAINSBURY in this case and matching them all 
>>> together. How dirty is your data in the description - are there stop words 
>>> like SACAT/SMKT etc you can strip off and get the base retailer entity ?
>>> 
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies 
>>> Reifier at Strata Hadoop World
>>> Reifier at Spark Summit 2015
>>> 
>>> 
>>> 
>>> 
>>> 
 On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
  wrote:
 Thanks.
 
 I believe there is some catalog of companies that I can get and store in a
 table, and match the company name to the transactiondescription column.
 
 That catalog should have sectors in it. For example, company XYZ is under
 Grocers etc., which will make search and grouping much easier.
 
 I believe Spark can do it, though I am generally interested in alternative
 ideas.
 
 
 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
 
 Disclaimer: Use it at your own risk. Any and all responsibility for any 
 loss, damage or destruction of data or any other property which may arise 
 from relying on this email's technical content is explicitly disclaimed. 
 The author will in no case be liable for any monetary damages arising from 
 such loss, damage or destruction.
  
 
> On 2 August 2016 at 16:26, Yong Zhang  wrote:
> Well, if you still want to use window functions for your logic, then you 
> need to derive a new column out, like "catalog", and use it as part of 
> grouping logic.
> 
> 
> Maybe you can use regex for deriving out this new column. The 
> implementation needs to depend on your data in "transactiondescription", 
> and regex gives you the most powerful way to handle your data.
> 
> 
> This is really not a Spark question, but how you process your logic 
> based on the data given.
> 
> 
> Yong
> 
> 
> From: Mich Talebzadeh 
> Sent: Tuesday, August 2, 2016 10:00 AM
> To: user @spark
> Subject: Extracting key word from a textual column
>  
> Hi,
> 
> Need some ideas.
> 
> Summary:
> 
> I am working on a tool to slice and dice the amount of money I have spent 
> so far (meaning the whole data sample) on a given retailer so I have a 
> better idea of where I 

Re: Extracting key word from a textual column

2016-08-02 Thread Mich Talebzadeh
Hi,

I decided to create a catalog table in Hive, as an ORC transactional table.
That table has two columns of value:


   1. transactiondescription === account_table.transactiondescription
   2. hashtag String column created from a semi automated process of
   deriving it from account_table.transactiondescription

Once the catalog table has been populated, we just need to create a new DF
based on a join between the catalog table and the account_table. The join will
use the hashtag in the catalog table to loop over the debit column in
account_table for a given hashtag. That is pretty fast, as going through
pattern matching in real time is pretty intensive in any application or
database.

So one can build up the catalog table over time as a reference table. I am
sure such tables exist in the commercial world.
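
As a rough illustration of that join (a sketch only: the catalog table name is
an assumption, and it presumes Spark can actually see the catalog, e.g. via
the text-table or JDBC workarounds discussed elsewhere in this thread):

import org.apache.spark.sql.functions._

val catalog  = sqlContext.table("accounts.catalog")       // transactiondescription, hashtag
val accounts = sqlContext.table("accounts.ll_18740868")

// Equality join on the description, then aggregate debits per hashtag.
val spentPerTag = accounts
  .join(catalog, Seq("transactiondescription"))
  .where($"transactiontype" === "DEB")
  .groupBy($"hashtag")
  .agg(sum($"debitamount").as("spent"))
  .orderBy(desc("spent"))

spentPerTag.show(20)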

Anyway, once the results are out I will know how I am wasting my money on
different things, especially on clothing etc. :)


HTH

P.S. There is also an issue with Spark not being able to read data from Hive
transactional tables that have not been compacted yet; Spark just crashes. If
tables like the catalog table need to be updated regularly and they are pretty
small, one might maintain them in an RDBMS and read them once through JDBC
into a DataFrame in Spark before doing the analytics.
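
A sketch of that JDBC route (driver, URL and credentials below are
placeholders):

// Read the small, frequently updated catalog table straight from the RDBMS.
val catalogDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")
  .option("dbtable", "scratchpad.catalog")
  .option("user", "scratchpad")
  .option("password", "xxxx")
  .load()

catalogDF.registerTempTable("catalog_rdbms")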


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 17:56, Sonal Goyal  wrote:

> Hi Mich,
>
> It seems like an entity resolution problem - looking at different
> representations of an entity - SAINSBURY in this case and matching them all
> together. How dirty is your data in the description - are there stop words
> like SACAT/SMKT etc you can strip off and get the base retailer entity ?
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh  > wrote:
>
>> Thanks.
>>
>> I believe there is some catalog of companies that I can get and store in a
>> table, and match the company name to the transactiondescription column.
>>
>> That catalog should have sectors in it. For example, company XYZ is under
>> Grocers etc., which will make search and grouping much easier.
>>
>> I believe Spark can do it, though I am generally interested in alternative
>> ideas.
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 16:26, Yong Zhang  wrote:
>>
>>> Well, if you still want to use window functions for your logic, then you
>>> need to derive a new column out, like "catalog", and use it as part of
>>> grouping logic.
>>>
>>>
>>> Maybe you can use regex for deriving out this new column. The
>>> implementation needs to depend on your data in "transactiondescription",
>>> and regex gives you the most powerful way to handle your data.
>>>
>>>
>>> This is really not a Spark question, but how you process your logic
>>> based on the data given.
>>>
>>>
>>> Yong
>>>
>>>
>>> --
>>> *From:* Mich Talebzadeh 
>>> *Sent:* Tuesday, August 2, 2016 10:00 AM
>>> *To:* user @spark
>>> *Subject:* Extracting key word from a textual column
>>>
>>> Hi,
>>>
>>> Need some ideas.
>>>
>>> *Summary:*
>>>
>>> I am working on a tool to slice and dice the amount of money I have
>>> spent so far (meaning the whole data sample) on a given retailer so I have
>>> a better idea of where I am wasting the money
>>>
>>> *Approach*
>>>
>>> Downloaded my bank statements from a given account in csv format from
>>> inception till end of July. Read the data and stored it in ORC table.
>>>
>>> I am interested for 

Re: Extracting key word from a textual column

2016-08-02 Thread Sonal Goyal
Hi Mich,

It seems like an entity resolution problem - looking at different
representations of an entity - SAINSBURY in this case - and matching them all
together. How dirty is your data in the description - are there stop words
like SACAT/SMKT etc. you can strip off to get the base retailer entity?
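
A sketch of that normalisation step (the noise-word list is just what appears
in the sample output quoted later in this thread; variant spellings such as
SAINSBURYS vs SAINSBURY'S would still need a fuzzy/entity-matching pass):

import org.apache.spark.sql.functions._

// Upper-case the description and strip the assumed noise tokens to expose the retailer name.
val noise = "\\b(SACAT|SMKT|S/MKT|CD)\\b"

val withRetailer = ll_18740868.withColumn(
  "retailer",
  trim(regexp_replace(upper($"transactiondescription"), noise, " ")))

withRetailer.groupBy("retailer").agg(sum("debitamount").as("spent")).orderBy(desc("spent")).show(20)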

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
wrote:

> Thanks.
>
> I believe there is some catalog of companies that I can get and store in a
> table, and match the company name to the transactiondescription column.
>
> That catalog should have sectors in it. For example, company XYZ is under
> Grocers etc., which will make search and grouping much easier.
>
> I believe Spark can do it, though I am generally interested in alternative
> ideas.
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 16:26, Yong Zhang  wrote:
>
>> Well, if you still want to use window functions for your logic, then you
>> need to derive a new column out, like "catalog", and use it as part of
>> grouping logic.
>>
>>
>> Maybe you can use regex for deriving out this new column. The
>> implementation needs to depend on your data in "transactiondescription",
>> and regex gives you the most powerful way to handle your data.
>>
>>
>> This is really not a Spark question, but how you process your logic
>> based on the data given.
>>
>>
>> Yong
>>
>>
>> --
>> *From:* Mich Talebzadeh 
>> *Sent:* Tuesday, August 2, 2016 10:00 AM
>> *To:* user @spark
>> *Subject:* Extracting key word from a textual column
>>
>> Hi,
>>
>> Need some ideas.
>>
>> *Summary:*
>>
>> I am working on a tool to slice and dice the amount of money I have spent
>> so far (meaning the whole data sample) on a given retailer so I have a
>> better idea of where I am wasting the money
>>
>> *Approach*
>>
>> Downloaded my bank statements from a given account in csv format from
>> inception till end of July. Read the data and stored it in ORC table.
>>
>> I am interested in all bills that I paid using a debit card
>> (transactiontype = "DEB") that come out of the account directly.
>> Transactiontype is the three-character lookup code that I downloaded as well.
>>
>> scala> ll_18740868.printSchema
>> root
>>  |-- transactiondate: date (nullable = true)
>>  |-- transactiontype: string (nullable = true)
>>  |-- sortcode: string (nullable = true)
>>  |-- accountnumber: string (nullable = true)
>>  |-- transactiondescription: string (nullable = true)
>>  |-- debitamount: double (nullable = true)
>>  |-- creditamount: double (nullable = true)
>>  |-- balance: double (nullable = true)
>>
>> The important fields are transactiondate, transactiontype,
>> transactiondescription and debitamount
>>
>> So using analytics/windowing I can do all sorts of things. For example
>> this one gives me the last time I spent money on retailer XYZ and the amount
>>
>> SELECT *
>> FROM (
>>   select transactiondate, transactiondescription, debitamount
>>   , rank() over (order by transactiondate desc) AS rank
>>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>  ) tmp
>> where rank <= 1
>>
>> And its equivalent using Windowing in FP
>>
>> import org.apache.spark.sql.expressions.Window
>> val wSpec =
>> Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
>> rank().over(wSpec).as("rank")).filter($"rank"===1).show
>>
>>
>> +---------------+----------------------+----+
>> |transactiondate|transactiondescription|rank|
>> +---------------+----------------------+----+
>> |     2015-12-15|       XYZ LTD CD 4636|   1|
>> +---------------+----------------------+----+
>>
>> So far so good. But if I want to find all I spent on each retailer, then
>> it gets trickier as a retailer appears like below in the column
>> transactiondescription:
>>
>>
>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>> 

Re: Extracting key word from a textual column

2016-08-02 Thread Mich Talebzadeh
Thanks.

I believe there is some catalog of companies that I can get and store in a
table, and match the company name to the transactiondescription column.

That catalog should have sectors in it. For example, company XYZ is under
Grocers etc., which will make search and grouping much easier.

I believe Spark can do it, though I am generally interested in alternative
ideas.
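
A rough sketch of how such a reference catalog could be matched against the
descriptions (the companies table and its columns are assumptions; a
contains-style join is effectively a cartesian product, so it only makes
sense for a small catalog):

import org.apache.spark.sql.functions._

val companies = sqlContext.table("accounts.companies")   // assumed columns: name, sector

// Crude name match: does the transaction description contain the company name?
val tagged = ll_18740868.join(
  companies,
  upper($"transactiondescription").contains(upper($"name")),
  "left_outer")

tagged.groupBy($"sector").agg(sum($"debitamount").as("spent")).orderBy(desc("spent")).show()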





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 16:26, Yong Zhang  wrote:

> Well, if you still want to use window functions for your logic, then you
> need to derive a new column out, like "catalog", and use it as part of
> grouping logic.
>
>
> Maybe you can use regex for deriving out this new column. The
> implementation needs to depend on your data in "transactiondescription",
> and regex gives you the most powerful way to handle your data.
>
>
> This is really not a Spark question, but how you process your logic
> based on the data given.
>
>
> Yong
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* Tuesday, August 2, 2016 10:00 AM
> *To:* user @spark
> *Subject:* Extracting key word from a textual column
>
> Hi,
>
> Need some ideas.
>
> *Summary:*
>
> I am working on a tool to slice and dice the amount of money I have spent
> so far (meaning the whole data sample) on a given retailer so I have a
> better idea of where I am wasting the money
>
> *Approach*
>
> Downloaded my bank statements from a given account in csv format from
> inception till end of July. Read the data and stored it in ORC table.
>
> I am interested in all bills that I paid using a debit card
> (transactiontype = "DEB") that come out of the account directly.
> Transactiontype is the three-character lookup code that I downloaded as well.
>
> scala> ll_18740868.printSchema
> root
>  |-- transactiondate: date (nullable = true)
>  |-- transactiontype: string (nullable = true)
>  |-- sortcode: string (nullable = true)
>  |-- accountnumber: string (nullable = true)
>  |-- transactiondescription: string (nullable = true)
>  |-- debitamount: double (nullable = true)
>  |-- creditamount: double (nullable = true)
>  |-- balance: double (nullable = true)
>
> The important fields are transactiondate, transactiontype,
> transactiondescription and debitamount
>
> So using analytics/windowing I can do all sorts of things. For example
> this one gives me the last time I spent money on retailer XYZ and the amount
>
> SELECT *
> FROM (
>   select transactiondate, transactiondescription, debitamount
>   , rank() over (order by transactiondate desc) AS rank
>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>  ) tmp
> where rank <= 1
>
> And its equivalent using Windowing in FP
>
> import org.apache.spark.sql.expressions.Window
> val wSpec =
> Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
> rank().over(wSpec).as("rank")).filter($"rank"===1).show
>
>
> +---------------+----------------------+----+
> |transactiondate|transactiondescription|rank|
> +---------------+----------------------+----+
> |     2015-12-15|       XYZ LTD CD 4636|   1|
> +---------------+----------------------+----+
>
> So far so good. But if I want to find all I spent on each retailer, then
> it gets trickier as a retailer appears like below in the column
> transactiondescription:
>
>
> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
> +----------------------+
> |transactiondescription|
> +----------------------+
> |  SAINSBURYS SMKT C...|
> |  SACAT SAINSBURYS ...|
> |  SAINSBURY'S SMKT ...|
> |  SAINSBURYS S/MKT ...|
> |  SACAT SAINSBURYS ...|
> +----------------------+
>
> If I look at them I know they all belong to SAINSBURYS (the food retailer).
> I have done some crude grouping and it works, somehow
>
> //define UDF here to handle substring
> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
> s.substring(start, end) }
> var cutoff = "CD"  // currently used in the statement
> val wSpec2 =
> Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
> cutoff)-1))
> ll_18740868.where($"transactiontype" === "DEB" &&
> 

Re: Extracting key word from a textual column

2016-08-02 Thread Yong Zhang
Well, if you still want to use window functions for your logic, then you need
to derive a new column, like "catalog", and use it as part of the grouping
logic.


Maybe you can use a regex to derive this new column. The implementation needs
to depend on your data in "transactiondescription", and regex gives you the
most powerful way to handle your data.
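
A minimal sketch of that (the alternation in the regex is a made-up example;
the real pattern has to come from your data):

import org.apache.spark.sql.functions._

// Derive a "catalog" column from the description and group on it.
val withCatalog = ll_18740868.withColumn(
  "catalog",
  regexp_extract(upper($"transactiondescription"), "(SAINSBURY|TESCO|ASDA)", 1))

withCatalog
  .where($"transactiontype" === "DEB")
  .groupBy("catalog")
  .agg(sum("debitamount").as("spent"))
  .orderBy(desc("spent"))
  .show()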


This is really not a Spark question, but a question of how you process your
logic based on the data given.


Yong



From: Mich Talebzadeh 
Sent: Tuesday, August 2, 2016 10:00 AM
To: user @spark
Subject: Extracting key word from a textual column

Hi,

Need some ideas.

Summary:

I am working on a tool to slice and dice the amount of money I have spent so 
far (meaning the whole data sample) on a given retailer so I have a better idea 
of where I am wasting the money

Approach

Downloaded my bank statements from a given account in CSV format from
inception till the end of July. Read the data and stored it in an ORC table.

I am interested in all bills that I paid using a debit card (transactiontype =
"DEB") that come out of the account directly. Transactiontype is the
three-character lookup code that I downloaded as well.

scala> ll_18740868.printSchema
root
 |-- transactiondate: date (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: double (nullable = true)
 |-- creditamount: double (nullable = true)
 |-- balance: double (nullable = true)

The important fields are transactiondate, transactiontype, 
transactiondescription and debitamount

So using analytics/windowing I can do all sorts of things. For example, this
one gives me the last time I spent money on retailer XYZ and the amount:

SELECT *
FROM (
  select transactiondate, transactiondescription, debitamount
  , rank() over (order by transactiondate desc) AS rank
  from accounts.ll_18740868 where transactiondescription like '%XYZ%'
 ) tmp
where rank <= 1

And its equivalent using Windowing in FP

import org.apache.spark.sql.expressions.Window
val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
ll_18740868.filter(col("transactiondescription").contains("XYZ"))
  .select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank"))
  .filter($"rank" === 1)
  .show


+---------------+----------------------+----+
|transactiondate|transactiondescription|rank|
+---------------+----------------------+----+
|     2015-12-15|       XYZ LTD CD 4636|   1|
+---------------+----------------------+----+

So far so good. But if I want to find all I spent on each retailer, then it 
gets trickier as a retailer appears like below in the column 
transactiondescription:

ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
+----------------------+
|transactiondescription|
+----------------------+
|  SAINSBURYS SMKT C...|
|  SACAT SAINSBURYS ...|
|  SAINSBURY'S SMKT ...|
|  SAINSBURYS S/MKT ...|
|  SACAT SAINSBURYS ...|
+----------------------+

If I look at them I know they all belong to SAINSBURYS (the food retailer). I
have done some crude grouping and it works, somehow:

// define UDF here to handle substring
val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
var cutoff = "CD"  // currently used in the statement
val wSpec2 = Window.partitionBy(
  SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
ll_18740868
  .where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull)
  .select(
    SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"),
    sum($"debitamount").over(wSpec2).as("Spent"))
  .distinct
  .orderBy($"Spent")
  .collect
  .foreach(println)

However, I really need to extract the "keyword" retailer name from the
transactiondescription column, and I need some ideas about the best way of
doing it. Is this possible in Spark?


Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.