I would stay away from transactional tables until they are fully baked. I do not see why you need to update rather than keep inserting with a timestamp and derive the latest value on the fly while joining.

But I guess it has become a religious question now :) and I am not unbiased.
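By way of illustration, a minimal spark-shell sketch of that insert-only pattern, assuming a hypothetical append-only table accounts.catalog_log with columns hashtag, transactiondescription and ts, where every change is a new timestamped row (the table name and columns are assumptions, not from the thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// hypothetical append-only log: one row per change, stamped with ts
val catalogLog = sqlContext.table("accounts.catalog_log")

// derive the latest version of each hashtag at query time instead of via UPDATE
val latestPerKey = Window.partitionBy("hashtag").orderBy(desc("ts"))
val catalogCurrent = catalogLog
  .withColumn("rn", row_number().over(latestPerKey))
  .where($"rn" === 1)
  .drop("rn")

The same "current" view can be produced on the Hive side with row_number() over (partition by hashtag order by ts desc), so both engines read the same picture of the log.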
On 3 Aug 2016 08:51, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> There are many ways of addressing this issue.
>
> Using Hbase with Phoenix adds another layer to the stack which is not necessary for a handful of tables and will add to cost (someone else has to know about Hbase, Phoenix etc.). BTW, I would rather work directly on the Hbase table; it is faster.
>
> There may be, say, 100 new entries into this catalog table with multiple updates (not a single DML) to get the hashtag right. Sometimes it is an iterative process which results in many deltas.
>
> If that needs to be done once a day or on demand, an alternative would be to insert overwrite from the transactional Hive table, with its deltas, into a text table in Hive and present that one to Spark. This allows Spark to see the data.
>
> Remember, if I use Hive to do the analytics/windowing, there is no issue. The issue is with Spark: neither Spark SQL nor the Spark shell can use that table.
>
> Sounds like an issue for Spark to resolve later.
>
> Another alternative: one can leave the transactional table in an RDBMS for this purpose and load it into a DF through the JDBC interface. It works fine and is pretty fast.
>
> Again, these are all workarounds. I discussed this in the Hive forum. There should be a way to manually compact a transactional table in Hive (not possible now), and, second point, if Hive can see the data in a Hive table, why not Spark?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 2 August 2016 at 23:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> +1
>>
>> On Aug 2, 2016, at 2:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> If you need to use single inserts, updates, deletes, selects, why not use Hbase with Phoenix? I see it as complementary to the Hive / warehouse offering.
>>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive, ORC and transactional. That table has two columns of value:
>>
>> 1. transactiondescription === account_table.transactiondescription
>> 2. hashtag, a String column created from a semi-automated process of deriving it from account_table.transactiondescription
>>
>> Once the process of populating the catalog table is complete, we just need to create a new DF based on a join between the catalog table and the account_table. The join will use the hashtag in the catalog table to loop over the debit column in account_table for a given hashtag. That is pretty fast, as going through pattern matching in real time is intensive in any application and database.
>>
>> So one can build up the catalog table over time as a reference table. I am sure such tables exist in the commercial world.
>>
>> Anyway, after getting the results out I know how I am wasting my money on different things, especially on clothing etc :)
>>
>> HTH
>>
>> P.S. Also there is an issue with Spark not being able to read data through Hive transactional tables that have not been compacted yet; Spark just crashes. If these tables need to be updated regularly, say the catalog table, and they are pretty small, one might maintain them in an RDBMS and read them once through JDBC into a DataFrame in Spark before doing the analytics.
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
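A rough sketch of the JDBC workaround in that P.S., for a small catalog table kept in an RDBMS; the driver URL, credentials and source table below are placeholders rather than details from the thread:

// hypothetical JDBC source holding the small, frequently-updated catalog table;
// the matching JDBC driver jar must be on the classpath
val catalogDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // placeholder URL
  .option("dbtable", "scratchpad.catalog")                 // placeholder schema.table
  .option("user", "scratchpad")
  .option("password", "xxxx")
  .load()

// expose it to Spark SQL alongside the ORC tables
catalogDF.registerTempTable("catalog")

Since the table is tiny, the read is cheap and can simply be repeated whenever the catalog changes.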
>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case - and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc. that you can strip off to get the base retailer entity?
>>>
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies <http://www.nubetech.co>
>>> Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
>>> Reifier at Spark Summit 2015 <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks.
>>>>
>>>> I believe there is some catalog of companies that I can get, store in a table and match the company name to the transactiondescription column.
>>>>
>>>> That catalog should have sectors in it. For example, company XYZ is under Grocers etc., which will make search and grouping much easier.
>>>>
>>>> I believe Spark can do it, though I am generally interested in alternative ideas.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>>
>>>>> Well, if you still want to use a window function for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic.
>>>>>
>>>>> Maybe you can use a regex for deriving this new column. The implementation needs to depend on your data in "transactiondescription", and a regex gives you the most powerful way to handle it.
>>>>>
>>>>> This is really not a Spark question, but about how you process your logic based on the data given.
>>>>>
>>>>> Yong
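A sketch of that regex idea in spark-shell, using the ll_18740868 DataFrame from the original post below. The pattern is only illustrative - it keeps whatever precedes a trailing "CD nnnn" card marker - and would need tuning against the real descriptions:

import org.apache.spark.sql.functions._

// derive a crude "catalog" (retailer) column from transactiondescription;
// rows that do not match this illustrative pattern come back as an empty string
val withCatalog = ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("catalog", regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\s+\\d+", 1))

// use the derived column as the grouping key
withCatalog.groupBy("catalog").agg(sum($"debitamount").as("Spent")).orderBy(desc("Spent")).show(20)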
>>>>> ------------------------------
>>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>>> To: user @spark
>>>>> Subject: Extracting key word from a textual column
>>>>>
>>>>> Hi,
>>>>>
>>>>> Need some ideas.
>>>>>
>>>>> Summary:
>>>>>
>>>>> I am working on a tool to slice and dice the amount of money I have spent so far (meaning the whole data sample) at a given retailer, so I have a better idea of where I am wasting the money.
>>>>>
>>>>> Approach:
>>>>>
>>>>> I downloaded my bank statements for a given account in CSV format, from inception till the end of July, read the data and stored it in an ORC table.
>>>>>
>>>>> I am interested in all bills that I paid using a debit card (transactiontype = "DEB"), i.e. those that come out of the account directly. Transactiontype is a three-character lookup code that I download as well.
>>>>>
>>>>> scala> ll_18740868.printSchema
>>>>> root
>>>>>  |-- transactiondate: date (nullable = true)
>>>>>  |-- transactiontype: string (nullable = true)
>>>>>  |-- sortcode: string (nullable = true)
>>>>>  |-- accountnumber: string (nullable = true)
>>>>>  |-- transactiondescription: string (nullable = true)
>>>>>  |-- debitamount: double (nullable = true)
>>>>>  |-- creditamount: double (nullable = true)
>>>>>  |-- balance: double (nullable = true)
>>>>>
>>>>> The important fields are transactiondate, transactiontype, transactiondescription and debitamount.
>>>>>
>>>>> So using analytics/windowing I can do all sorts of things. For example, this one gives me the last time I spent money at retailer XYZ and the amount:
>>>>>
>>>>> SELECT *
>>>>> FROM (
>>>>>        select transactiondate, transactiondescription, debitamount
>>>>>             , rank() over (order by transactiondate desc) AS rank
>>>>>        from accounts.ll_18740868
>>>>>        where transactiondescription like '%XYZ%'
>>>>>      ) tmp
>>>>> where rank <= 1
>>>>>
>>>>> And its equivalent using windowing in FP:
>>>>>
>>>>> import org.apache.spark.sql.expressions.Window
>>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).filter($"rank" === 1).show
>>>>>
>>>>> +---------------+----------------------+----+
>>>>> |transactiondate|transactiondescription|rank|
>>>>> +---------------+----------------------+----+
>>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>>> +---------------+----------------------+----+
>>>>>
>>>>> So far so good. But if I want to find all I spent at each retailer, then it gets trickier, as a retailer appears like below in the column transactiondescription:
>>>>>
>>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>>> +----------------------+
>>>>> |transactiondescription|
>>>>> +----------------------+
>>>>> |  SAINSBURYS SMKT C...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> |  SAINSBURY'S SMKT ...|
>>>>> |  SAINSBURYS S/MKT ...|
>>>>> |  SACAT SAINSBURYS ...|
>>>>> +----------------------+
>>>>>
>>>>> If I look at them I know they all belong to SAINSBURYS (the food retailer).
>>>>> I have done some crude grouping and it works, somehow:
>>>>>
>>>>> // define UDF here to handle the substring
>>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>>> var cutoff = "CD"   // currently used in the statement
>>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1))
>>>>> ll_18740868.where($"transactiontype" === "DEB" && ($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription", lit(0), instr($"transactiondescription", cutoff) - 1).as("Retailer"), sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)
>>>>>
>>>>> However, I really need to extract the "keyword" retailer name from the transactiondescription column, and I need some ideas about the best way of doing it. Is this possible in Spark?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> http://talebzadehmich.wordpress.com
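For completeness, a sketch of the catalog/reference-table approach described at the top of the thread, assuming the catalog ends up in a table accounts.catalog with the two columns mentioned there (transactiondescription and hashtag); the table name itself is an assumption:

import org.apache.spark.sql.functions._

// hypothetical catalog table: one row per known transactiondescription,
// mapped to a hashtag such as "SAINSBURYS"
val catalog = sqlContext.table("accounts.catalog")

val debits = ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)

// equi-join on transactiondescription, then total spend per hashtag;
// broadcast() is reasonable here because the catalog table is small
val spentByTag = debits
  .join(broadcast(catalog), "transactiondescription")
  .groupBy("hashtag")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))

spentByTag.show(20)

Descriptions with no entry yet in the catalog simply drop out of the inner join; a left outer join instead would show what still needs tagging.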