[ 
https://issues.apache.org/jira/browse/SPARK-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350120#comment-16350120
 ] 

Arseniy Tashoyan commented on SPARK-23269:
------------------------------------------

Sure, I can query, but this requires a join with the transactions dataset 
followed by a groupBy, which means additional reshuffling.

For reference, here is how I find last occurrences for patterns:
{code}
(frequentPatterns crossJoin transactions)
  .filter { row =>
    val transactionItems = ... // items from transactions
    val patternItems = ... // items from frequentPatterns
    // Keep only rows where a transaction has all items of a pattern
    containsAll(transactionItems, patternItems)
  }
  .select("items", "freq", "timestamp")
  .groupBy("items")
  .agg(
    last("freq", ignoreNulls = true) as "freq",
    last("timestamp", ignoreNulls = true) as "lastOccurrence"
  )
{code}
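A more complete version of the same idea could look like the sketch below. It is only an illustration under several assumptions: items are integers, the transactions data frame has columns named "items" and "timestamp", the patterns data frame has "items" and "freq", and the sample rows in main are made up. The names LastOccurrenceSketch, lastOccurrences, containsAll and tx_items are hypothetical. It also swaps last for max on the timestamp, so the result does not depend on the order of rows inside each group.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, first, max, udf}

object LastOccurrenceSketch {

  // True when every item of the pattern is present in the transaction.
  // Assumption: items are integers.
  private val containsAll = udf { (transactionItems: Seq[Int], patternItems: Seq[Int]) =>
    patternItems.forall(transactionItems.contains)
  }

  // Assumed schemas (hypothetical):
  //   frequentPatterns: items (array<int>), freq (long)
  //   transactions:     items (array<int>), timestamp (timestamp)
  def lastOccurrences(frequentPatterns: DataFrame, transactions: DataFrame): DataFrame = {
    // Rename the transaction items so the two "items" columns do not clash.
    val tx = transactions.select(col("items").as("tx_items"), col("timestamp"))

    frequentPatterns
      .crossJoin(tx)
      // Keep only rows where the transaction has all items of the pattern.
      .filter(containsAll(col("tx_items"), col("items")))
      .groupBy("items")
      .agg(
        // freq is identical for all rows of one pattern, so any value will do.
        first("freq").as("freq"),
        // max picks the latest timestamp regardless of row order.
        max("timestamp").as("lastOccurrence")
      )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("last-occurrence-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val transactions = Seq(
      (Seq(1, 5), Timestamp.valueOf("2018-01-01 12:00:00")),
      (Seq(5, 1, 2), Timestamp.valueOf("2018-01-01 12:10:00")),
      (Seq(1, 5, 3), Timestamp.valueOf("2018-01-01 12:15:00")),
      (Seq(2, 3), Timestamp.valueOf("2018-01-01 12:30:00"))
    ).toDF("items", "timestamp")

    val frequentPatterns = Seq(
      (Seq(5), 3L),
      (Seq(5, 1), 3L)
    ).toDF("items", "freq")

    lastOccurrences(frequentPatterns, transactions).show(truncate = false)
    spark.stop()
  }
}
```

The cross join followed by groupBy is exactly the extra shuffle mentioned above; the sketch makes the cost visible, it does not avoid it.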
 

> FP-growth: Provide last transaction for each detected frequent pattern
> ----------------------------------------------------------------------
>
>                 Key: SPARK-23269
>                 URL: https://issues.apache.org/jira/browse/SPARK-23269
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.1
>            Reporter: Arseniy Tashoyan
>            Priority: Minor
>              Labels: MLlib, fp-growth
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The FP-growth implementation gives patterns and their frequencies:
> _model.freqItemsets_:
> ||items||freq||
> |[5]|3|
> |[5, 1]|3|
> It would be great to know when each pattern last occurred: which is the 
> last transaction containing this pattern?
> To do so, FPGrowth would need to be told which column of the transactions 
> data frame holds the timestamp:
> {code:java}
> val fpgrowth = new FPGrowth()
>   .setItemsCol("items")
>   .setTimestampCol("timestamp")
> {code}
> So the data frame with patterns could look like:
> ||items||freq||lastOccurrence||
> |[5]|3|2018-01-01 12:15:00|
> |[5, 1]|3|2018-01-01 12:15:00|
> Without this functionality, it is necessary to traverse the transactions data 
> frame with the set of detected patterns and determine the last transaction 
> for each pattern. Why traverse the transactions again when FP-growth has 
> already done so during its execution?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
