[ https://issues.apache.org/jira/browse/SPARK-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350120#comment-16350120 ]
Arseniy Tashoyan commented on SPARK-23269:
------------------------------------------

Sure, I can query, but this requires a join with the transactions dataset followed by a groupBy, which means additional reshuffling.
For reference, here is how I find the last occurrences for patterns:
{code}
(frequentPatterns crossJoin transactions)
  .filter { row =>
    val transactionItems = ... // items from transactions
    val patternItems = ... // items from frequentPatterns
    // Keep only rows where a transaction has all items of a pattern
    containsAll(transactionItems, patternItems)
  }
  .select("items", "freq", "transaction_timestamp")
  .groupBy("items")
  .agg(
    last("freq", ignoreNulls = true) as "freq",
    // Aggregate the column that was actually selected above.
    // Note: last() depends on row order after the shuffle; max() would be deterministic.
    last("transaction_timestamp", ignoreNulls = true) as "lastOccurrence"
  )
{code}

> FP-growth: Provide last transaction for each detected frequent pattern
> ----------------------------------------------------------------------
>
>                 Key: SPARK-23269
>                 URL: https://issues.apache.org/jira/browse/SPARK-23269
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.2.1
>            Reporter: Arseniy Tashoyan
>            Priority: Minor
>              Labels: MLlib, fp-growth
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The FP-growth implementation gives patterns and their frequencies:
> _model.freqItemsets_:
> ||items||freq||
> |[5]|3|
> |[5, 1]|3|
> It would be great to know when each pattern last occurred: what is the
> last transaction containing this pattern?
> To do so, it will be necessary to tell FPGrowth which column of the
> transactions data frame holds the timestamp:
> {code:java}
> val fpgrowth = new FPGrowth()
>   .setItemsCol("items")
>   .setTimestampCol("timestamp")
> {code}
> So the data frame with patterns could look like:
> ||items||freq||lastOccurrence||
> |[5]|3|2018-01-01 12:15:00|
> |[5, 1]|3|2018-01-01 12:15:00|
> Without this functionality, it is necessary to traverse the transactions data
> frame with the set of detected patterns and determine the last transaction
> for each pattern.
> Why traverse the transactions again if this has already been done
> during FP-growth execution?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
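
To make the requested "lastOccurrence" semantics concrete, here is a minimal in-memory sketch in plain Scala collections. It is not Spark code and not part of FPGrowth: the names Transaction, Pattern, LastOccurrence, lastOccurrences and the sample data are all hypothetical, and timestamps are modelled as plain Long values for simplicity. It only illustrates the rule described above: for each pattern, take the latest timestamp among the transactions that contain all of the pattern's items.

```scala
// Hypothetical in-memory model; these types are illustrative and not part of Spark.
final case class Transaction(items: Set[Int], timestamp: Long)
final case class Pattern(items: Set[Int], freq: Long)

object LastOccurrence {
  // For each pattern, return the latest timestamp among the transactions
  // that contain every item of the pattern (None if no transaction matches).
  def lastOccurrences(
      patterns: Seq[Pattern],
      transactions: Seq[Transaction]): Map[Set[Int], Option[Long]] =
    patterns.map { p =>
      val matching = transactions.collect {
        case t if p.items.subsetOf(t.items) => t.timestamp
      }
      p.items -> matching.reduceOption(_ max _)
    }.toMap

  // Made-up sample data, chosen only to exercise the function.
  val sampleTransactions: Seq[Transaction] = Seq(
    Transaction(Set(5, 1), 10L),
    Transaction(Set(5), 40L),
    Transaction(Set(5, 1, 2), 30L),
    Transaction(Set(2), 50L)
  )

  val samplePatterns: Seq[Pattern] = Seq(
    Pattern(Set(5), 3L),
    Pattern(Set(5, 1), 2L)
  )
}
```

With the sample data, LastOccurrence.lastOccurrences(LastOccurrence.samplePatterns, LastOccurrence.sampleTransactions) maps Set(5) to Some(40) and Set(5, 1) to Some(30). Taking the maximum timestamp rather than the "last" row makes the result independent of row order, which is the property the workaround in the comment has to rely on after a shuffle.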